One challenge that large organisations face today is the problem of understanding and predicting which employees are going to leave the business, called employee turnover or workforce attrition.
Employees do not always participate in offboarding processes, may not be truly forthcoming in the HR exit interview, and by the time the exit interview comes around it’s too late to address the issues which caused the employee to leave in the first place.
Furthermore, if you have a large workforce, then you may want to be able to predict which employees are at risk of leaving at any given time, how long they are expected to stay, and get a hint of which interventions may have a chance of reducing attrition.
Fortunately most organisations today will have some form of employee database. This can be a gold mine for data scientists who want to predict or explain employee turnover.
This problem is a little trickier than predicting customer spend. Any employee database is going to contain highly sensitive information. If you are in the UK or EU, the GDPR limits the type of analysis you can run on an employee database, the actions you are allowed to take based on employee data, and even the technology you can use. You may not be able to use external storage and processing services such as cloud platforms, and if you do you may be restricted to European servers.
So how would we go about predicting employee attrition with data science?
Typically your organisation will have a disparate set of databases, such as salary records, employee addresses, and onboarding and recruitment records, probably maintained by different departments.
The first step would be to find a way to unify the datasets so that for every current and past employee you can easily access all data about them. You want to know when someone joined the organisation, when they advanced pay grades, and when they left.
Here are a couple of points about data strategy which an organisation can adopt to make this kind of data science project easier:
- ideally you have an employee ID used across the organisation so that employees can be tracked. Names can be misspelt or change on marriage, and National Insurance or Social Security numbers can come with privacy and portability issues.
- ideally the organisation would have set up their databases, so that when an employee moves house and updates their address for example, the old address and new address and date of the update would all be stored. Likewise you should store pay grade history rather than just current pay grade.
Once you have worked out how to join each employee’s records together, the next step is to transform the data into a single flat table, which is the easiest format to feed into a machine learning algorithm.
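As a minimal sketch of that unification step in pandas (the table and column names here are illustrative, not from any real schema), each departmental extract can be left-joined onto a master employee list using the shared ID:

```python
import pandas as pd

# Hypothetical extracts from two departmental databases, both keyed on
# a shared employee ID (all names are illustrative).
employees = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "join_date": pd.to_datetime(["2015-03-01", "2017-06-15", "2018-01-10"]),
})
salaries = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "pay_grade": [4, 2, 3],
})

# One left join per source table builds up a single flat row per employee.
flat = employees.merge(salaries, on="employee_id", how="left")
```

Repeating the merge for each source table leaves you with one row per employee containing everything known about them.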
The snapshot approach: a classification model
There are many ways of transforming your employee data into a single table but here is one of the simplest:
You create a single table representing every employee present in the organisation on 1 January 2019, with columns for values such as the time they have spent in the organisation, and a final column set to TRUE or FALSE (a Boolean value), indicating whether they left the organisation by 31 January or not. This can be your training data.
Then create the same snapshot table for 1 January 2020, which can be your test data.
You can be creative with your columns. For example if you have the employees’ home address on the snapshot date then you can calculate their distance to the office, travel time, etc. The important thing is that all values in your table should be the values at the date of the snapshot, so the distance in the training table should be the distance from the office on 1 Jan 2019 to their home on 1 Jan 2019, and likewise the age should be the employee’s age on that date.
This latter point is quite tricky to get right, and if you are doing the joining operation in SQL you will need to use window functions.
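If you pull the data into pandas instead, the point-in-time join has a direct analogue in `pd.merge_asof`, which picks the latest history record on or before each snapshot date. A small sketch, with illustrative column names:

```python
import pandas as pd

# Address history: every change is stored with its effective date,
# as recommended above (hypothetical schema).
addresses = pd.DataFrame({
    "employee_id": [101, 101, 102],
    "effective_date": pd.to_datetime(["2016-01-01", "2018-07-01", "2017-06-15"]),
    "postcode": ["AB1 2CD", "EF3 4GH", "IJ5 6KL"],
}).sort_values("effective_date")

# One snapshot row per employee, dated 1 January 2019.
snapshot = pd.DataFrame({
    "employee_id": [101, 102],
    "snapshot_date": pd.to_datetime(["2019-01-01", "2019-01-01"]),
}).sort_values("snapshot_date")

# For each employee, take the latest address record on or before the
# snapshot date -- the pandas equivalent of a SQL window-function join.
result = pd.merge_asof(
    snapshot, addresses,
    left_on="snapshot_date", right_on="effective_date",
    by="employee_id", direction="backward",
)
```

Employee 101 changed address in July 2018, so the 1 January 2019 snapshot correctly picks up the newer postcode rather than the 2016 one.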
Predicting the turnover
Now you have both tables, you can feed them into a machine learning algorithm of your choice. You tell the algorithm two things:

- I know the employee’s age, distance to the office, time at pay grade and time in the organisation
- I want to predict the employee’s turnover over the following month
If you like Python I recommend trying a Random Forest or Gradient Boosted Trees, or you can use a cloud-based AutoML tool such as Microsoft Azure or Google Cloud Platform. There is a plethora of tutorials available to get you started.
Make sure to exclude the Employee ID from the analysis, since otherwise you run the risk of your model just memorising which employees have left!
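Putting those two points together, here is a minimal training sketch in scikit-learn. The snapshot table is a toy stand-in with illustrative column names; note the ID is dropped before fitting:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# A toy 2019 snapshot table (column names are illustrative);
# "left_within_month" is the Boolean target column.
train = pd.DataFrame({
    "employee_id": [101, 102, 103, 104],
    "age": [34, 45, 29, 52],
    "distance_km": [12.5, 3.2, 40.1, 8.7],
    "months_at_grade": [18, 60, 6, 30],
    "months_in_org": [46, 120, 12, 90],
    "left_within_month": [False, False, True, False],
})

# Drop the ID before training so the model cannot simply memorise who left.
X = train.drop(columns=["employee_id", "left_within_month"])
y = train["left_within_month"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
```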
You can train your model on the 2019 data and evaluate on the 2020 data. If it performs well then you know your model is robust enough to learn patterns and apply them on a cohort of employees a year in the future.
This means that you can analyse the current cohort of employees and produce a ranking of those most at risk of leaving, allowing your human resources department to target retention measures effectively.
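Producing that ranking is a one-liner once the model is trained: score each current employee with `predict_proba` and sort. A self-contained sketch with toy data and illustrative feature names:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for a model trained on the 2019 snapshot.
X_train = pd.DataFrame({"age": [34, 45, 29, 52],
                        "distance_km": [12.5, 3.2, 40.1, 8.7]})
y_train = [False, False, True, False]
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Current cohort: score each employee and rank by predicted attrition risk.
current = pd.DataFrame({"employee_id": [201, 202, 203],
                        "age": [31, 48, 27],
                        "distance_km": [35.0, 5.0, 42.0]})
risk = model.predict_proba(current[["age", "distance_km"]])[:, 1]
ranking = current.assign(attrition_risk=risk).sort_values(
    "attrition_risk", ascending=False)
```

The top of `ranking` is the shortlist HR would look at first.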
(In practice I would not just make a snapshot on a single date but rather take various snapshots throughout the year, trying to keep even amounts of data from every month to eliminate bias from seasonal effects. There is no reason to limit the monitored period to 1 month either – you can always train it to predict attrition in the next year or decade if you have enough data.)
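Generating those multiple snapshots is mostly a matter of looping a snapshot-building function over a set of dates. A sketch, where `make_snapshot` is a hypothetical placeholder for the point-in-time join described earlier:

```python
import pandas as pd

# One snapshot per month across 2019 ("MS" = month start), so each month
# contributes equally and seasonal effects are balanced out.
snapshot_dates = pd.date_range("2019-01-01", "2019-12-01", freq="MS")

def make_snapshot(date):
    # Placeholder: in a real pipeline this would return one row per
    # employee, with all feature values as of `date`.
    return pd.DataFrame({"snapshot_date": [date], "employee_id": [101]})

training_table = pd.concat(
    [make_snapshot(d) for d in snapshot_dates], ignore_index=True)
```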
Of course your model will not have foreseen the Covid-19 pandemic of 2020. This will always be a limitation of machine learning, which involves learning patterns from the past and applying them to the future. However, you can design any system built on your model to allow for a manual ‘adjustment factor’, for example letting you scale the predicted attrition for all employees by a user-defined constant during an economic downturn.
Analysing the causes of turnover
Most machine learning models will allow you to look inside and analyse how they make the decisions they return. This is called model explainability, often surfaced as feature importances:
If you find that distance from home to office is a major factor in attrition then you can adjust recruitment policy to prioritise candidates who live close by, or include a relocation package, or put on a company bus or car pooling scheme. Of course you can’t use this information to discriminate on age.
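With scikit-learn's tree ensembles this analysis comes almost for free via the `feature_importances_` attribute. A toy sketch with illustrative feature names:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy snapshot data (illustrative names and values).
X = pd.DataFrame({"age": [34, 45, 29, 52, 38],
                  "distance_km": [12.5, 3.2, 40.1, 8.7, 36.0],
                  "months_in_org": [46, 120, 12, 90, 20]})
y = [False, False, True, False, True]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One weight per column: higher means the feature contributed more
# to the trees' split decisions.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
```

The top-ranked features are the ones to investigate first when designing retention policy.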
The benefits of applying a model like this extend beyond its pure prediction capabilities towards insights that can modify the operations of the organisation as a whole. The cost savings to the organisation are two-fold as HR professionals can use the model’s explanations to develop retention policies across the business and also target high risk individuals with retention initiatives.
Postscript: Alternative approaches to the classification model for turnover
Other than the classification model described above, there are two other ways I can think of that you could try to model turnover.
Firstly – instead of classification, why not try a regression model to predict the total time that an employee will stay in the business from a given date?
Actually I would not recommend using regression, for the following reason. For the employees present on 1 January 2019, we know how much longer everybody stayed in the company up to today, 29 April 2020, i.e. 484 days. For anybody still in the business at day 484, we know that their total stay is greater than or equal to 484 days, but we cannot pin it down. You would have to think of a workaround for the model: if you set the stay to 484, or any arbitrarily large value, you introduce a bias which a regression model will not handle correctly, and if you simply exclude these people you introduce another bias. Statisticians would say that our data is right censored.
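The censoring problem is easy to see in code. A sketch with hypothetical leave dates: for leavers we know the exact stay, while for everyone else we only know a lower bound:

```python
from datetime import date

# As of 29 April 2020, employees still present are right censored:
# we only know their stay is at least `observation_days` long.
today = date(2020, 4, 29)
snapshot = date(2019, 1, 1)
observation_days = (today - snapshot).days  # 484 days of observation

employees = [
    {"id": 101, "leave_date": date(2019, 6, 1)},   # observed exit
    {"id": 102, "leave_date": None},               # still employed -> censored
]
for e in employees:
    if e["leave_date"] is not None:
        e["days_stayed"] = (e["leave_date"] - snapshot).days
        e["censored"] = False
    else:
        e["days_stayed"] = observation_days  # lower bound only
        e["censored"] = True
```

Any regression target built from `days_stayed` alone silently treats the lower bound for employee 102 as an exact value, which is exactly the bias described above.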
If you wanted to use machine learning to predict the total remaining time someone will stay in the business, then I would suggest training a separate classification model for each horizon (attrition within 1 month, within 2 months, and so on) and combining their outputs when you want to make a prediction.
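One way to combine them, sketched here under the assumption that each model outputs the probability of leaving during its month conditional on the employee still being present at the start of that month, is to chain the monthly probabilities into a survival curve:

```python
# Hypothetical outputs of three separate classifiers: probability of
# leaving during month k, given still present at the start of month k.
monthly_attrition = [0.05, 0.04, 0.06]

# Chain them into survival probabilities P(still employed after month k).
survival = []
p_stay = 1.0
for p_leave in monthly_attrition:
    p_stay *= 1.0 - p_leave
    survival.append(p_stay)
```

Reading along `survival` then gives the predicted chance the employee is still in post after each horizon.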
As another alternative to the classification model, we could borrow a tool from statistics: survival analysis.
This is used, for example, in clinical studies of diseases with high fatality rates (mainly cancer, heart attack and stroke), to analyse the proportion of a starting cohort of patients who have not died by various points in time.
The survival proportions can be plotted as a curve called a Kaplan-Meier curve. You can also calculate the Kaplan-Meier estimator, which approximates the survival rate at any point in time.
Survival analysis is robust to right censoring and so could be used to analyse employee attrition on a longer time scale than the machine learning model. However, it becomes more complex to use when predicting from many independent variables (commute length, age, pay grade, etc.).
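To make the estimator concrete, here is a minimal plain-Python sketch of the Kaplan-Meier product-limit calculation on toy tenure data (in practice a library such as lifelines does this for you). Censored employees still count towards the at-risk group until their censoring time, which is exactly how the method avoids the regression bias above:

```python
# Tenure in days, with an "observed" flag (False = still employed,
# i.e. right censored). Toy values for illustration only.
durations = [100, 200, 200, 300, 400]
observed = [True, True, False, True, False]

def kaplan_meier(durations, observed):
    """Return {event_time: estimated P(stay > event_time)}."""
    pairs = sorted(zip(durations, observed))
    n_at_risk = len(pairs)
    survival, s, i = {}, 1.0, 0
    while i < len(pairs):
        t = pairs[i][0]
        # All records tied at time t (sorted, so they are contiguous).
        deaths = sum(1 for d, o in pairs[i:] if d == t and o)
        removed = sum(1 for d, o in pairs[i:] if d == t)
        if deaths:
            # Product-limit step: multiply by the fraction surviving t.
            s *= 1 - deaths / n_at_risk
            survival[t] = s
        n_at_risk -= removed
        i += removed
    return survival

curve = kaplan_meier(durations, observed)
```

Note that the censored record at day 400 produces no step in the curve; it only shrinks the at-risk group for any later event times.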
I am not aware of any businesses using survival analysis to predict employee attrition but I would be interested to hear if anyone is doing it.