10 reasons why data science projects fail


The high failure rate

When I talk to my colleagues in data science about the projects we’ve done in the past, one recurring theme comes up. We ask ourselves: which of our data science projects made it through to deployment and are being used by the companies that commissioned them, and which ones failed?

I think for most of us the reality is that only a minority of what we do ends up making a difference.

According to a recent Gartner report, only between 15% and 20% of data science projects are ever completed. Of those that do complete, CEOs say that only about 8% generate value. If these figures are accurate, the overall success rate is an astonishingly low 1 to 2%.

80-85% of projects fail before completion. Then there is a further drop-off when organisations fail to implement the data scientists’ findings.

What is the root cause of project failure?

So what is going wrong?

If you talk to the data scientists and analysts, you might hear: “I built a great model with wonderful accuracy, so why did nobody use it? The business stakeholders and executives were hard to get hold of and unengaged.”

If you talk to the stakeholders, they will say: “The data scientists built a pretty model, and I was impressed by their qualifications, but it doesn’t answer our question.”

Possible causes of failure

On the business side,

  1. there was a champion of data science within the business, but that person struggled to get traction with the executives to bring in the changes recommended by the data scientists.
  2. the person who commissioned the project has moved on in the organisation and their successor won’t champion the project because they won’t get credit for it.
  3. communication broke down because the business stakeholders were too busy with day-to-day operations. Once stakeholders no longer have time to engage, it is very hard to rescue the project. This happens a lot when the data scientists are geographically distant from the core of the business.
  4. data science projects run over long timescales. In that time the business may have changed direction, or the executives may have lost patience waiting for a return on investment (ROI).
  5. although some stakeholders were engaged, the executive whose sign-off was needed was never interested in the project. This is often the case in large companies in conservative industries.

On the data science side,

  1. the data scientist lost focus and spent too long experimenting with models as if they were in academia.
  2. the data scientist wasn’t able to communicate their findings effectively to the right people.
  3. the data scientist was chasing the wrong metric.
  4. the data scientist didn’t have the right skills or tools for the problem.

On both sides,

  1. the main objective of the project was knowledge transfer but it never occurred because the business was too busy or the data scientist had inadequate communication skills.

How can we stop data science projects failing?

Recipe for a successful data science project: how to stop your project failing pre-project, during the project, and post-project.

We need to structure the data science project effectively into a series of stages, so that engagement between the analytics team and the business does not break down.

Business question: First, the project should start with a business question rather than with data or technologies. The data scientists and executives should spend time together in a workshop formulating exactly the question they want to answer. This is the initial hypothesis.

Data collection: Secondly, the data scientist should move on to collecting only the data needed to accept or reject the hypothesis. This should be done as quickly as possible rather than trying to do everything perfectly.

Back to stakeholders: Thirdly, the data scientist needs to present initial insights to the stakeholders so that the project can be properly scoped and both sides can establish what they want to achieve. The business stakeholders should be thoroughly involved at this stage, and the data scientist should make sure they understand what the ROI will be if the project proceeds. If the decision makers are not engaged at this point, it would be a waste of money to continue with the project.

Investigation stage: Now the data scientist proceeds with the project. I recommend at least weekly catch-ups with the main stakeholder, and slightly less frequent catch-ups with the high-ranking executive whose support is needed for the project. The data scientist should favour simple over complex and choose transparent AI solutions wherever possible. Throughout, the data scientist should strive to maintain engagement: time spent in meetings with the stakeholder is not wasted; it nurtures business engagement. At all points both parties should keep an eye on whether the investigation is heading towards an ROI for the organisation.

Presentation of insights: Finally, at the end of the project the data scientist should present their insights and recommendations to the main stakeholder and the other high-ranking executives. It pays to be generous with materials: produce a presentation, a video recording and a white paper, and also hand over source code, notebooks and data, so that both executive summaries and in-depth handover material are available for everyone in the commissioning organisation, from technical staff to the CEO.

If the above steps are followed, the value should be clear to the high-ranking executives by this point. The two-way communication between the data science team and the stakeholders should ensure ongoing buy-in and support from the business, and should also keep the data science work on track to deliver value by the end of the project.


Measuring the accuracy of AI for healthcare?

Left: a benign mammogram; right: a mammogram showing a cancerous tumour. Source: National Cancer Institute

You may have read about the recent Google Health study where the researchers trained and evaluated an AI model to detect breast cancer in mammograms.

It was reported in the media that the Google team’s model was more accurate than a single radiologist at recognising tumours in mammograms, although admittedly inferior to a team of two radiologists.

But what does ‘more accurate’ mean here? And how can scientists report it for a lay audience?

Imagine that we have a model that categorises images into two groups: malignant and benign. Now imagine that the model categorises everything as benign, whereas in reality 10% of images are malignant and 90% are benign. This model would be useless, but it would also be 90% accurate.

This is a simple example of why accuracy can often be misleading.

In fact it is more helpful in a case like this to report two numbers: how many malignant images were misclassified as benign (false negatives), and how many benign images were misclassified as malignant (false positives).
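To make this concrete, here is a minimal sketch in Python (using made-up labels, not the Google data) showing how the useless “everything is benign” classifier described above scores 90% accuracy while missing every single tumour, and how the two error counts expose the problem:

```python
import numpy as np

# Synthetic ground truth: 10% of 100 images are malignant (1), 90% benign (0)
y_true = np.array([1] * 10 + [0] * 90)

# A useless classifier that labels every image as benign
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
false_negatives = int(((y_true == 1) & (y_pred == 0)).sum())  # malignant called benign
false_positives = int(((y_true == 0) & (y_pred == 1)).sum())  # benign called malignant

print(f"Accuracy:        {accuracy:.0%}")     # 90% - sounds impressive
print(f"False negatives: {false_negatives}")  # 10 - every tumour was missed
print(f"False positives: {false_positives}")  # 0
```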

The Google team reported both error rates in their paper:

We show an absolute reduction of 5.7%… in false positives and 9.4%… in false negatives [compared to human radiologists].

McKinney et al, International evaluation of an AI system for breast cancer screening, Nature (2020)

This means that the model improved on both kinds of misclassification. If only one error rate had improved relative to the human experts, it would not have been possible to say definitively whether the new AI was better or worse than humans.

Calibrating a model

Sometimes we want even finer control over how our model performs. The mammogram model has two kinds of misdiagnoses: the false positive and the false negative. But they are not equal. Although neither kind of error is desirable, the consequences of missing a tumour are greater than the consequences of a false alarm.

For this reason we may want to calibrate the sensitivity of a model. Often the final stage of a machine learning model involves outputting a score: a probability of a tumour being present.

But ultimately we must decide which action to take: to refer the patient for a biopsy, or to discharge them. Should we act if our model’s score is greater than 50%? Or 80%? Or 30%?

If we set our cutoff to 50%, we are giving equal weight to the two kinds of error.

However we probably want to set the cutoff to a lower value, perhaps 25%, meaning that we err on the side of caution because we don’t mind reporting some benign images as malignant, but we really want to avoid classifying malignant images as benign.

However we can’t set the cutoff to 0% – that would mean that our model would classify all images as malignant, which is useless!

So in practice we can vary the cutoff and set it to something that suits our needs.

Choosing the best cutoff is now a tricky balancing act.
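As an illustration of that balancing act, the sketch below applies two different cutoffs to the same set of synthetic probability scores (invented for this example, not taken from the Google model) and counts the errors each cutoff produces:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 20 malignant images (1) and 180 benign images (0),
# with model scores drawn so that malignant images tend to score higher
y_true = np.array([1] * 20 + [0] * 180)
scores = np.concatenate([rng.beta(5, 2, 20),    # malignant: scores skewed towards 1
                         rng.beta(2, 5, 180)])  # benign: scores skewed towards 0

def errors_at_cutoff(cutoff):
    """Classify as malignant when the score reaches the cutoff, then count errors."""
    y_pred = (scores >= cutoff).astype(int)
    false_negatives = ((y_true == 1) & (y_pred == 0)).sum()  # missed tumours
    false_positives = ((y_true == 0) & (y_pred == 1)).sum()  # false alarms
    return false_negatives, false_positives

for cutoff in (0.50, 0.25):
    fn, fp = errors_at_cutoff(cutoff)
    print(f"cutoff {cutoff:.0%}: {fn} false negatives, {fp} false positives")
```

Lowering the cutoff from 50% to 25% generally buys fewer missed tumours at the cost of more false alarms, which is exactly the trade-off the next section visualises.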

ROC curves

If we want to evaluate how good our model is, regardless of its cutoff value, there is a neat trick we can try: we can set the cutoff to 0%, 1%, 2%, all the way up to 100%. At each cutoff value we check how many malignant→benign and benign→malignant errors we had.

Then we can plot the changing error rates as a graph.

We call this a ROC curve (ROC stands for Receiver Operating Characteristic).

The ROC curve of the Google mammogram model. The y-axis is the true positive rate and the x-axis is the false positive rate. Source: McKinney et al (2020)

The nice thing about a ROC curve is that it lets you see how a model performs at a glance. If your model is just a coin toss, your ROC curve would be a straight diagonal line from the bottom left to the top right. The fact that Google’s ROC curve bends up and to the left shows that it’s better than a coin toss.

If we need a single number to summarise how good a model is, we can take the area under the ROC curve. This is called the AUC (area under the curve), and it works much better than accuracy for comparing different AI models: a model with a higher AUC is better than one with a lower AUC.
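Continuing the synthetic example from the calibration section, scikit-learn can perform this cutoff sweep and compute the AUC in a couple of calls (a sketch, not the pipeline used in the Google study):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(42)
y_true = np.array([1] * 20 + [0] * 180)
scores = np.concatenate([rng.beta(5, 2, 20), rng.beta(2, 5, 180)])

# roc_curve sweeps the cutoff and returns the error rates at each step
false_positive_rate, true_positive_rate, cutoffs = roc_curve(y_true, scores)

# A single summary number: 0.5 is a coin toss, 1.0 is a perfect classifier
print(f"AUC: {roc_auc_score(y_true, scores):.2f}")

# Plotting true_positive_rate against false_positive_rate gives the ROC curve:
# import matplotlib.pyplot as plt
# plt.plot(false_positive_rate, true_positive_rate)
# plt.plot([0, 1], [0, 1], linestyle="--")  # the 'coin toss' diagonal
# plt.show()
```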

You can also place human readers on the same plot. Google’s ROC figure contains a green data point for the human radiologists who interpreted the mammograms. The fact that the green point lies below the model’s ROC curve, closer to the diagonal, confirms that the machine learning model was indeed better than the average human reader.

Whether the machine learning model outperformed the best human radiologists is obviously a different question.

Can we start using the mammogram AI in hospitals tomorrow?

In healthcare, as opposed to many other applications of machine learning, the cost of a false negative or false positive can be huge. For this reason we have to evaluate models carefully, and we must be very conservative when choosing the cutoff of a machine learning classifier like the mammogram model.

It is also important for a person not involved in the development of the model to evaluate and test the model very critically.

If the mammogram model were to be introduced into general clinical practice, I would expect to see the following robust checks to prove its suitability:

  • Test the model against not only the average human radiologist but also the best radiologists, to see where it is underperforming.
  • Check for any subtype of image that the model consistently gets wrong, for example poorly exposed or low-quality images.
  • Look at the explanations of the model’s correct and incorrect decisions using a machine learning interpretability package (see my earlier post on explainable machine learning models).
  • Test the model for any kind of bias with regards to race, age, body type, etc (see my post on bias).
  • Test the model in a new hospital, on a new kind of X-ray machine, to check how well it generalises. The Google team did this by training a model on British mammograms and testing on American mammograms.
  • Collect a series of pathological examples (images that are difficult to classify, even for humans) and stress test the model.
  • Assemble a number of atypical images, such as male mammograms, which will have been rare or absent in the training dataset, and check how well the model generalises.

If you think I have missed anything, please let me know. I think we are close to seeing these models in action in our hospitals, but there are still many open questions to resolve before the AI revolution conquers healthcare.

Thanks to Ram Rajamaran for some interesting discussions about this problem!

References

Hamzelou, AI system is better than human doctors at predicting breast cancer, New Scientist (2020).

McKinney et al, International evaluation of an AI system for breast cancer screening, Nature (2020).