Automated ML: the end of the data scientist?

What is automated ML?

Automated machine learning is software which in theory allows anybody to design, train, and deploy machine learning models to production environments without needing to write any code. It is often a drag-and-drop experience similar to PowerPoint.

You may have heard a lot about automated machine learning recently. Examples include Microsoft’s Azure ML Studio, Google’s Cloud AutoML and Amazon’s SageMaker Autopilot, among others.

A screenshot of Azure ML Studio’s environment being used to build a text classifier.

On 7th April 2020, Forbes even ran the headline “AutoML 2.0: Is The Data Scientist Obsolete?” (Their conclusion: no.)

In fact, according to the marketing literature of the companies selling automated ML, there is no need to hire data scientists any more. On their account, automated ML will democratise data science and allow non-technical people to build their own models.

My experience using automated ML

However, I have tried out a couple of these tools and found that, although they are extremely useful, they by no means automate even half of my work.

What’s the catch?

For one, if you look through the tutorials for any of these platforms, you will see that they nearly always start from a nice neat table: for example, your customers’ banking history, with a final column of 0s or 1s indicating whether they were granted a loan.

A table of data being imported into Azure ML

Building models is a small part of a data scientist’s job

In real life, the organisation building the model does not have a nice table of clean data lying around like this. A person’s banking or purchase history is spread over many rows of different tables in different systems. You go through several iterations of finding the different data sources and joining them up into the format that the automated ML tools expect, and you spend a lot of time pestering managers in remote departments of the company to give you access to data. It is this data gathering and cleaning (as well as pestering) which often makes up 90% of a data scientist’s job.
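To make the joining step concrete, here is a minimal sketch in Python using pandas. The two tables, their column names and their values are all made up for illustration; in a real project they would come from separate systems and need far more cleaning than this.

```python
import pandas as pd

# Hypothetical raw data: one row per transaction, as it might sit in a banking system.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [120.0, -40.0, 300.0, -15.5, 60.0, -200.0],
})

# Hypothetical loan decisions held in a different system (customer 3 has none recorded).
loans = pd.DataFrame({
    "customer_id": [1, 2],
    "loan_granted": [1, 0],
})

# Collapse the many transaction rows into one row of features per customer.
features = (
    transactions.groupby("customer_id")["amount"]
    .agg(n_transactions="count", total_amount="sum", min_amount="min")
    .reset_index()
)

# Join features to labels; the inner join drops customers with no recorded outcome.
training_table = features.merge(loans, on="customer_id", how="inner")
print(training_table)
```

The result is the kind of table the tutorials start from: one row per customer, a few feature columns, and a final 0/1 column. Everything before the last line is work the drag-and-drop tools assume has already been done.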

Furthermore, when you dig into the tutorials of these packages, you find that the drag-and-drop interface only lets you do an extremely limited set of things. Once you get away from the beginners’ examples, you find yourself having to program in Python to use the automated ML libraries. I think this was always inevitable: nobody seriously suggests that software development will be replaced by a drag-and-drop interface, so why are we having this conversation about data science?

Automated ML can be useful even for experienced data scientists

Having said that, there are some things that I found automated ML to be extremely useful for. Often, once we have done the data preparation step I described above, we end up doing a painstaking search through many different ML algorithms (Random Forests, Gradient Boosted Trees, Neural Networks, etc.), each in many different configurations. With one of the automated ML packages, you can stay in Python and simply train an automated ML model. Under the hood, the software will run every algorithm in its toolbox and pick the best-performing one.
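For comparison, the manual search described above looks roughly like the sketch below, written with scikit-learn. This is not how any particular automated ML product works internally; it is a hand-rolled illustration of the idea, using a synthetic dataset and an arbitrary shortlist of three algorithms that I chose for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a prepared training table (features X, 0/1 labels y).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A deliberately small candidate list; automated ML tools search far more widely.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Score each candidate with 5-fold cross-validated accuracy.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Pick the candidate with the highest mean accuracy and report the ranking.
best_name = max(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
print("best:", best_name)
```

Even this toy version leaves out hyperparameter tuning, preprocessing choices and ensembling, which is where the manual search becomes painstaking and where handing it to a tool saves the most time.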

I have been using Azure ML for my last few projects (predictive models in healthcare). I found that, in terms of accuracy, it outperformed the basic models that I was building in scikit-learn, and it was quicker to use as well because I only had to write a few lines of code.

In conclusion, I think that automated ML allows data scientists to be more productive and is another useful tool in a data scientist’s repertoire. In addition, it provides a degree of democratisation by allowing non-data scientists to see and participate in data science for the first time. But nobody’s job is going to be automated just yet.

References

Ryohei Fujimaki, AutoML 2.0: Is The Data Scientist Obsolete?, Forbes (2020)
