Getting a data science project off the ground is often complex and time-consuming, so I am sharing some of my thoughts and my checklist of what needs to be in place before a project can start.
I would advise the following approach: a series of calls to establish requirements, then a kickoff meeting for data exploration, then a couple of weeks for both sides to gather everything they need to start the project.
Initial meetings and calls before starting a data science project
Discuss what the client is trying to achieve. Often a client will want to define which machine learning approach is needed, without first stepping back and asking if machine learning is even necessary and what they want to achieve.
If possible, obtain a sample of data in advance as a sanity check. Without seeing the data we cannot say if the project will be possible or not.
Several key questions we should ask of the client when starting a data science project are:
- What are we predicting?
- How will it help the business?
- Has the business attempted this before? What happened?
- Are we predicting a time series, for example the volume of purchases per day? If so, what extra information do we have about the previous day that could help us?
- How many data points are there? Let’s assume a company wants to predict something about its users or customers. How many users are in the database? I have been contacted by startups who have fewer than 100 users.
- How much information do we have about each user or customer?
- At what point in time do we want to make the prediction about the user? For example, do we want to predict a user’s purchases one month from now, or a year from now?
- Is there an existing method of making a prediction? For example, we can often predict a customer’s next purchase volume simply by averaging their history. We need to think carefully if machine learning is likely to beat this baseline.
- How long has the organisation been gathering data? For example, if we want to predict purchase patterns over Christmas we would need a dataset of at least three years in order to relate one Christmas to the previous one, and to evaluate on the following Christmas.
- Does the business have a preferred cloud provider (e.g. Microsoft, Google, Amazon)? Often, if a company uses Outlook and other Microsoft products, they will prefer us to use Microsoft Azure for any deployed machine learning models, and their data protection officer may object to an external data scientist using Google or Amazon products for machine learning. A good data scientist should be prepared to work with all three.
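The "existing method" question above is easy to make concrete. The following is a minimal sketch of the averaging baseline, assuming each customer's history is a simple list of purchase volumes with the most recent value held out for evaluation. The data and the function names are hypothetical and only for illustration.

```python
from statistics import mean


def mean_history_baseline(history):
    """Predict the next value as the average of all past values."""
    return mean(history)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical data: purchase volumes per period for three customers.
# The last value of each list is held out as the value to predict.
purchases = {
    "customer_a": [10, 12, 11, 13, 12],
    "customer_b": [3, 0, 4, 2, 3],
    "customer_c": [25, 30, 28, 27, 31],
}

actual, predicted = [], []
for customer, volumes in purchases.items():
    history, held_out = volumes[:-1], volumes[-1]
    predicted.append(mean_history_baseline(history))
    actual.append(held_out)

baseline_mae = mean_absolute_error(actual, predicted)
print(f"Baseline MAE: {baseline_mae:.2f}")
```

Whatever error metric the project settles on, a figure like this gives the number that any machine learning model would need to beat to justify itself.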
On-site kickoff meeting – at least a week before the data science project starts
After investigating these questions we can arrange an on-site meeting (pandemics notwithstanding). Ideally, we would have access to the bulk of the data before this meeting, and the meeting would take place some time before the planned start of the project, as it can help to identify anything that would block the project.
- Discuss and agree on the goals of the project.
- Identify the stakeholders in the project, and who the data scientist will be reporting to. I have seen a number of projects fail in large organisations because the reporting chain between the data scientist and the stakeholders had too many links.
- Define the reporting frequency and the person to contact in case of blockers.
- Agree on and sign further NDAs if applicable.
- Request physical access to the client’s site and computer systems.
- Request access to all in-house data sources, any third-party data sources, and any APIs. In most organisations access takes at least a week to be granted.
- Request access to version control, ticketing systems, and cloud computing accounts.
- Using whatever data dump is available, do some basic data exploration. Plot histograms and scatter plots of numeric values. For any categorical or string field, find the most common value and the rarest. Eyeball the data to check whether any values change over time. Check for unexpected null values, inconsistent data types, and any other problems in the dataset.
- Try to build a very quick and dirty machine learning model. This is a sanity check that ML really can achieve something on this problem, and it shows what level of accuracy we should aim to beat.
- Agree when to reconvene to begin the project.
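Some of the data exploration checks in the list above can be sketched in a few lines. This is only an illustration, assuming the data dump has been loaded as a list of dictionaries (one per row). The `profile_column` helper and the sample records are hypothetical, and the plots would need a plotting library that is not shown here.

```python
from collections import Counter


def profile_column(records, column):
    """Summarise one column: missing values, types seen, commonest and rarest value."""
    values = [row.get(column) for row in records]
    # Treat both None and empty strings as missing.
    present = [v for v in values if v is not None and v != ""]
    ranked = Counter(present).most_common()  # sorted by count, highest first
    return {
        "nulls": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
        "commonest": ranked[0] if ranked else None,
        "rarest": ranked[-1] if ranked else None,
    }


# Hypothetical sample with the kinds of problems worth catching early.
records = [
    {"country": "UK", "age": 34},
    {"country": "UK", "age": "41"},  # age stored as a string: inconsistent type
    {"country": "DE", "age": None},  # missing age
    {"country": "", "age": 29},      # empty string counted as missing
    {"country": "UK", "age": 52},
]

for column in ("country", "age"):
    print(column, profile_column(records, column))
```

More than one entry under `types`, or an unexpectedly high `nulls` count, is the sort of finding worth raising with the client before the project starts.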
After the initial on-site meeting, ideally we would leave a couple of weeks for the client to gather data and get all the blockers out of the way so that the project can start.