Getting a data science project off the ground is often complex and time-consuming, so I am sharing some of my thoughts and my checklist for what needs to be in place before a project starts.
I would advise the following approach: a series of calls to establish requirements, followed by a kickoff meeting for data exploration, followed by a couple of weeks for both sides to get everything together that they need to get the project started.
Initial meetings and calls before starting a data science project
Discuss what the client is trying to achieve. Often a client will want to define which ML approach is needed, without first stepping back and asking if ML is even necessary and what they want to achieve.
If possible, obtain a sample of data in advance as a sanity check. Without seeing the data we cannot say if the project will be possible or not.
Several key questions we should ask of the client:
- What are we predicting?
- How will it help the business?
- Has the business attempted this before? What happened?
- Are we predicting a time series, for example the volume of purchases per day? If so, what extra information do we have about the previous day that could help us?
- How many data points are there? Let’s assume a company wants to predict something about its users or customers. How many users are in the database? I have been contacted by startups who have fewer than 100 users, which is rarely enough to train a useful model.
- How much information do we have about each user or customer?
- At what point in time do we want to make the prediction about the user? For example do we want to predict a user’s purchases one month from now, or a year from now?
- Is there an existing method of making a prediction? For example we can often predict a customer’s next purchase volume simply by averaging their history. We need to think carefully if ML is likely to beat this baseline.
- How long has the organisation been gathering data? For example if we want to predict purchase patterns over Christmas we would need a dataset of at least three years in order to relate one Christmas to the previous one, and to evaluate on the following Christmas.
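The averaging baseline mentioned in the questions above can be sketched in a few lines. This is only an illustration under my own assumptions: the function names, the choice of mean absolute error as the metric, and the sample purchase volumes are all made up, and a real project would use whatever metric the business cares about.

```python
from statistics import mean


def mean_history_baseline(history):
    """Predict the next purchase volume as the mean of past volumes."""
    if not history:
        raise ValueError("history must contain at least one purchase")
    return mean(history)


def baseline_mae(purchases_by_customer):
    """Hold out each customer's last purchase and score the baseline on it.

    Returns the mean absolute error across all customers who have at
    least two recorded purchases.
    """
    errors = []
    for volumes in purchases_by_customer.values():
        if len(volumes) < 2:
            continue  # need one past value and one held-out value
        prediction = mean_history_baseline(volumes[:-1])
        errors.append(abs(prediction - volumes[-1]))
    if not errors:
        raise ValueError("no customer has enough history to evaluate")
    return mean(errors)


# Hypothetical purchase volumes, oldest first (illustrative data only).
purchases = {
    "customer_a": [10, 12, 11, 13],
    "customer_b": [5, 5, 6],
    "customer_c": [20],  # skipped: only one purchase on record
}

mae = baseline_mae(purchases)
print(f"Baseline mean absolute error: {mae}")
```

Whatever error this baseline produces is the number an ML model would need to beat to justify its extra cost.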
On site kickoff meeting – at least a week before the data science project starts
After investigating these questions we can arrange for an on site meeting (pandemics notwithstanding). Ideally we would have access to the main bulk of the data before the on site meeting. The on site meeting would ideally be some time before the planned start of the project as it can help to identify anything which would block the project.
- Discuss and agree on the goals of the project.
- Identify the stakeholders in the project, and who the data scientist will be reporting to. I have seen a number of projects fail in large organisations because the reporting chain between the data scientist and the stakeholders had too many links.
- Define the reporting frequency and the person to contact in case of blockers.
- Agree on and sign further NDAs if applicable.
- Request physical access to the client’s site and computer systems.
- Request access to all in house data sources, any third party data sources and also any APIs. In most organisations access takes at least a week to be granted.
- Using whatever data dump is available, do some basic data exploration. Plot histograms and scatter plots of numeric values. For any categorical or string field find out what is the commonest value and what is the rarest. Eyeball the data to check if any values change over time. Check for unexpected null values, inconsistent data types, and any other problems in the dataset.
- Try to build a very quick and dirty machine learning model. This is a sanity check to ensure that ML really can achieve something on this problem and what level of accuracy we should aim to beat.
- Agree when to reconvene to begin the project.
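Some of the basic exploration checks in the list above (null values, commonest and rarest values, inconsistent data types) can be sketched as follows. This is a minimal standard-library illustration, not a full exploration workflow: the `profile_column` helper, the column names and the sample rows are hypothetical, and in practice a dataframe library and a plotting library would usually handle this along with the histograms and scatter plots.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarise one column of a list-of-dicts data dump.

    Reports the number of missing values, the Python types seen,
    and the commonest and rarest values with their counts.
    """
    values = [row.get(column) for row in rows]
    # Treat None and empty strings as missing (an assumption about the dump).
    present = [v for v in values if v is not None and v != ""]
    ranked = Counter(present).most_common()
    return {
        "nulls": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
        "commonest": ranked[0] if ranked else None,
        "rarest": ranked[-1] if ranked else None,
    }


# Hypothetical sample rows (illustrative data only).
rows = [
    {"country": "UK", "basket_value": 12.5},
    {"country": "UK", "basket_value": "n/a"},  # string in a numeric field
    {"country": "FR", "basket_value": 8.0},
    {"country": None, "basket_value": 30.0},   # unexpected null
]

country_profile = profile_column(rows, "country")
value_profile = profile_column(rows, "basket_value")
print(country_profile)
print(value_profile)
```

A column that reports more than one type, as `basket_value` does here, is usually worth raising with the client before the project starts.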
After the initial on site meeting ideally we would leave a couple of weeks for the client to gather data and get all the blockers out of the way so that the project can start.