Starting a data science project

· Thomas Wood

Getting a data science project off the ground is often complex and time-consuming, so I am sharing some of my thoughts and my checklist of what needs to be in place before a project can start.

I would advise the following approach: a series of calls to establish requirements, followed by a kickoff meeting for data exploration, followed by a couple of weeks for both sides to get everything together that they need to get the project started.

Initial meetings and calls before starting a data science project

Discuss what the client is trying to achieve. Often a client will want to define which machine learning approach is needed, without first stepping back and asking if machine learning is even necessary and what they want to achieve.

If possible, obtain a sample of data in advance as a sanity check. Without seeing the data we cannot say if the project will be possible or not.

Several key questions we should ask of the client when starting a data science project are:

  • What are we predicting?

  • How will it help the business?

  • Has the business attempted this before? What happened?

  • Are we predicting a time series, for example the volume of purchases per day? If so, what extra information do we have about the previous day that could help us?

  • How many data points are there? Let’s assume a company wants to predict something about its users or customers. How many users are in the database? I have been contacted by startups who have fewer than 100 users.

  • How much information do we have about each user or customer?

  • At what point in time do we want to make the prediction about the user? For example, do we want to predict a user’s purchases one month from now, or a year from now?

  • Is there an existing method of making a prediction? For example, we can often predict a customer’s next purchase volume simply by averaging their history. We need to think carefully if machine learning is likely to beat this baseline.

  • How long has the organisation been gathering data? For example, if we want to predict purchase patterns over Christmas we would need a dataset of at least three years in order to relate one Christmas to the previous one, and to evaluate on the following Christmas.

  • Does the business have a preferred cloud provider (e.g. Microsoft, Google, Amazon)? Often, if a company uses Outlook and other Microsoft products, they will prefer us to use Microsoft Azure for any deployed machine learning models, and their data protection officer may object to an external data scientist using Google or Amazon products for machine learning. A good data scientist should be prepared to work with all three.
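
The baseline question above can be made concrete before any machine learning is attempted. The following is a minimal sketch, assuming purchase histories are already available as simple lists of volumes per customer; the customer ids, the numbers, and the function names are hypothetical and chosen only for illustration. It predicts each customer's next purchase volume as the average of their history, holds out the most recent purchase, and reports the mean absolute error that any later model would need to beat.

```python
from statistics import mean

# Hypothetical purchase histories: customer id -> purchase volumes, oldest first.
histories = {
    "cust_a": [10, 12, 11, 13],
    "cust_b": [3, 0, 4, 5],
    "cust_c": [20, 22, 19, 21],
}


def baseline_prediction(past_volumes):
    """Predict the next purchase volume as the mean of the customer's history."""
    return mean(past_volumes)


def baseline_mean_absolute_error(histories):
    """Hold out each customer's last purchase and score the baseline on it."""
    errors = []
    for volumes in histories.values():
        past, actual = volumes[:-1], volumes[-1]
        errors.append(abs(baseline_prediction(past) - actual))
    return mean(errors)


baseline_mae = baseline_mean_absolute_error(histories)
print(f"Baseline MAE to beat: {baseline_mae:.2f}")
```

If a candidate model cannot clearly improve on this number using the same held-out purchases, that is a signal to question whether machine learning is needed for the problem at all.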

On-site kickoff meeting - at least a week before the data science project starts

A kick-off meeting is essential when starting a data science project.

After investigating these questions we can arrange for an on-site meeting (pandemics notwithstanding). Ideally, we would have access to the main bulk of the data before the on-site meeting. The on-site meeting would ideally be some time before the planned start of the project as it can help to identify anything which would block the project.

  • Discuss and agree on the goals of the project.

  • Identify the stakeholders in the project, and who the data scientist will be reporting to. I have seen a number of projects fail in large organisations because the reporting chain between the data scientist and the stakeholders had too many links.

  • Define the reporting frequency and the person to contact in case of blockers.

  • Agree on and sign further NDAs if applicable.

  • Request physical access to the client’s site and computer systems.

  • Request access to all in-house data sources, any third-party data sources and also any APIs. In most organisations access takes at least a week to be granted.

  • Request access to version control, ticketing systems, and cloud computing accounts.

  • Using whatever data dump is available, do some basic data exploration. Plot histograms and scatter plots of numeric values. For any categorical or string field find out what is the commonest value and what is the rarest. Eyeball the data to check if any values change over time. Check for unexpected null values, inconsistent data types, and any other problems in the dataset.

  • Try to build a very quick and dirty machine learning model. This is a sanity check that ML really can achieve something on this problem, and it tells us what level of accuracy we should aim to beat.

  • Agree when to reconvene to begin the project.
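
The data exploration step in the checklist above can start with something very small. The following is a rough sketch using only the Python standard library, assuming the data dump has been read into a list of dictionaries (for example with `csv.DictReader`); the column names, the sample rows, and the `profile_column` helper are hypothetical. It counts missing values, finds the commonest and rarest value of a field, and flags entries that do not parse as numbers. In practice a library such as pandas would typically be used for the same job, along with a plotting library for the histograms and scatter plots.

```python
from collections import Counter

# Hypothetical rows, in the shape csv.DictReader would yield from a data dump.
rows = [
    {"country": "UK", "plan": "pro", "spend": "120.5"},
    {"country": "UK", "plan": "free", "spend": ""},
    {"country": "FR", "plan": "free", "spend": "30"},
    {"country": "UK", "plan": "free", "spend": "n/a"},
]


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile_column(rows, column):
    """Summarise one column: missing count, commonest/rarest value, bad numbers."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None and v.strip() != ""]
    counts = Counter(present).most_common()  # sorted from most to least frequent
    return {
        "missing": len(values) - len(present),
        "commonest": counts[0] if counts else None,
        "rarest": counts[-1] if counts else None,
        "non_numeric": [v for v in present if not is_number(v)],
    }


for column in ("country", "plan", "spend"):
    print(column, profile_column(rows, column))
```

For a categorical field such as `country`, the `non_numeric` list will simply contain every value and can be ignored; for a field that is meant to be numeric, such as `spend`, anything appearing there is an inconsistent data type worth raising with the client.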

After the initial on-site meeting, ideally we would leave a couple of weeks for the client to gather data and get all the blockers out of the way so that the project can start.

We have provided the checkpoints above as a handy checklist in the Resources section of our website, together with an in-browser Gantt chart generator for NLP projects, a data science roadmap planner, a project cost planner spreadsheet, and a project risk tool.
