Frequently Asked Questions

General questions about Fast Data Science

What is Fast Data Science?

Fast Data Science is a specialist NLP and data science consultancy based in London. We are a small company and we take on consulting engagements from clients around the world in many industries. We also have a flagship product, the Clinical Trial Risk Tool, which is a software-as-a-service (SaaS) product which analyses clinical trials.

What does Fast Data Science do?

We help companies extract structured information from unstructured datasets, such as PDFs or other documents in natural language. Clients hire us to take on difficult NLP, AI, or data science tasks which they may not have the in-house capacity or specialism to handle.

How do I contact you?

The easiest way is on our contact form or by phoning us on +44 20 3488 5740.

Are you hiring?

Sorry, we don’t have any vacancies at the moment. Please follow our page on LinkedIn or X in case something comes up in future.

Do you offer internships?

Unfortunately, we don’t have any capacity for internships. However, if you would like to get involved in data science, our Harmony project (https://harmonydata.ac.uk/) is open source and we’re always happy to have more people involved in developing it.

I want to send you my resume

Feel free to send us a resume/CV. Unfortunately, we’re not hiring right now. Please follow us on LinkedIn or X in case something comes up in future.

Can you help write an academic paper?

We would be glad to help with your academic project. We have favourable rates for clients in academia. Please get in touch and we can discuss. You can check out all of our publications under https://fastdatascience.com/ai-in-research/publications-and-patents/

Do you store my data?

We use Google Analytics but do not hold any identifying information if you have visited the website. You can read more on our Privacy Policy page.

What software do you use?

We work primarily in the Python ecosystem, using Scikit-Learn, Plotly Dash, TensorFlow, spaCy, NLTK, and other AI, machine learning and NLP libraries. However, we can work with whichever software our clients need. We can use large language models via APIs such as OpenAI and Gemini, and we have also fine-tuned our own models. We are not tied to any particular cloud provider. We are in the Microsoft Partner Network and work in Microsoft Azure by preference, but we can equally work in AWS, Google Cloud, any other platform, or on on-premises servers.

Who works at Fast Data Science?

The Director of the company, Thomas Wood, does most of the consulting work, but other experts work with us on a per-project basis. Check out the team info page for more information.

Who is the Director of Fast Data Science?

The Director of Fast Data Science is Thomas Wood, who does most of the consulting work, but other experts work with us on a per-project basis.

I want to build a predictive model to predict accidents or incidents. I don’t want someone to just connect up ChatGPT. I want the system to leverage the data that my organisation has collected over several years in our CRM. Can you help us?

We can definitely help with a predictive modelling project. We have built a number of predictive models of this kind for companies based on their internal data, which could be contained in a CRM or incident list. We worked for the Office of Rail and Road (UK rail regulator) on predictive modelling on datasets of all rail incidents (e.g. vehicle striking bridge, flooding, etc), and we also worked for Tarion, the Canadian housing regulator on a similar predictive model for housing defects, e.g. electrical, drywall, etc. We’ve also done a number of customer and employee churn projects, e.g. for the National Health Service. You may also be interested in this tool which de-risks a clinical trial: https://clinicaltrialrisk.org/

For example, we could put together a simple score on a scale 0-100 which you could work out with pencil and paper, which would predict the likelihood of an incident occurring in the next month. The machine learning models that we develop can be made completely explainable. It’s a positive that you have several years of data in your CRM, which should be enough to work with.
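
As a rough illustration of what a pencil-and-paper score like this could look like, here is a minimal sketch in Python. The risk factors, the point values and the function name are all hypothetical and are not taken from any client project. In practice the points would be derived from a model fitted to your own historical data.

```python
# Hypothetical additive scorecard: every factor name and point value below is
# an illustrative assumption, not a fitted model.
POINTS = {
    "incident_in_last_12_months": 30,
    "asset_older_than_20_years": 25,
    "overdue_inspection": 25,
    "high_traffic_location": 20,
}


def incident_risk_score(record: dict) -> int:
    """Add up the points for every risk factor present, kept within 0-100."""
    total = sum(points for factor, points in POINTS.items() if record.get(factor))
    return max(0, min(100, total))


example = {"incident_in_last_12_months": True, "overdue_inspection": True}
print(incident_risk_score(example))  # 55
```

Because the score is just a sum of points, anyone can check by hand why a particular record scored the way it did, which is what makes this kind of model explainable.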

What should I look for in a data science consultancy?

We recommend checking your data science consultant has the following:

  1. Deep domain expertise - are they familiar with your industry? Do they know the difference between a “clinical trial phase” and a “marketing phase”? Or a “protocol” and a “prototype”? A generalist will treat all text data the same while a specialist knows that medical text requires specific Named Entity Recognition (NER) such as the tools and libraries developed by Fast Data Science.

  2. Proven MLOps capabilities. The consultancy should be able to demonstrate that they have successfully brought projects through to deployment. Inexperienced consultants will often deliver a “notebook” (a static analysis) that gathers dust. They may make models that will run only on their laptop and then consider the job done. Or they might evaluate a model in an entirely inappropriate setting, which doesn’t correspond to real-life usage, and then give you inflated accuracy figures. Fast Data Science has deployed a number of data science projects which are publicly visible (https://harmonydata.ac.uk/search, https://clinicaltrialrisk.org/).

  3. Transparency and explainability. Look for a consultant who can explain the models that they develop. Explainable models are less prone to bias. The consultant should be familiar with techniques like SHAP or LIME for explaining model outputs. The consultant should have a formal process for checking datasets for demographic or historical bias.

  4. Understanding of your business problem, before trying to talk about tools. Lots of consultancies may try to sell you a “Generative AI” solution before they’ve even seen your data or understood what your business needs. A good consultant should start by talking to all relevant stakeholders, which could be the VPs of every division, to understand what the AI needs to do and how it will impact your business’s bottom line and KPIs. Consultants are business people first, and technologists second. Sometimes the best solution isn’t to throw generative AI at everything. You might be fine with a simple yet intuitive regression formula. A trustworthy data science consultant will tell you when you don’t need expensive AI.

  5. Proven IP and case studies. Check out the consultant’s past engagements and look for case studies that mention ROI (e.g. “Reduced document processing time by 40%” or “Increased clinical trial failure prediction by 15%”). Also check their GitHub account (https://github.com/fastdatascience/). Consultancies that contribute to the community (like Fast Data Science does with clinical tools) usually have a much deeper grasp of the underlying technology as well as the needs of people in your field.

  6. Reasonable scoping of costs and timelines. Your consultant should be able to give you a quote after a couple of meetings and having a cursory look at your data. If they can’t commit to a fixed cost or time scale, how do you know the costs won’t run out of control? At Fast Data Science, we always give a few options of fixed costs, which also works better with many organisations’ accounting processes such as purchase orders (POs). This means we’re incentivised to work efficiently and deliver something useful. We have a lot of repeat customers and long term retainer agreements as well, as we like to keep a long term relationship with our clients.

Watch out for the following red flags:

  1. Vague Timelines: Your consultant should be able to define a “Proof of Concept” and deliver it in 3 months or less.
  2. Obsession with a particular technology: A consultant who has done a PhD in a particular niche area may be prone to focusing on a particular technology. Your consultant should be willing to work with the technology you have in house.
  3. No handover plan: Your consultant should offer training for your internal staff to take over the project.

I want to identify and address AI generated or AI assisted responses in my online survey. Can you help?

Identifying AI generated text is a difficult problem to solve. You can use stylometric techniques such as Burrows’ Delta. If there are non-text fields then we suggest collecting all values, such as response time, location, click speed, and eyeballing the data. You may see certain patterns which can be used to separate out AI responses from human responses.

Given enough data, and ideally some confirmations that certain responses are from humans or bots, it should be possible to train a machine learning model to discriminate between the two. However, this is a very tricky task: big tech companies have invested a lot in systems like CAPTCHA for this reason. Please get in touch for a consultation and Fast Data Science may be able to help.

We are hosting an academic or industry conference and we would like to invite someone to speak about AI. Do you take on speaking engagements?

We would be very glad to assist. The director, Thomas Wood, has spoken and presented at a number of academic and industry conferences, which you can find under https://fastdatascience.com/blog/events/. Please get in touch and let us know the details.

How much does data science consulting cost?

We recommend finding a consultancy which will charge you a fixed cost for the entire job. Many consultants will charge per hour, but at Fast Data Science we prefer to offer our clients a fixed cost. That means we define the outcomes of the project and any milestones, and agree on a price. This incentivises us to work efficiently.

I am making a source criticism and I’ve used Fast Data Science. Can you tell me why you’re a reliable source?

There isn’t one single answer. Every blog post you find on our website generally has references at the bottom and links to external websites. If something isn’t in a reference, it could be from our own discovery work, experience or experimentation.

For example, take this recent blog post:

https://fastdatascience.com/ai-for-business/ai-generated-text/

It discusses the Wikipedia guide for identifying AI generated text and then goes into more detail about the experiments which we conducted.

So you could say some of this is opinion, some of it is original research and some is citing other people. We have tried to use as many reliable citations as possible.

Another thing that hopefully gives credibility is our list of publications, which have been published in peer-reviewed journals:

https://fastdatascience.com/ai-in-research/publications-and-patents/

Some of our articles got picked up by other publications. For example, the New York Times has quoted us: https://www.nytimes.com/2025/05/14/technology/ai-jobs-radiologists-mayo-clinic.html - they cited this article: https://fastdatascience.com/ai-in-healthcare/ai-replace-radiologists-doctors-lawyers-writers-engineers/.

Is there a limit to how many companies your UK Company Details Google Sheets™ plugin can pull?

There is no fixed limit, but the tool may time out if you try to process more than 100 companies in one go. If you need to process thousands, the best way is to batch them in groups of 100. If you need all UK companies, you can instead download a single file from Companies House which contains all the data.
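
If you prepare your list of company numbers outside the spreadsheet, splitting it into groups of 100 could look like the minimal sketch below. The function name and the company numbers are hypothetical, and this only shows the batching step, not a call to the plugin itself.

```python
from typing import Iterator, List


def batches(items: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical company numbers, for illustration only.
company_numbers = [f"{n:08d}" for n in range(1, 251)]
groups = list(batches(company_numbers))
print([len(group) for group in groups])  # [100, 100, 50]
```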

Do you white-label your service?

Yes, we are prepared to enter into partnership agreements like this. Please get in touch with Fast Data Science (https://fastdatascience.com) to discuss the specifics.

What is the cost of the AI strategy consulting service?

Every project is unique, so we are not able to give a fixed cost on the website. However, if you get in touch with the specifics of your project, we can come back with a cost estimate.

We are looking for consultants to respond to an RFP for consultancy. Can you do this?

Please send us the RFP and we will submit a response if it is in the right area.

Open source

I tried to use one of your open source libraries and it didn’t work

Please raise an issue on the GitHub issue board for the respective library. Please include all information about what you did, what the input was, and what went wrong. We need to be able to reproduce the error which you encountered. Then we will get back to you.

I want to contribute to one of your open source libraries. How can I do this?

Please raise a pull request in the appropriate library. We appreciate it if you can keep your modifications to a minimum. That is, please ensure that you’re not pushing files which you don’t need to change. Ideally your pull request will only modify 2 or 3 files; otherwise it could probably be made more atomic. Here is a guide on making a good pull request: https://harmonydata.ac.uk/open-source-for-social-science/contributing-to-harmony-nlp-project/#forking-and-submitting-a-pull-request-pr

I have used your country-named-entity-recognition library (or another open source library) in a research project and I would like to cite the creators appropriately. Would you be able to send the authors and date of release of the library?

All of our open source libraries have a DOI and citation information in the README and CITATION.cff files. However, if you find one that is missing please use the contact form on fastdatascience.com.

For example, in the case of the Drug Named Entity Recognition library (https://github.com/fastdatascience/drug_named_entity_recognition), a citation could look like this:

Wood, T.A., Drug Named Entity Recognition [Computer software], Version 2.0.9, accessed at https://fastdatascience.com/drug-named-entity-recognition-python-library, Fast Data Science Ltd (2024)

You can also use the BibTeX format, which can be imported and converted into many citation formats:

@unpublished{drugnamedentityrecognition,
    AUTHOR = {Wood, T.A.},
    TITLE  = {Drug Named Entity Recognition (Computer software), Version 2.0.9},
    YEAR   = {2024},
    Note   = {To appear},
    url = {https://zenodo.org/doi/10.5281/zenodo.10970631},
    doi = {10.5281/zenodo.10970631}
}

Again, please check the README and CITATION.cff and if this information is missing or incorrect please let us know. And thank you very much for remembering to cite us! Your citations help us keep our open source projects alive.

Business ethics

How does Fast Data Science handle AI bias?

It is difficult to entirely eliminate AI bias from a solution. We ensure that training data for our machine learning models is free from protected category data such as gender and ethnic origin unless it is explicitly required as part of the solution. We pen-test models to check for inadvertent AI bias.

Does Fast Data Science have ethical guidelines?

All our business operations comply with modern slavery and trafficking laws. Employees, subcontractors, freelancers and suppliers are paid a fair wage. We use low carbon footprint technologies where possible, and avoid LLMs unless there is no alternative. Business meetings are conducted remotely and all travel is by train if possible. Please check out our modern slavery statement and sustainability policy.

Process & Integration

Will my team be able to maintain the AI models after the project ends?

The handover at the end of the project includes full documentation, code bases, and training and handover sessions to ensure that your internal team can manage whatever has been built in the longer term.

Do you work with Cloud or On-Premise infrastructure?

Fast Data Science is infrastructure agnostic. Whether you require a secure on-premise deployment for sensitive medical data or a scalable AWS/Azure/Google Cloud solution, we tailor the architecture to your security needs. We are not tied to any particular cloud provider.

Natural language processing

What is natural language processing?

Natural Language Processing is an area of AI. It is everything to do with getting computers to understand and produce human language. That could involve text, audio files, or any kind of documents. At Fast Data Science we take on a lot of consulting work around natural language processing in industries where a lot of text is generated, such as healthcare and pharma. You will interact with an NLP system if you use ChatGPT or Gemini, for example. You can read about more examples of NLP in this blog post: https://fastdatascience.com/natural-language-processing/what-is-natural-language-processing-with-examples/

What does an NLP consultant do in healthcare?

Healthcare and pharma contain lots of opportunities for natural language processing, as large amounts of data, such as clinical trial reports or electronic health records, are stored partly in text format. A lot of our projects involve PDFs, so we have become adept at pulling structured information out of PDFs. A common ask is anonymisation: for some of our clients we are developing software to identify and automatically redact protected health information (PHI) in clinical trial narrative reports. For another client, we are analysing electronic health records in HL7 format to identify if a patient can be included in a clinical trial (matches the inclusion criteria), or if a cancer should be reported to a registry. We have made open source libraries such as Drug Named Entity Recognition (https://github.com/fastdatascience/drug_named_entity_recognition), which are used by research teams and commercial entities around the world.

If you have a project in healthcare that needs looking at, for example, a large amount of unstructured text in PDF format, please get in touch with Fast Data Science.

We also take on consulting engagements where, for example, there is a dispute over the authorship of a document. We would analyse all documents in question using forensic stylometry, which generates a ‘fingerprint’ of an author’s writing style. We could produce an expert witness report or expert advisor report according to what you require. Please contact us for a quote.

What is the difference between a standard LLM and a ‘Domain-Specific’ model for my business? Do I need a fine-tuned LLM? Or can I make do with generalist models?

It is possible to fine-tune your own large language model. We have provided a tutorial on how to fine tune a model for document similarity: https://fastdatascience.com/generative-ai/train-ai-fine-tune-ai/

However, in most cases, we would not recommend fine-tuning your own LLM. It is time consuming, and requires a lot of data. You are unlikely to have the resources to manually tag enough data for your LLM, so ideally you already possess that data.

Furthermore, the big tech offerings such as ChatGPT, DeepSeek, and Gemini are improving so rapidly that you’re unlikely to get an improvement in accuracy over the big players. Even if you do manage to outperform them, your edge may disappear in a few months with the next release of an LLM.

If data privacy and sensitive data are your concern, we suggest you try self-hosting a large language model, or using Azure or AWS’s secure environments. You can even deploy models on Azure or AWS and remain GDPR and HIPAA compliant.

Some cases where it’s still worthwhile to train your own LLM are:

  • you are developing a sovereign AI, that is, you have state funding and government initiative behind you
  • you need to fine tune an LLM for a new language

We have worked on a number of natural language interfaces which turn very unstructured text into a structured format.

Our solutions have mostly been deployed on Azure but could be deployed on any other platform.

If we had to approach this project, we would develop a deployed API with code, which receives input text and outputs an Excel or other structured format for your systems, with options for the user to refine their prompt.

As a first pass, we would use rule-based systems to identify any entities mentioned and, if constraints allow, we could also use structured output formats such as OpenAI’s JSON format.
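
To give a flavour of what a rule-based first pass can look like, here is a minimal sketch in Python which uses a lookup list and a regular expression to pull entities out of free text and return them as JSON. The drug list, the dose pattern and the function name are hypothetical choices for illustration. A real project would use curated dictionaries for the domain in question.

```python
import json
import re

# Hypothetical lookup list and pattern, for illustration only.
KNOWN_DRUGS = {"paracetamol", "ibuprofen", "metformin"}
DOSE_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s?(mg|g|ml)\b", re.IGNORECASE)


def extract_entities(text: str) -> dict:
    """Return the drugs and doses mentioned in `text` as a plain dictionary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    drugs = sorted({token for token in tokens if token in KNOWN_DRUGS})
    doses = [f"{amount} {unit.lower()}" for amount, unit in DOSE_PATTERN.findall(text)]
    return {"drugs": drugs, "doses": doses}


result = extract_entities("Patient was given Ibuprofen 400mg and metformin 1 g daily.")
print(json.dumps(result))
# {"drugs": ["ibuprofen", "metformin"], "doses": ["400 mg", "1 g"]}
```

Rules like these are cheap to run and easy to audit, so they make a useful baseline before bringing in a large language model.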

Which is the best Large Language Model for the [legal/medical/financial etc] domain?

We have tested and evaluated 16 Large Language Models here: https://fastdatascience.com/generative-ai/openai-vs-claude-vs-qwen/

There is not much difference between the most recent state-of-the-art large language models. There is a much greater difference in capability between models that were released 6 months apart than there is between models from different vendors. We found that Chinese models such as DeepSeek and Qwen performed the same as the US models like ChatGPT and Gemini.

In general, the off the shelf models from the big tech companies also outperform any fine-tuned custom model that you or we (outside big tech companies) have the capability of building. The resources that have gone into training an LLM from scratch are comparable to the GDP of a small country. So even if it appears worthwhile to fine tune a mental health, medical, or financial model, it usually isn’t the case.

Our advice when choosing a large language model provider for your application is to pick whichever one is most convenient for you to integrate into your technology stack, while avoiding vendor lock-in. Try to ensure that you will always be able to switch easily to a different provider in future, because an API may become deprecated, a company may stop offering a particular model, or they may increase their prices. So maximum flexibility is key.
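
One way to keep that flexibility is to make your application code depend on a small interface of your own rather than on any vendor’s SDK. The sketch below is a minimal illustration in Python. The class and function names are hypothetical, and the two stand-in providers do not call any real API. In a real system each adapter would wrap one vendor’s client library.

```python
from typing import Protocol


class TextModel(Protocol):
    """Anything that turns a prompt into a completion."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in provider for testing; a real adapter would call a vendor API here."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseModel:
    """Second stand-in, showing that providers can be swapped freely."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def summarise(model: TextModel, document: str) -> str:
    """Application code depends only on the TextModel interface."""
    return model.complete(f"Summarise: {document}")


print(summarise(EchoModel(), "quarterly report"))       # echo: Summarise: quarterly report
print(summarise(UppercaseModel(), "quarterly report"))  # SUMMARISE: QUARTERLY REPORT
```

With this layout, switching provider means writing one new adapter class and leaving the rest of the application untouched.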

I would like help with a stylometry project where I need to determine an author of a document from 200 years ago. I have several candidate authors to compare against each other to determine the actual author. Is this a service you offer?

You can try out the free open source library that we have developed, Fast Stylometry (https://github.com/fastdatascience/faststylometry). This will output the probability that an unknown document is written by a particular author.

However, please check how long the documents are. Are we talking complete books, or just short letters?

Often, stylometry is only effective if you have long documents to compare, i.e. at least a chapter in length. Ideally, the documents you are comparing are of the same type, e.g. novels vs novels, speeches vs speeches. Trying to compare across document types can be difficult.

As the field moves on, large language models may reduce the amount of text needed to reach a conclusion. However, a single page, or short email, is usually far too short to be able to reach a conclusion on authorship with any certainty. Please note that high-profile stylometry “detective work”, like the identification of JK Rowling as the author of The Cuckoo’s Calling, involved entire novels. If you only have a couple of pages of text, we are unlikely to be able to do anything. However, feel free to get in touch with Fast Data Science as it may still be worth us taking a look.

I have some typed letters that I don’t believe were written by the same person. Can you prove who wrote the documents?

It is difficult to prove authorship one way or the other. However, the technology you are referring to is forensic stylometry. We have built an open source library which can run stylometry analyses. Ideally you need documents that are at least the length of a book chapter. Emails and letters are generally too short. A stylometry analysis can give you a percentage likelihood of a given person being the author of a document, provided we have enough documents from that person. The algorithm commonly used for this is called Burrows’ Delta and it has been around for quite some time, predating LLMs.
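
For readers curious about how Burrows’ Delta works, here is a minimal sketch in plain Python. It is not the implementation in our Fast Stylometry library. The function names and the toy texts are hypothetical, the population standard deviation is one of several possible choices, and the texts are far too short for a real analysis. The idea is to take the most frequent words, convert each word’s relative frequency to a z-score across the known authors, and average the absolute differences between the unknown document and each author. A lower Delta means a closer match.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenise(text: str) -> list:
    return re.findall(r"[a-z']+", text.lower())


def relative_freqs(text: str, vocab: list) -> list:
    words = tokenise(text)
    counts = Counter(words)
    total = len(words) or 1
    return [counts[word] / total for word in vocab]


def burrows_delta(known: dict, unknown: str, n_words: int = 50) -> dict:
    """Return the Delta distance from `unknown` to each author in `known`."""
    # Vocabulary: the most frequent words across all known texts.
    corpus_counts = Counter()
    for text in known.values():
        corpus_counts.update(tokenise(text))
    vocab = [word for word, _ in corpus_counts.most_common(n_words)]

    profiles = {author: relative_freqs(text, vocab) for author, text in known.items()}
    unknown_profile = relative_freqs(unknown, vocab)

    # Mean and standard deviation of each word's frequency across authors.
    columns = list(zip(*profiles.values()))
    means = [mean(column) for column in columns]
    stds = [pstdev(column) for column in columns]
    # Words used identically by every author carry no signal, so skip them.
    usable = [i for i, std in enumerate(stds) if std > 0]
    if not usable:
        raise ValueError("need at least two authors with differing word usage")

    def z(profile: list, i: int) -> float:
        return (profile[i] - means[i]) / stds[i]

    return {
        author: mean([abs(z(profile, i) - z(unknown_profile, i)) for i in usable])
        for author, profile in profiles.items()
    }


# Toy texts, far shorter than anything usable in practice.
known = {
    "author_a": "the cat and the dog and the bird sat on the mat",
    "author_b": "a cat or a dog or a bird sat upon a mat",
}
scores = burrows_delta(known, "the cat and the dog and the bird sat on the mat")
print(min(scores, key=scores.get))  # author_a
```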

Clinical trials

We are interested in benchmarking clinical trial costs for different phases and disease areas. Can you help?

Our product, the Clinical Trial Risk Tool, can produce benchmarks for all phases and disease areas where there is enough publicly available data. If an area or functionality is required but not covered, we would be keen to discuss with you as we can develop or modify features according to your needs, and we have a number of features in the pipeline which have been requested by users, which we can prioritise.

The tool can be run in the cloud or on premises. If there’s a country or region or type of trial which you want us to cover, we can always discuss this. Please get in touch to discuss your needs.

I am looking for an AI-enabled solution to reduce risk in clinical trial design and execution. Can you help?

Yes, we can help. Please get in touch and we can walk you through the Clinical Trial Risk Tool’s capabilities and discuss trial risk analysis.

Customer churn

For customer churn prediction, what time period should we use to make predictions?

The choice of time period should be whatever is most relevant for the company. Ask yourself, if you were the CEO, is it better to know who will churn in the next year, or the next month? You can always predict both. A time period that is too short will make it hard to train a machine learning model because of data sparsity. For example, if you have 10,000 customers and only 4 churners, that is too little data to learn any meaningful patterns, so you should choose a time period where a significant proportion of customers churn anyway. Find out more in our blog posts on customer churn: https://fastdatascience.com/ai-for-business/predict-customer-churn-machine-learning-ai/
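
A quick way to sanity check a candidate time period is to count what proportion of customers churn within it. The sketch below is a minimal illustration in Python with made-up data. The function name, the data layout (one end date per customer, or `None` if they are still active) and the dates are all assumptions.

```python
from datetime import date, timedelta


def churn_rate(end_dates: list, snapshot: date, horizon_days: int) -> float:
    """Share of customers active at `snapshot` who leave within the horizon."""
    active = [d for d in end_dates if d is None or d > snapshot]
    if not active:
        return 0.0
    cutoff = snapshot + timedelta(days=horizon_days)
    churned = [d for d in active if d is not None and d <= cutoff]
    return len(churned) / len(active)


# Hypothetical data: 10 customers observed from 1 January 2024.
snapshot = date(2024, 1, 1)
end_dates = [None] * 6 + [
    date(2024, 1, 20),
    date(2024, 5, 1),
    date(2024, 9, 1),
    date(2024, 11, 15),
]
print(churn_rate(end_dates, snapshot, 30))   # 0.1
print(churn_rate(end_dates, snapshot, 365))  # 0.4
```

If the shorter horizon leaves you with only a handful of churners, that is a sign to lengthen the period before training a model.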

How can we make customer churn predictions for individual customers, considering that some customers may return to the company after the selected time period?

In general, the endpoint that you are trying to predict is “will customer #12312 be active on [date]”. If they leave and re-enter, it doesn’t matter: the main thing is whether they will still be an active paying customer on the date that you care about. Your prediction will always be probabilistic, e.g. you give customer #12312 an 89% churn score. You can never be completely sure.
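
The “active on [date]” definition can be sketched in a few lines of Python. The data layout here (a list of start and end dates per customer, with the end date exclusive and `None` meaning still subscribed) is an assumption for illustration, as is the function name.

```python
from datetime import date
from typing import List, Optional, Tuple

# (start, end); end is None while the customer is still subscribed.
Interval = Tuple[date, Optional[date]]


def is_active_on(intervals: List[Interval], target: date) -> bool:
    """True if any subscription interval covers `target` (end date exclusive)."""
    return any(
        start <= target and (end is None or target < end)
        for start, end in intervals
    )


# Hypothetical customer who left in mid-2023 and came back in October.
history = [
    (date(2023, 1, 1), date(2023, 6, 30)),
    (date(2023, 10, 1), None),
]
print(is_active_on(history, date(2023, 8, 1)))  # False
print(is_active_on(history, date(2024, 1, 1)))  # True
```

A label built this way handles customers who leave and return without any special cases, because only their status on the target date counts.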
