Fast Data Science is a specialist NLP and data science consultancy based in London. We are a small company and we take on consulting engagements from clients around the world in many industries. We also have a flagship product, the Clinical Trial Risk Tool, a software-as-a-service (SaaS) product that analyses clinical trials.
We help companies extract structured information from unstructured datasets, such as PDFs or other documents in natural language. Clients hire us to take on difficult NLP, AI, or data science tasks which they may not have the in-house capacity or specialism to handle.
The easiest way to reach us is via our contact form or by phoning us on +44 20 3488 5740.
Fast Data Science - London
Sorry, we don’t have any vacancies at the moment. Please follow our page on LinkedIn or X in case something comes up in future:
Unfortunately we don’t have any capacity for internships, but if you would like to get involved in data science we have the Harmony project https://harmonydata.ac.uk/ which is open source and we’re always happy to have more people involved in developing it.
Feel free to send us a resume/CV. Unfortunately we’re not hiring right now. Please follow us in case something comes up in future:
We would be glad to help with your academic project. We have favourable rates for clients in academia. Please get in touch and we can discuss. You can check out all of our publications under https://fastdatascience.com/ai-in-research/publications-and-patents/
We use Google Analytics but do not hold any identifying information if you have visited the website. You can read more on our Privacy Policy page.
We use Python, Scikit-Learn, Plotly Dash, TensorFlow, spaCy, NLTK, and other AI, machine learning and NLP libraries primarily in the Python ecosystem; however, we can work with whichever software our clients need. We can use large language models via APIs such as OpenAI and Gemini, and we have also fine-tuned our own models. We are not tied to any particular cloud provider and we work with all major cloud computing platforms as well as on-premises servers. We work preferentially in Microsoft Azure, and we are in the Microsoft Partner Network, but we can also work in AWS, Google Cloud, or any other platform.
The Director of the company, Thomas Wood, does most of the consulting work, but other experts work with us on a per-project basis. Check out the team info page for more information.
The Director of Fast Data Science is Thomas Wood, who does most of the consulting work, but other experts work with us on a per-project basis.
We can definitely help with a predictive modelling project. We have built a number of predictive models of this kind for companies based on their internal data, which could be contained in a CRM or incident list. We worked for the Office of Rail and Road (the UK rail regulator) on predictive modelling of datasets of all rail incidents (e.g. vehicle striking bridge, flooding), and we also worked for Tarion, the Canadian housing regulator, on a similar predictive model for housing defects (e.g. electrical, drywall). We’ve also done a number of customer and employee churn projects, e.g. for the National Health Service. You may also be interested in this tool which de-risks a clinical trial: https://clinicaltrialrisk.org/
For example, we could put together a simple score on a scale of 0 to 100, which you could work out with pencil and paper, and which would predict the likelihood of an incident occurring in the next month. The machine learning models that we develop can be made completely explainable. It’s a positive that you have several years of data in your CRM, which should be enough to work with.
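Such a pencil-and-paper score can be sketched in a few lines. The risk factors and point weights below are hypothetical placeholders; in a real engagement they would be derived from a model fitted to your CRM or incident data.

```python
# Minimal sketch of a pencil-and-paper risk score on a 0-100 scale.
# The factors and weights are HYPOTHETICAL illustrations, not real values.

POINTS = {
    "incident_in_last_12_months": 30,
    "asset_older_than_20_years": 25,
    "no_inspection_in_last_year": 25,
    "high_traffic_location": 20,
}

def risk_score(record: dict) -> int:
    """Add up the points for each risk factor present, capped at 100."""
    score = sum(points for factor, points in POINTS.items() if record.get(factor))
    return min(score, 100)

print(risk_score({"incident_in_last_12_months": True,
                  "no_inspection_in_last_year": True}))  # → 55
```

Because the score is an additive table of points, anyone can compute it by hand and see exactly why a given asset was flagged.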
We recommend checking your data science consultant has the following:
Deep domain expertise - are they familiar with your industry? Do they know the difference between a “clinical trial phase” and a “marketing phase”? Or a “protocol” and a “prototype”? A generalist will treat all text data the same while a specialist knows that medical text requires specific Named Entity Recognition (NER) such as the tools and libraries developed by Fast Data Science.
Proven MLOps capabilities. The consultancy should be able to demonstrate that they have successfully brought projects through to deployment. Inexperienced consultants will often deliver a “notebook” (a static analysis) that gathers dust. They may make models that will run only on their laptop and then consider the job done. Or they might evaluate a model in an entirely inappropriate setting, which doesn’t correspond to real-life usage, and then give you inflated accuracy figures. Fast Data Science has deployed a number of data science projects which are publicly visible (https://harmonydata.ac.uk/search, https://clinicaltrialrisk.org/).
Transparency and explainability. Look for a consultant who can explain the models that they develop. Explainable models are less prone to bias. The consultant should be familiar with techniques like SHAP or LIME for explaining model outputs. The consultant should have a formal process for checking datasets for demographic or historical bias.
Understanding of your business problem, before trying to talk about tools. Lots of consultancies may try to sell you a “Generative AI” solution before they’ve even seen your data or understood what your business needs. A good consultant should start by talking to all relevant stakeholders, which could be the VPs of every division, to understand what the AI needs to do and how it will impact your business’s bottom line and KPIs. Consultants are business people first, and technologists second. Sometimes the best solution isn’t to throw generative AI at everything. You might be fine with a simple yet intuitive regression formula. A trustworthy data science consultant will tell you when you don’t need expensive AI.
Proven IP and case studies. Check out the consultant’s past engagements and look for case studies that mention ROI. (e.g., “Reduced document processing time by 40%” or “Increased clinical trial failure prediction by 15%”). Also check their GitHub account (https://github.com/fastdatascience/). Consultancies that contribute to the community (like Fast Data Science does with clinical tools) usually have a much deeper grasp of the underlying technology as well as the needs of people in your field.
Reasonable scoping of costs and timelines. Your consultant should be able to give you a quote after a couple of meetings and a cursory look at your data. If they can’t commit to a fixed cost or timescale, how do you know the costs won’t run out of control? At Fast Data Science, we always give a few options of fixed costs, which also works better with many organisations’ accounting processes, such as purchase orders (POs). This means we’re incentivised to work efficiently and deliver something useful. We also have a lot of repeat customers and long-term retainer agreements, as we like to keep a long-term relationship with our clients.
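On the transparency and explainability point above: for a linear model, each prediction can be decomposed into per-feature contributions (coefficient times the feature's deviation from a baseline), which is the same idea that SHAP formalises for more general models. A minimal sketch, with hypothetical coefficients and feature values:

```python
# Minimal sketch of per-feature attribution for a linear model.
# Coefficients, baseline values, and feature names are HYPOTHETICAL.

def linear_contributions(coefs, x, baseline):
    """Contribution of each feature: coefficient * (value - baseline value)."""
    return {name: coefs[name] * (x[name] - baseline[name]) for name in coefs}

coefs = {"age": 0.4, "income": -0.2}
baseline = {"age": 50.0, "income": 30.0}   # e.g. the dataset means
x = {"age": 60.0, "income": 20.0}          # one customer to explain

print(linear_contributions(coefs, x, baseline))
# → {'age': 4.0, 'income': 2.0}
```

A consultant should be able to produce a breakdown like this for any individual prediction, whether via a simple decomposition or a library such as SHAP or LIME.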
Watch out for the following red flags:
Identifying AI-generated text is a difficult problem to solve. You can use stylometric techniques such as Burrows’ Delta. If there are non-text fields, then we suggest collecting all values, such as response time, location, and click speed, and eyeballing the data. You may see certain patterns which can be used to separate AI responses from human responses.
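Burrows' Delta itself is simple to sketch: z-score the relative frequencies of common function words across a reference corpus, then take the mean absolute difference between the two documents' z-scores. The toy corpus and two-word vocabulary below are placeholders; real analyses use the top few hundred function words over much longer texts.

```python
# Minimal sketch of Burrows' Delta on TOY data (illustrative only).
from collections import Counter
from statistics import mean, stdev

def rel_freqs(text, vocab):
    """Relative frequency of each vocabulary word in a text."""
    words = text.lower().split()
    counts = Counter(words)
    return {w: counts[w] / len(words) for w in vocab}

def burrows_delta(doc_a, doc_b, corpus, vocab):
    """Mean absolute difference of z-scored word frequencies (Burrows' Delta)."""
    freqs = [rel_freqs(d, vocab) for d in corpus]
    mu = {w: mean(f[w] for f in freqs) for w in vocab}
    sigma = {w: stdev(f[w] for f in freqs) or 1e-9 for w in vocab}
    fa, fb = rel_freqs(doc_a, vocab), rel_freqs(doc_b, vocab)
    return mean(abs((fa[w] - mu[w]) / sigma[w] - (fb[w] - mu[w]) / sigma[w])
                for w in vocab)

corpus = ["the cat sat on the mat",
          "the dog ran to the park",
          "a bird flew over a tree"]
vocab = ["the", "a"]
print(burrows_delta(corpus[0], corpus[2], corpus, vocab))
```

A lower Delta means the two documents are stylistically closer; a document compared with itself scores exactly zero.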
Given enough data, and ideally some confirmations that certain responses are from humans or bots, it should be possible to train a machine learning model to discriminate between the two. However, this is a very tricky task and remember that big tech companies have invested a lot in systems like Captcha for this reason. Please get in touch for a consultation and Fast Data Science may be able to help.
We would be very glad to assist. The director, Thomas Wood, has spoken and presented at a number of academic and industry conferences, which you can find under https://fastdatascience.com/blog/events/. Please get in touch and let us know the details.
We recommend finding a consultancy which will charge you a fixed cost for the entire job. Many consultants will charge per hour, but at Fast Data Science we prefer to offer our clients a fixed cost. That means we define the outcomes of the project and any milestones, and agree on a price. This incentivises us to work efficiently.
There isn’t one single answer. Every blog post you find on our website generally has references at the bottom and links to external websites. If something isn’t in a reference, it could be from our own discovery work, experience, or experimentation.
For example, let’s take this recent blog post as an example:
https://fastdatascience.com/ai-for-business/ai-generated-text/
It discusses the Wikipedia guide for identifying AI generated text and then goes into more detail about the experiments which we conducted.
So you could say some of this is opinion, some of it is original research, and some is citing other people. We have tried to use as many reliable citations as possible.
Another thing that hopefully gives credibility is our list of publications, which have been published in peer-reviewed journals:
https://fastdatascience.com/ai-in-research/publications-and-patents/
Some of our articles got picked up by other publications. For example, the New York Times has quoted us: https://www.nytimes.com/2025/05/14/technology/ai-jobs-radiologists-mayo-clinic.html - they cited this article: https://fastdatascience.com/ai-in-healthcare/ai-replace-radiologists-doctors-lawyers-writers-engineers/.
There is not a fixed limit, but the tool may time out if you try to process more than 100 companies in one go. If you need to do thousands, the best way is to batch them in groups of 100. If you need all UK companies, you can also download a single file from Companies House which contains all the data.
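Batching a long list into groups of 100 is a few lines in most languages. A minimal Python sketch, where the per-batch call is a placeholder for however you invoke the tool:

```python
# Minimal sketch of splitting a long list of company numbers into batches
# of at most 100 before submitting each batch to the tool.

def batches(items, size=100):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

company_numbers = [f"{n:08d}" for n in range(250)]  # dummy company numbers
for batch in batches(company_numbers):
    pass  # placeholder: submit this batch of up to 100 companies to the tool
```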
Yes, we are prepared to enter into partnership agreements like this. Please get in touch with Fast Data Science (https://fastdatascience.com) to discuss the specifics.
Every project is unique, so we are not able to give a fixed cost on the website. However, if you get in touch with the specifics of your project, we can come back with a cost estimate.
Please send us the RFP and we will submit a response if it is in the right area.
Please raise an issue on the GitHub issue board for the respective library. Please include all information about what you did, what the input was, and what went wrong. We need to be able to reproduce the error which you encountered. Then we will get back to you.
Please raise a pull request in the appropriate library. We appreciate it if you can keep your modifications to a minimum. That is, please ensure that you’re not pushing files which you don’t need to change. Ideally your pull request will only modify 2 or 3 files; otherwise it could be made more atomic. Here is a guide on making a good pull request: https://harmonydata.ac.uk/open-source-for-social-science/contributing-to-harmony-nlp-project/#forking-and-submitting-a-pull-request-pr
All of our open source libraries have a DOI and citation information in the README and CITATION.cff files. However, if you find one that is missing please use the contact form on fastdatascience.com.
For example, in the case of the Drug Named Entity Recognition library (https://github.com/fastdatascience/drug_named_entity_recognition), a citation could look like this:
Wood, T.A., Drug Named Entity Recognition [Computer software], Version 2.0.9, accessed at https://fastdatascience.com/drug-named-entity-recognition-python-library, Fast Data Science Ltd (2024)
You can use a Bibtex format which can be imported and converted into many citation formats:
@unpublished{drugnamedentityrecognition,
  author = {Wood, T.A.},
  title  = {Drug Named Entity Recognition (Computer software), Version 2.0.9},
  year   = {2024},
  note   = {To appear},
  url    = {https://zenodo.org/doi/10.5281/zenodo.10970631},
  doi    = {10.5281/zenodo.10970631}
}
Again, please check the README and CITATION.cff and if this information is missing or incorrect please let us know. And thank you very much for remembering to cite us! Your citations help us keep our open source projects alive.
It is difficult to entirely eliminate AI bias from a solution. We ensure that training data for our machine learning models is free from protected category data such as gender and ethnic origin unless it is explicitly required as part of the solution. We pen-test models to check for inadvertent AI bias.
All our business operations comply with modern slavery and trafficking laws. Employees, subcontractors, freelancers and suppliers are paid a fair wage. We use low carbon footprint technologies where possible, and avoid LLMs unless there is no alternative. Business meetings are conducted remotely and all travel is by train if possible. Please check out our modern slavery statement and sustainability policy.
The handover at the end of the project includes full documentation, code bases, and training and handover sessions to ensure that your internal team can manage whatever has been built over the longer term.
Fast Data Science is infrastructure agnostic. Whether you require a secure on-premises deployment for sensitive medical data or a scalable AWS/Azure/Google Cloud solution, we tailor the architecture to your security needs. We are not tied to any particular cloud provider.
Natural Language Processing is an area of AI. It is everything to do with getting computers to understand and produce human language. That could involve text, audio files, or any kind of documents. At Fast Data Science we take on a lot of consulting work around natural language processing in industries where a lot of text is generated, such as healthcare and pharma. You will interact with an NLP system if you use ChatGPT or Gemini, for example. You can read about more examples of NLP in this blog post: https://fastdatascience.com/natural-language-processing/what-is-natural-language-processing-with-examples/
Healthcare and pharma contain lots of opportunities for natural language processing, as large amounts of data, such as clinical trial reports or electronic health records, are stored partly in text format. A lot of our projects involve PDFs, so we have become adept at pulling structured information out of PDFs. A common ask is anonymisation: for some of our clients we are developing software to identify and automatically redact protected health information (PHI) in clinical trial narrative reports. For another client, we are analysing electronic health records in HL7 format to identify whether a patient can be included in a clinical trial (matches the inclusion criteria), or whether a cancer should be reported to a registry. We have made open source libraries such as Drug Named Entity Recognition: https://github.com/fastdatascience/drug_named_entity_recognition which are used by research teams and commercial entities around the world.
If you have a project in healthcare that needs looking at, for example, a large amount of unstructured text in PDF format, please get in touch with Fast Data Science.
We take on consulting engagements where, e.g. there is a dispute over the authorship of a document. We would analyse all documents in question using forensic stylometry, which generates a ‘fingerprint’ of an author’s writing style. We could produce an expert witness report or expert advisor report according to what you require. Please contact us for a quote.
It is possible to fine-tune your own large language model. We have provided a tutorial on how to fine tune a model for document similarity: https://fastdatascience.com/generative-ai/train-ai-fine-tune-ai/
However, in most cases, we would not recommend fine-tuning your own LLM. It is time-consuming and requires a lot of labelled data. You are unlikely to have the resources to manually tag enough data for your LLM, so fine-tuning only makes sense if you already possess that data.
Furthermore, the big tech offerings such as ChatGPT, DeepSeek, and Gemini are improving so rapidly that you’re unlikely to get an improvement in accuracy over the big players. Even if you do manage to beat them, your edge may disappear in a few months with the next release of an LLM.
If data privacy and sensitive data are your concern, we suggest you try self-hosting a large language model, or using Azure or AWS’s secure environments. You can even deploy models on Azure or AWS and remain GDPR and HIPAA compliant.
Some cases where it’s still worthwhile to train your own LLM are:
We have worked on a number of natural language interfaces which turn very unstructured text into a structured format, such as:
Our solutions have mostly been deployed on Azure but could be deployed on any other platform.
If we were to approach this project, we would develop a deployed API which receives input text and outputs Excel or another structured format for your systems, with options for the user to refine their prompt.
As a first pass, we would use rule based systems to identify any entities mentioned, and if constraints allow we could also use structured output formats such as OpenAI’s JSON format.
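A rule-based first pass of this kind can be sketched with regular expressions that pull entities out of free text into a structured dict, ready for export to Excel or JSON. The patterns and field names below are illustrative only, not the ones we would use on a real project:

```python
# Minimal sketch of a rule-based entity extractor: regexes map free text
# to a structured dict. Patterns and field names are ILLUSTRATIVE only.
import json
import re

PATTERNS = {
    "dates": r"\b\d{1,2}/\d{1,2}/\d{4}\b",
    "amounts": r"£\d+(?:,\d{3})*(?:\.\d{2})?",
    "emails": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
}

def extract_entities(text: str) -> dict:
    """Return every pattern match, grouped by field name."""
    return {field: re.findall(pattern, text) for field, pattern in PATTERNS.items()}

text = "Invoice of £1,250.00 sent to jane@example.com on 12/03/2024."
print(json.dumps(extract_entities(text), indent=2))
```

An LLM with a structured output format (such as OpenAI's JSON mode) could then handle the entities that rules cannot reliably capture.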
We have tested and evaluated 16 Large Language Models here: https://fastdatascience.com/generative-ai/openai-vs-claude-vs-qwen/
There is not much difference between the most recent state-of-the-art large language models. There is a much greater difference in capability between models that were released 6 months apart than there is between models from different vendors. We found that Chinese models such as DeepSeek and Qwen performed on a par with US models such as ChatGPT and Gemini.
In general, the off-the-shelf models from the big tech companies also outperform any fine-tuned custom model that you or we (outside the big tech companies) have the capability of building. The resources that have gone into training an LLM from scratch are comparable to the GDP of a small country. So even if it appears worthwhile to fine-tune a mental health, medical, or financial model, it usually isn’t.
Our advice when choosing a large language model provider for your application is to pick whichever one is most convenient for you to integrate into your technology stack, while avoiding vendor lock-in. Try to ensure that you will always be able to switch easily to a different provider in future, because an API may become deprecated, a company may stop offering a particular model, or prices may increase. So maximum flexibility is key.
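One common way to keep that flexibility is to hide each provider behind a shared interface, so the rest of the codebase never imports a vendor SDK directly and a provider swap touches one class. A minimal sketch with stub provider classes (real ones would call the vendor's API; the class and method names are our own, not any vendor's):

```python
# Minimal sketch of a provider-agnostic LLM wrapper to avoid vendor lock-in.
# The provider classes are STUBS; real implementations would call each API.
from typing import Protocol

class LLMClient(Protocol):
    def complete(self, prompt: str) -> str: ...

class OpenAIClient:
    def complete(self, prompt: str) -> str:
        return f"[openai] {prompt}"      # stub: would call the OpenAI API

class GeminiClient:
    def complete(self, prompt: str) -> str:
        return f"[gemini] {prompt}"      # stub: would call the Gemini API

def summarise(client: LLMClient, text: str) -> str:
    """Application code depends only on the interface, not the vendor."""
    return client.complete(f"Summarise: {text}")

print(summarise(OpenAIClient(), "quarterly report"))  # → [openai] Summarise: quarterly report
```

Swapping `OpenAIClient` for `GeminiClient` (or a self-hosted model) then requires no changes to the application code.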
You can try out the free open source library that we have developed, Fast Stylometry (https://github.com/fastdatascience/faststylometry). This will output the probability that an unknown document is written by a particular author.
However, please check how long the documents are: are we talking complete books, or just short letters?
Often, stylometry is only effective if you have long documents, i.e. at least a chapter in length, to compare. Ideally the documents you are comparing should be of the same type, e.g. novels vs novels, speeches vs speeches. Trying to compare across document types can be difficult.
As the field moves on, large language models may reduce the amount of text needed to reach a conclusion. However, a single page, or short email, is usually far too short to be able to reach a conclusion on authorship with any certainty. Please note that high-profile stylometry “detective work”, like the identification of JK Rowling as the author of The Cuckoo’s Calling, involved entire novels. If you only have a couple of pages of text, we are unlikely to be able to do anything. However, feel free to get in touch with Fast Data Science as it may still be worth us taking a look.
It is difficult to prove authorship one way or the other, however, the technology you are referring to is forensic stylometry. We have built an open source library which can run stylometry analyses. Ideally you need documents that are at least the length of a book chapter. Emails and letters are generally too short. A stylometry analysis can give you a percentage likelihood of a given person being the author of a document, provided we have enough documents from that person. The algorithm commonly used for this is called Burrows’ Delta and it has been around for quite some time, predating LLMs.
Our product, the Clinical Trial Risk Tool, can produce benchmarks for all phases and disease areas where there is enough publicly available data. If an area or functionality is required but not covered, we would be keen to discuss with you as we can develop or modify features according to your needs, and we have a number of features in the pipeline which have been requested by users, which we can prioritise.
The tool can be run in the cloud or on premises. If there’s a country or region or type of trial which you want us to cover, we can always discuss this. Please get in touch to discuss your needs.
Yes, we can help. Please get in touch and we can walk you through the Clinical Trial Risk Tool’s capabilities and discuss trial risk analysis.
The choice of time period should be whatever is most relevant for the company. Ask yourself: if you were the CEO, would it be better to know who will churn in the next year, or the next month? You can always predict both. A time period that is too short will make it hard to train a machine learning model because of data sparsity. For example, if you have 10,000 customers and only 4 churners, that is too little data to learn any meaningful patterns, so you should choose a time period over which a significant proportion of customers churn. Find out more in our blog posts on customer churn: https://fastdatascience.com/ai-for-business/predict-customer-churn-machine-learning-ai/
In general, the endpoint you are trying to predict is “will customer #12312 be active on [date]?”. If they leave and re-enter, it doesn’t matter; the main thing is whether they will still be an active paying customer on the date that you care about. Your prediction will always be probabilistic, e.g. you give customer #12312 an 89% churn score. You can never be completely sure.
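The endpoint definition above can be sketched directly: a customer's churn label on a given date is simply whether any of their paid periods covers that date, regardless of lapses in between. The field names and dates below are illustrative:

```python
# Minimal sketch of the churn endpoint: is the customer an active paying
# customer on the date we care about? Dates are ILLUSTRATIVE dummy data.
from datetime import date

def is_active(periods, on: date) -> bool:
    """True if any (start, end) subscription period covers the date."""
    return any(start <= on <= end for start, end in periods)

periods = [(date(2023, 1, 1), date(2023, 6, 30)),   # customer left...
           (date(2023, 9, 1), date(2024, 3, 31))]   # ...and re-joined

print(is_active(periods, date(2023, 8, 1)))   # → False (lapsed on that date)
print(is_active(periods, date(2024, 1, 15)))  # → True
```

A model trained on such labels then outputs a probability that `is_active` will be false on the target date, i.e. the churn score.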
What we can do for you