Guest post by Alex Nikic
In the past few years, Generative AI technology has advanced rapidly, and businesses are increasingly adopting it for a variety of tasks. While GenAI excels at tasks such as document summarisation, question answering, and content generation, it lacks the ability to provide reliable forecasts for future events. GenAI models are not designed for forecasting, and given their tendency to hallucinate information, their output should not be trusted when planning key business decisions. For more details, a previous article on our blog explores in depth the trade-offs of GenAI vs traditional Machine Learning approaches.
Additionally, we may want to enhance our existing numerical models with the information contained in text data, in order to make effective use of all our data sources. This problem came to my attention while I was completing my Master's thesis, where I attempted to improve predictions of horse races by combining market information with the text summaries provided by analysts.
So how can we integrate our text data and numerical data together to produce models that provide interpretable and trustworthy outputs? This blog post explores some research papers detailing creative applications of NLP predictive pipelines in a variety of domains, including forecasting of stock prices, sporting events, movie sales, and even clinical trial outcomes.
Social media has undoubtedly become an integral part of our daily lives, allowing us to converse and share ideas on a wide range of topics. It has become ever easier for anyone to post their opinion online, with the ability to reach thousands or even millions of readers. Therefore social media sites provide a large sample of data which can be used to infer public sentiment on nearly any topic, including both recent and upcoming events.
In particular, X (formerly Twitter) is a great source of information for such tasks, due to its large user base and easy-to-use API for compiling datasets. In a paper titled Predicting the Future With Social Media[3], a team of researchers demonstrated that the rate at which tweets are created about particular topics can predict box-office revenues for movies, even outperforming market-based predictors such as the Hollywood Stock Exchange. They created a metric named “tweet-rate”, defined as the number of tweets about a movie per hour, and found that this was strongly positively correlated with opening-weekend box-office revenues. Furthermore, they applied a technique called sentiment analysis to the tweets, classifying each tweet as either “Positive”, “Negative”, or “Neutral”.
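Sentiment analysis of this kind is often bootstrapped from opinion word lists, such as the Hu and Liu lexicon cited in the references below. A minimal sketch of a lexicon-based classifier, with a toy handful of words standing in for the full lexicon:

```python
POSITIVE = {"great", "love", "amazing"}   # toy stand-in for a full opinion lexicon
NEGATIVE = {"boring", "awful", "hate"}

def classify(tweet: str) -> str:
    """Label a tweet Positive/Negative/Neutral by counting lexicon hits."""
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"
```

Real systems use far larger word lists (or trained models), but the principle is the same: map free text to a discrete sentiment label that can be counted and aggregated.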
The researchers found that incorporating the ratio of positive to negative tweets for each movie alongside the tweet-rate improved sales predictions even further. Outside of Hollywood, this method could help your business understand the public's reaction to a new product or service, forecasting sales and revenue.
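The two predictors described above can be sketched in a few lines (the input format here is a hypothetical simplification of a real tweet dataset):

```python
from collections import Counter

def tweet_features(sentiments, window_hours):
    """Compute the paper's two predictors: tweet-rate (tweets per hour)
    and the ratio of positive to negative tweets."""
    counts = Counter(sentiments)
    tweet_rate = len(sentiments) / window_hours
    pn_ratio = counts["Positive"] / max(counts["Negative"], 1)  # avoid division by zero
    return tweet_rate, pn_ratio

rate, ratio = tweet_features(
    ["Positive", "Positive", "Negative", "Neutral"], window_hours=2
)
```

These two numbers can then be fed into an ordinary regression model against opening-weekend revenue, which is essentially the setup the paper evaluates.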
Not only is social media a valuable tool for prediction by itself, but when combined with other sources of information it can further enhance the predictive power of our models. An interesting application of this can be seen in the paper Forecasting with Social Media: Evidence from Tweets on Soccer Matches, where the researchers evaluated whether content from X (formerly Twitter) can add information to football match probability forecasts produced by the betting exchange Betfair.[7] A key feature of this study is the hypothesis of the “wisdom of the crowd”, which states that the aggregated opinion of many people is more accurate than that of any individual. By harnessing this data, the researchers showed that an overall positive tone of Tweets indicates that the team in question is 3.39% more likely to win than betting prices imply - and that positive Tweets after significant match events such as goals or red cards lead to an 8.12% improvement over the market. What makes this result fascinating is that spectator sentiment does not influence the outcome of a football match, nor does it predict very accurately on its own - but when combined with other sources of information, we can improve our forecasts for the match.
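To make the idea concrete, here is an illustrative adjustment (my own simplification, not the paper's exact specification) that nudges a market-implied probability using the effect sizes reported in the study:

```python
def adjusted_win_prob(market_prob, net_tone, after_key_event=False):
    """Illustrative only: scale a market-implied win probability upward
    when aggregate tweet tone is positive, using the reported effect
    sizes (3.39% in general, 8.12% after goals or red cards)."""
    if net_tone <= 0:
        return market_prob
    uplift = 0.0812 if after_key_event else 0.0339
    return min(market_prob * (1 + uplift), 1.0)
```

For example, a team priced at 50% by the market, with positive overall tweet tone, would be adjusted to roughly 51.7%. The paper's actual analysis is regression-based; this snippet only shows the direction and magnitude of the effect.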
A clinical trial is a research project carried out to test the effectiveness of a new medicine, treatment, or procedure. Clinical trials are essential for finding out whether a potential new treatment works and is safe, and they provide evidence for the eventual implementation of the treatment in clinical practice. However, these trials are expensive to carry out, so it is of interest to research organisations to predict the outcome of a trial in order to determine whether it is worth the investment to continue drug development. At Fast Data Science, we are developing a Clinical Trial Risk tool to assist experts in estimating the risk of a clinical trial ending uninformatively. You can try out the tool for yourself here.
In order to develop predictive models for this task, it is necessary to have a comprehensive dataset of outcomes from past clinical trials to train our models on. In the paper Automatically Labeling Clinical Trial Outcomes: A Large-Scale Benchmark for Drug Development, Chufan Gao et al. addressed this challenge by creating a dataset of 125,000 drug and biologics trials, using large language model (LLM) interpretations of publications, trial phase progression tracking, sentiment analysis from news sources, stock price movements of trial sponsors, and additional trial-related metrics to automatically label clinical trial outcomes.[6] Employing humans to manually annotate the dataset would be prohibitively expensive and time-consuming, due to the sheer amount of information that must be meticulously combed through for each trial. By combining different sources of text and numerical information, the team were able to generate accurate labels across the dataset, which can then be used by future researchers and developers to inform decision-making processes for new clinical trials.
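The aggregation step can be pictured as a weighted vote over noisy per-source signals. The source names and weights below are hypothetical, invented for illustration rather than taken from the paper:

```python
def aggregate_outcome(signals, weights):
    """Weighted vote over noisy outcome signals: +1 = evidence of success,
    -1 = evidence of failure, 0 = source abstains. Returns a label and score."""
    score = sum(weights[src] * s for src, s in signals.items())
    label = "success" if score > 0 else "failure" if score < 0 else "unknown"
    return label, score

# Hypothetical per-trial signals from the kinds of sources the paper combines
signals = {"publication_llm": 1, "phase_progression": 1,
           "news_sentiment": -1, "stock_move": 0}
weights = {"publication_llm": 2.0, "phase_progression": 1.5,
           "news_sentiment": 1.0, "stock_move": 1.0}
label, score = aggregate_outcome(signals, weights)
```

The appeal of this weak-supervision style of labelling is that no single source needs to be reliable on its own; agreement across sources is what produces a trustworthy label.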
An accurate forecast of the stock market is essential for investors, financial institutions, and company executives, informing business strategy, risk management, and investment decisions. While stock data consists of prices that evolve over time and can be modelled quantitatively using time series methods, it would be negligent to ignore the effect of company reports, news articles, and even social media posts on the direction of stock prices. For example, Donald Trump and Elon Musk are well-known for their controversial tweets and surprise announcements, and Rachel Reeves is soon to announce her autumn budget, all of which can have an impact on the stock market. Recently, the video game company Valve released patch notes to update their game Counter-Strike 2, which crashed the online market for in-game items and wiped out over $2bn in value overnight. Biotech stocks are well-known to be risky investments, with the announcement of a bad drug trial or a denial of regulatory approval having the potential to crash their stocks, with some never recovering. So how can we incorporate such information into our models to account for these risks?
While there is extensive research on this topic, one of the more recent papers relating to NLP for finance is FinBERT: A Large Language Model for Extracting Information from Financial Text, where the researchers adapt Google's popular BERT model to classify the sentiment of financial texts such as corporate filings, analyst reports, and earnings conference call transcripts.[5] By training the model specifically on financial text data, they achieved improvements over the base model on a number of metrics, particularly on discussions relating to environmental, social, and governance (ESG) issues. When implemented in a broader pipeline, FinBERT can provide signals from text data which can be used to inform trading decisions, or even be passed forward into further models for algorithmic trading. As the finance industry is infamously competitive, profitable algorithms are unlikely to be released into the public domain, so we can only speculate how such models are being used by financial firms. However, I believe that in a market as competitive as today's, financial companies are certainly looking to leverage text data as an alternative data source to gain an edge over their competition.
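One plausible downstream step (the scoring scheme here is my own illustration, not from the FinBERT paper) is to collapse the model's per-sentence labels for a document into a single tone signal:

```python
def net_tone(sentence_labels):
    """Collapse FinBERT-style sentence labels ('positive'/'negative'/'neutral')
    into one signal in [-1, 1]: (positive - negative) / total."""
    if not sentence_labels:
        return 0.0
    pos = sentence_labels.count("positive")
    neg = sentence_labels.count("negative")
    return (pos - neg) / len(sentence_labels)

# e.g. labels produced for the sentences of one earnings-call transcript
signal = net_tone(["positive", "positive", "neutral", "negative"])
```

A signal like this could then sit alongside price-based features in a trading model, which is the kind of pipeline the paper envisages.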
After receiving a product or a service, many of us are often asked to provide feedback in the form of a rating, typically on a scale of 1 to 5 stars. Examples of such systems can be seen in Google Maps restaurant reviews, Uber customer and driver ratings, and Amazon product reviews. However, we now live in a world where these ratings are over-inflated (how many people do you know who simply click 5 stars for their Uber drivers regardless of how the ride went?), meaning that these ratings tend to be unrepresentative of the actual product or service.
The paper Large-Scale Cross-Category Analysis of Consumer Review Content on Sales Conversion Leveraging Deep Learning analyses the written reviews of 600 different product categories on an e-commerce site to discover how they affect sales conversion.[4] Customers may rely on these reviews for products which are expensive or whose quality they are uncertain of. The researchers used a Deep Learning model to extract six content dimensions from the reviews: aesthetics, conformance, durability, feature, brand, and price, each of which can be described either positively or negatively. They found that this information has a higher impact on sales conversion when the market is competitive or other information about the product is limited. Additionally, the paper notes that changing the order in which the reviews appear on the website can yield the same sales conversion as a 1.6% price cut. By examining these results, business owners can measure the effect that reviews are having on their sales - likewise, consumers can become cognisant of the psychology that underpins their purchasing behaviours.
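A toy keyword version of that extraction step looks like this (the real paper uses a deep learning model; the cue words below are invented purely for illustration):

```python
DIMENSION_CUES = {  # hypothetical cue words per content dimension
    "aesthetics": {"beautiful", "sleek", "ugly"},
    "conformance": {"described", "advertised", "matches"},
    "durability": {"broke", "sturdy", "lasted"},
    "feature": {"feature", "function"},
    "brand": {"brand", "genuine"},
    "price": {"cheap", "expensive", "price"},
}

def tag_dimensions(review: str):
    """Return which of the six content dimensions a review mentions."""
    words = set(review.lower().split())
    return sorted(d for d, cues in DIMENSION_CUES.items() if words & cues)

dims = tag_dimensions("Sleek design but it broke after a week for the price")
```

A trained model replaces the hand-written cue lists with learned representations, and additionally scores each mentioned dimension as positive or negative, but the output structure is the same: per-review, per-dimension signals that can be regressed against conversion.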
Language data can provide us with useful insights that can be used for predictive applications across a wide range of problems. Using Natural Language Processing, we are able to convert qualitative text data into quantitative features that we can implement in our models, allowing us to tap into a large source of data that would otherwise be disregarded. So next time you are developing a data science solution, consider what text data you have available and perhaps try incorporating language data into your modelling pipeline.
The first two papers are the source of the sentiment analyser word list.
[1] Hu, Minqing, and Bing Liu. “Mining and Summarizing Customer Reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.
[2] Liu, Bing, Minqing Hu, and Junsheng Cheng. “Opinion Observer: Analyzing and Comparing Opinions on the Web.” Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
[3] Asur, Sitaram, and Bernardo A. Huberman. “Predicting the future with social media.” 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Vol. 1. IEEE, 2010.
[4] Liu, Xiao, Dokyun Lee, and Kannan Srinivasan. “Large-scale cross-category analysis of consumer review content on sales conversion leveraging deep learning.” Journal of Marketing Research 56.6 (2019): 918-943.
[5] Huang, Allen H., Hui Wang, and Yi Yang. “FinBERT: A large language model for extracting information from financial text.” Contemporary Accounting Research 40.2 (2023): 806-841.
[6] Gao, Chufan, et al. “Automatically Labeling Clinical Trial Outcomes: A Large-Scale Benchmark for Drug Development.” arXiv preprint arXiv:2406.10292 (2024).
[7] Brown, Alasdair, et al. “Forecasting with social media: Evidence from tweets on soccer matches.” Economic Inquiry 56.3 (2018): 1748-1763.