Fast Data Science has developed a forensic stylometry model which allows you to identify the author of a text by their unique stylistic and linguistic “fingerprint”. It needs a relatively long text, such as a few chapters of a book, to make an accurate identification.

[enter a text and the identified author will be shown here]

The model predicts the most likely author of a passage of text from a set of contemporary and classical authors, including JK Rowling, Philip Pullman and Dan Brown. It has been trained on a small number of books by each author, and it works using the Burrows Delta algorithm with the Python library faststylometry.
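The core idea of Burrows' Delta is to z-score the frequencies of the most common words and compare a test text against each candidate author's profile; the candidate with the smallest mean absolute difference (the smallest "delta") is the best stylistic match. Here is a minimal from-scratch sketch of that statistic in plain Python; it illustrates the algorithm only and is not the faststylometry API:

```python
from collections import Counter

def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]

def burrows_delta(train_texts, test_tokens, top_n=30):
    """train_texts: dict mapping author -> list of tokens.
    Returns dict mapping author -> Burrows' Delta (lower = more similar)."""
    # 1. Take the top_n most frequent words across the whole training corpus.
    all_tokens = [t for toks in train_texts.values() for t in toks]
    vocab = [w for w, _ in Counter(all_tokens).most_common(top_n)]
    # 2. Relative frequency of each vocabulary word, per author.
    author_freqs = {a: relative_freqs(toks, vocab)
                    for a, toks in train_texts.items()}
    # 3. Mean and standard deviation of each word's frequency across authors,
    #    so frequencies can be converted to z-scores.
    means, stds = [], []
    for i in range(len(vocab)):
        col = [author_freqs[a][i] for a in train_texts]
        m = sum(col) / len(col)
        sd = (sum((x - m) ** 2 for x in col) / len(col)) ** 0.5 or 1e-9
        means.append(m)
        stds.append(sd)
    def z(freqs):
        return [(f - m) / s for f, m, s in zip(freqs, means, stds)]
    test_z = z(relative_freqs(test_tokens, vocab))
    # 4. Delta = mean absolute difference of z-scores against each author.
    return {a: sum(abs(t - c) for t, c in zip(test_z, z(author_freqs[a])))
               / len(vocab)
            for a in train_texts}
```

The author with the lowest delta is the predicted match; the real library also handles details such as tokenisation before this step.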

For best results, try entering at least two or three paragraphs of text. You can get some free writing samples by visiting Project Gutenberg, or the contemporary authors’ websites (Pottermore, Dan Brown, Philip Pullman). JK Rowling wrote some books under the pseudonym Robert Galbraith; try testing whether the model can still identify her correctly. If the text has no resemblance to any of the known authors’ works, the model will output some very small probabilities.


What does the forensic stylometry model know?

The model was trained on the following books. So, if you give it a book that is not on the list, such as Villette by Charlotte Brontë, it should still identify Brontë as the correct author, despite never having seen that book.

Jane Austen: Emma, Northanger Abbey, Persuasion, Pride And Prejudice
Charlotte Brontë: Shirley, Jane Eyre
Lewis Carroll: Alice in Wonderland
Conan Doyle: The Adventures of Sherlock Holmes
Dan Brown: Angels And Demons, Da Vinci Code
Charles Dickens: David Copperfield, Oliver Twist, A Christmas Carol, Great Expectations, Bleak House
JK Rowling: Goblet Of Fire, Philosopher's Stone, Order Of The Phoenix, Cuckoo's Calling (under pseudonym)
Philip Pullman: Subtle Knife, Belle Sauvage
Jonathan Swift: A Modest Proposal

Jane Austen (Forensic Stylometry Authorship Analysis)

Jane Austen (1775-1817) is most famous for Pride and Prejudice but in fact wrote six novels. The forensic stylometry model has been trained on four of them. Can it recognise the others?

Getting started with forensic stylometry

I have written a tutorial on using forensic stylometry with the Python library faststylometry here: Fast Stylometry Tutorial.

There is an excellent introduction to various stylometry techniques here: Introduction to Stylometry with Python.

I also recommend reading the paper Stylometry with R: A Package for Computational Text Analysis by Maciej Eder, Jan Rybicki and Mike Kestemont, which provides an overview of how to conduct scientifically valid stylometric analyses of texts using stylo, a graphical package for R.

Using stylo and other scientific packages, researchers in linguistics and the humanities can analyse text in a variety of languages for cases of contested historical authorship. For instance, I had a query about a set of Latin texts where academics disagree on which historian is the true author! The technique is also used by forensic linguists for legal cases.

Note about probabilities and the Softmax layer

The probabilities output by this stylometry model are derived from a probability calibration process, which converts the Burrows Delta statistic into a probability based on the distribution of Burrows Delta values in the training data. This means that if the text you input is unlike anything by any of the known authors, every author will be assigned a probability close to zero. Equally, the probabilities may sum to more than 1, because each author's probability is calculated independently.
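One simple way to implement such a calibration (an illustrative sketch, not necessarily the exact method faststylometry uses) is to fit a one-dimensional logistic curve mapping delta values to the probability that a pair of texts in the training data really shared an author:

```python
import math

def fit_calibration(deltas, labels, lr=0.5, epochs=2000):
    """Fit P(same author) = sigmoid(a * delta + b) by gradient descent.
    deltas: Burrows Delta values observed on the training data;
    labels: 1 if the pair really was the same author, else 0.
    Returns a function mapping a delta value to a calibrated probability."""
    a, b = 0.0, 0.0
    n = len(deltas)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for d, y in zip(deltas, labels):
            p = 1.0 / (1.0 + math.exp(-(a * d + b)))
            grad_a += (p - y) * d / n   # gradient of log-loss w.r.t. a
            grad_b += (p - y) / n       # gradient of log-loss w.r.t. b
        a -= lr * grad_a
        b -= lr * grad_b
    return lambda d: 1.0 / (1.0 + math.exp(-(a * d + b)))
```

Because each candidate author's delta is passed through this curve independently, the resulting probabilities are free to sum to more or less than 1 — unlike a Softmax output.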

A lot of the machine learning models you can try (for example, the phone apps which tell you what breed your dog is, or how old your face looks) tend to output unreasonably confident probabilities, because they rely on a Softmax layer, which forces all output scores to sum to 1. A Softmax layer tends to produce probabilities which are either very close to 0 or very close to 1.
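To see why, here is the standard Softmax function in plain Python (a generic illustration, not code from this model). Because of the exponential, even moderately spread-out scores give one class almost all of the probability mass, and the outputs always sum to 1 no matter how poorly the input matches any class:

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw network scores ("logits") are often far apart, so after Softmax
# one class absorbs nearly all of the probability mass, and the outputs
# sum to 1 even for an input that resembles none of the classes.
probs = softmax([8.0, 1.0, 0.5])
```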

I chose to use a probability calibration technique for this demo to avoid the odd effects associated with (mis)interpreting a Softmax output as a probability.