Authorship analysis demo (forensic stylometry and linguistics)

Fast Data Science has developed a forensic stylometry model that identifies the author of a text from their distinctive stylistic and linguistic “fingerprint”. It needs a relatively long text, such as a few chapters of a book, to make an accurate identification.





For best results, try entering at least two or three paragraphs of text. You can get free writing samples from Project Gutenberg, or from the contemporary authors' websites (Pottermore, Dan Brown, Philip Pullman). JK Rowling wrote some books under the pseudonym Robert Galbraith, so try testing whether the model can correctly identify her as the author. If the text bears no resemblance to any of the known authors’ works, the model will output very small probabilities for every author.

Authors

What does the forensic stylometry model know?

The model was trained on the following books. So, if you give it a book such as Villette by Charlotte Brontë, which is not on the list, it will hopefully identify Brontë as the correct author despite never having seen the book.

  1. Jane Austen: Emma, Northanger Abbey, Persuasion, Pride And Prejudice
  2. Charlotte Brontë: Shirley, Jane Eyre
  3. Lewis Carroll: Alice in Wonderland
  4. Conan Doyle: The Adventures of Sherlock Holmes
  5. Dan Brown: Angels And Demons, Da Vinci Code
  6. Charles Dickens: David Copperfield, Oliver Twist, A Christmas Carol, Great Expectations, Bleak House
  7. JK Rowling: Goblet Of Fire, Philosophers Stone, Order Of The Phoenix, Cuckoos Calling (under pseudonym)
  8. Philip Pullman: Subtle Knife, Belle Sauvage
  9. Jonathan Swift: A Modest Proposal

Jane Austen (1775-1817) is most famous for Pride and Prejudice but in fact wrote six novels. The forensic stylometry model has been trained on the four listed above. Can it recognise the others?

Getting started with forensic stylometry

I have written a tutorial on using forensic stylometry with the Python library faststylometry here: Fast Stylometry Tutorial.

There is an excellent introduction to various stylometry techniques here: Introduction to Stylometry with Python.

I also recommend reading the paper Stylometry with R: A Package for Computational Text Analysis by Maciej Eder, Jan Rybicki and Mike Kestemont, which provides an overview of how to conduct scientifically valid stylometric analyses of texts using stylo, a graphical package for R.

Using stylo and other scientific packages, researchers in linguistics and the humanities can analyse text in a variety of languages for cases of contested historical authorship. For instance, I had a query about a set of Latin texts where academics disagree on which historian is the true author! The technique is also used by forensic linguists for legal cases.

Note about probabilities and the Softmax layer

The probabilities output by this stylometry model are derived from a probability calibration process, which converts the Burrows Delta statistic (a measure of stylistic distance between texts) to a probability value based on the distribution of Burrows Delta values in the training data. This means that if the text you input is unlike anything by any of the known authors, all authors will be assigned a probability close to zero. Equally, the probabilities may sum to a number greater than 1, because each probability is calculated independently.
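To make the calibration idea concrete, here is a minimal sketch in plain Python. It is not the code behind this demo. The Delta calculation follows the usual textbook definition (mean absolute difference of z-scored word frequencies), while the function names, the toy texts and the logistic curve with its midpoint and steepness values are all assumptions chosen for illustration. In practice the calibration curve would be fitted to Delta values observed in the training data.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(text, vocab):
    """Relative frequency of each vocabulary word in a text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def burrows_delta(author_texts, test_text, vocab_size=5):
    """Return {author: delta}; a smaller delta means a closer stylistic match."""
    # Vocabulary: the most frequent words across all training texts.
    all_tokens = " ".join(author_texts.values()).lower().split()
    vocab = [w for w, _ in Counter(all_tokens).most_common(vocab_size)]

    profiles = {a: relative_freqs(t, vocab) for a, t in author_texts.items()}

    # Mean and standard deviation of each word's frequency across authors.
    columns = list(zip(*profiles.values()))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1e-9 for c in columns]  # guard against zero spread

    def z_scores(freqs):
        return [(f - m) / s for f, m, s in zip(freqs, mus, sigmas)]

    test_z = z_scores(relative_freqs(test_text, vocab))
    return {
        a: mean(abs(x - y) for x, y in zip(z_scores(p), test_z))
        for a, p in profiles.items()
    }


def delta_to_probability(delta, midpoint=1.0, steepness=4.0):
    """Hypothetical calibration curve: each delta is mapped on its own."""
    return 1.0 / (1.0 + math.exp(steepness * (delta - midpoint)))


# Toy data, invented for illustration only.
authors = {
    "A": "the cat and the dog and the bird sat on the mat",
    "B": "a cat sat on a mat while a dog ran to a bird",
}
deltas = burrows_delta(authors, "the cat and the dog and the bird sat on the mat")
probabilities = {a: delta_to_probability(d) for a, d in deltas.items()}

# Independent calibration: two close matches can both score highly,
# and two poor matches can both score near zero.
both_close = [delta_to_probability(d) for d in (0.2, 0.3)]
both_far = [delta_to_probability(d) for d in (3.0, 4.0)]
```

Because each Delta value is converted separately, nothing forces the results to add up to 1: the two values in both_close sum to more than 1, while the two in both_far are both close to zero.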

A lot of machine learning models you can try (for example, the iPhone apps you can download which tell you what breed your dog is, or how old your face looks) tend to output unreasonable probabilities, because they rely on a Softmax layer, which forces all output scores to sum to 1. A Softmax layer tends to spit out probabilities which are either very close to 0 or very close to 1.
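The behaviour described above can be seen in a few lines of Python. This is a generic sketch of the Softmax function, not code taken from any particular app, and the raw scores are made up for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores to values that are positive and sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for an input unlike anything the model knows:
# every score is low, yet Softmax must still share out a total of 1.
unfamiliar = softmax([-8.0, -8.5, -9.0])

# Hypothetical raw scores with one clear winner: the output saturates,
# with one value very close to 1 and the rest very close to 0.
confident = softmax([12.0, 2.0, 1.0])
```

In the first case the largest output is still around one half even though none of the raw scores indicated a good match, which is exactly the kind of value that is easy to misread as a probability.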

I chose to use a probability calibration technique for this demo to avoid the odd effects associated with (mis)interpreting a Softmax output as a probability.

What about generative models?

We can estimate the probability that a text was authored by a generative AI model such as Bard or ChatGPT by calculating the model’s perplexity on the text (how surprised the model is by the document). Text that a model finds unsurprising, that is, text with low perplexity, is more likely to have been generated by it. Many disputes have recently arisen in higher education, where a student may be accused of using AI to write an assignment.
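As a rough illustration of what perplexity measures, the sketch below uses a tiny word-count model with add-one smoothing as a stand-in for a real generative model. That stand-in, the function names and the example sentences are all assumptions. A real detector would use the token probabilities of a large language model, but the formula is the same: the exponential of the average negative log probability per token.

```python
import math
from collections import Counter


def train_unigram(text):
    """Count word occurrences; a toy stand-in for a real language model."""
    counts = Counter(text.lower().split())
    return counts, sum(counts.values())


def perplexity(text, counts, total):
    """exp of the mean negative log probability per token."""
    tokens = text.lower().split()
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size))  # add-one smoothing
        for t in tokens
    )
    return math.exp(-log_prob / len(tokens))


# Toy training text, invented for illustration only.
counts, total = train_unigram("the cat sat on the mat the dog sat on the rug")

familiar = perplexity("the cat sat on the mat", counts, total)
surprising = perplexity("quantum flux destabilises reactor", counts, total)
```

The familiar sentence gets a lower perplexity than the surprising one, because the model has seen its words before. Turning a perplexity value into a probability of AI authorship needs a further calibration step, which is not shown here.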

Try our generative AI detector.
