We used natural language processing to uncover the clues that pointed to a rogue journalist’s history of fabricating stories
Is it possible to identify when somebody is not telling the truth? You may be aware of the subtle body language, tics and signals that give away a liar, but what about the written word?
How about fake news? If you are reading a news article by a famous reporter, how can you tell if it’s fake?
I am going to tell you a little about a reporter called Claas Relotius, who was once one of Germany’s most respected reporters, and was later exposed as a fraud and was found to have fabricated hundreds of articles over an eight year period.
Then I will attempt some data science magic on Relotius’ articles, to see what we can learn.
Background: the migrant caravan in the Sonora desert
In 2018, a caravan of several thousand central American migrants were making their way from Honduras through the Sonora desert in Mexico and onwards to the final goal of the United States.
Juan Moreno, a 45 year old freelance reporter, was travelling alongside the migrant caravan and gathering some material for a feature piece for Der Spiegel, a prestigious German news magazine.
Moreno had been tasked with covering the caravan as they travelled through Mexico. He had spent several gruelling weeks in the desert and had already identified two young women who were willing to let him shadow them for a few days.
He was not happy to receive an email from the Spiegel editors saying that his young, successful colleague Claas Relotius would now be working on the article with him and would take editorial control over the final version.
Relotius had won more than 40 prizes in journalism and was widely regarded as a rising star in the field.
Relotius was to travel to Arizona and track down a militia, a group of volunteers who spend their time and money defending the US southern border from the perceived threat of illegal migration, while Moreno would stay in Mexico and continue to report on the migrants.
The suspicious article
After the assignment was finished, Moreno flew back to Germany.
When Moreno received Relotius’ drafts and final article, titled
Jaeger’s border (German: Jaegers Grenze), he felt that something just didn’t feel right. Relotius claimed to have spent a few days in the company of a militia called the Arizona Border Recon. The members of Arizona Border Recon were armed and went by colourful nicknames such as Jaeger, Spartan and Ghost. Relotius even claimed to have witnessed Jaeger shooting at an unidentified figure in the desert. In short, the militia were portrayed as a stereotypical band of hillbillies, and some of the details seemed hard to believe.
Moreno started digging into Jaeger’s Border and Relotius’ articles. He spent his savings on his own private investigation. He travelled in Relotius’ footsteps to Arizona and other locations. It quickly became clear that Relotius had been fabricating stories rather than interviewing the subjects he claimed to have interviewed.
Many of Relotius’ articles relied on stereotypes and the stories seemed far-fetched and too good to be true. For me, the most absurd story centres on a brother and sister from Syria who were working in a Turkish sweatshop. Relotius invented a Syrian children’s song about two orphans who grow up to be king and queen of Syria. According to the article, every Syrian child “from Raqqah to Damascus” is familiar with this traditional song. But none of the Syrians that Moreno spoke to had ever heard of it.
After much persistence on behalf of Moreno, the management at Der Spiegel reluctantly investigated Relotius’ articles, and concluded that he had indeed fabricated the majority of his articles during his 8 year tenure.
Relotius had invented interviews that never took place, and people who never existed. He even wrote an article about rising sea levels in the Pacific island of Kiribati without bothering to take his connecting flight to the country.
Der Spiegel issued a mass retraction of the affected fake news articles and the ‘Relotius Affair’ became a nationwide scandal, making news worldwide and prompting an intervention by the US ambassador to Germany who objected to the “anti-American sentiment” of some of the articles.
The article Jaeger’s Border and Relotius’ other texts can be downloaded as a PDF from Der Spiegel’s website. In total 59 articles are available for download, together with annotations by Der Spiegel indicating what content is genuine and what is pure invention or fake news.
There is a large amount of English language content available online on the Relotius scandal, including English translations of many of the articles.
Analysing Relotius’ texts
I downloaded all 59 available Relotius articles and Der Spiegel’s annotations and tried a few data science experiments on them.
First of all I checked the truth/falsehood status of the articles. You can see that more than half are fictitious, although there are some articles where it was not possible for Der Spiegel to determine if the article was genuine or not. I excluded the latter from my analysis.
The vast majority of Relotius’ articles were written by him alone. Moreno later stated that this was quite unusual at Der Spiegel for a reporter to take on so many lone assignments, but Relotius was the star reporter at the publication and seemed to have acquired a certain privilege in this regard.
Of course we know now it was easier for him to fabricate content when working alone.
There is something else interesting about the above graph. Relotius wrote only one article in a team of two. The other collaborative articles all involved larger teams of up to 14 authors.
The sole two-author article is Jaeger’s Border, the article which got Relotius caught out!
This shows that Relotius had a pattern of either writing articles alone, or in a large team. He managed to get away with this strategy for years until the Jaeger’s Border assignment. Perhaps when you are collaborating in a large group it is also easier to avoid scrutiny.
I tried generating a word cloud of the genuine and fake news articles, to see if there is any discernible difference. A word cloud shows words in different font sizes according to how often they occur in a set of documents.
Unfortunately there is not a huge difference between the two sets.
However I can see some patterns.
- There is more use of sei, würde in the genuine news articles, which are special verb forms that are used often in reported speech. It appears that the fake news involved more description of direct action and less tentative reporting or reported speech.
- The word deutschen (‘German’) is more common in the genuine news articles. In Moreno’s book he explained that Relotius only faked his articles that involved travel outside Germany, as it would presumably be harder to make up fake German news for a German audience and make it sound convincing.
Finding the commonest words that distinguish both groups
I then tried a more scientific approach. I used a tool called a Naive Bayes Classifier to find the words which most strongly indicate that an article is genuine or fictitious.
The Naive Bayes approach assigns a large negative number to words that strongly indicate fake news and a smaller negative number to words that indicate genuine news.
Here are the top 15 words that indicate that an article is genuine, with English translations and the scores from the Naive Bayes classifier:
|sei||is (reported speech)||-8.64|
and here some of the top 15 words that indicate that an article is fictitious:
This is just a snapshot but we can see some more patterns now. The fake news seems to be quite heavy in strong, emotive or very graphic language such as corrupt or mutilated. When I took the top 100 words this effect is still noticeable.
I then tested to see if it was possible to use the Naive Bayes Classifier to predict if an unseen Relotius text was fake or genuine, but unfortunately this was not possible to any degree of accuracy.
It is not possible to build a fake news detector given that we only have 59 articles to work from, but knowing in retrospect that Relotius falsified some texts, it is definitely possible to observe patterns and significant differences between his genuine and fake articles:
- The fake news articles were written when Relotius was reporting as a lone wolf. Relotius was caught out the first time he was assigned to work in a team of two.
- The fake news articles contain more emotive, graphic or strong language.
- There is less reported speech and tentative language in the fake articles.
- Caveat: it’s possible that some of the linguistic differences mentioned above are due to the genuine articles tending to be multi-author pieces.
- The fake articles take place outside of Germany.
Perhaps knowing these effects it may be possible to flag suspicious texts in the future. If a reporter seems overly keen on working alone, travelling abroad, and seems to interview few subjects, but writes using colourful language that would be more appropriate in a novel, then perhaps something is amiss?
Epilogue: Relotius vs Moreno?
Naturally Relotius’ prizes were revoked and returned one by one, and he resigned from his position at Der Spiegel.
Juan Moreno, the whistleblower who discovered Relotius’ fraud, wrote a tell-all book about the Relotius Affair, titled A Thousand Lines of Lies (Tausend Zeilen Lüge). The book is a fascinating exposé of the world of print journalism in the digital age as well as a first hand account of how Relotius’ system unravelled.
Ironically in 2019 Relotius started legal proceedings against Moreno for alleged falsehoods in the book, which are ongoing at the time of writing.
Appendix: technical details on the Naive Bayes classifier
In case you would like more details, I used a multinomial Naive Bayes classifier with tf*idf scores. I evaluated it below using a ROC curve:
Juan Moreno, Tausend Zeilen Lüge: Das System Relotius und der deutsche Journalismus (A thousand lines of lies: the Relotius system and what it means for German journalism) (2019).
Claas Relotius, Bürgerwehr gegen Flüchtlinge: Jaegers Grenze (Militia against refugees: Jaeger’s Border), and all other Claas Relotius texts, Der Spiegel (2018).
Philip Oltermann, The inside story of Germany’s biggest scandal since the Hitler diaries, The Guardian (2019).
Ralf Wiegand, Claas Relotius geht gegen Moreno-Buch vor (Claas Relotius takes action against Moreno’s book), Sueddeutsche Zeitung (2019).
(Multinomial Naive Bayes) C.D. Manning, P. Raghavan and H. Schuetze, Introduction to Information Retrieval, pp. 234-265 (2008).