We used natural language processing to uncover the clues that pointed to a rogue journalist’s history of submitting fake news
Past editions of Der Spiegel, Germany’s most respected news magazine.
Is it possible to identify when somebody is not telling the truth? You may be aware of the subtle body language, tics and signals that give away a liar, but what about the written word?
How about fake news? If you are reading a news article by a famous reporter, how can you tell if it’s fake? Can natural language processing help?
I am going to tell you a little about Claas Relotius, who was once one of Germany’s most respected reporters. He was later exposed as a fraud, and was found to have fabricated hundreds of articles over an eight-year period.
Then I will attempt some data science magic on Relotius' articles, to see what we can learn.
In 2018, a caravan of several thousand Central American migrants was making its way from Honduras through the Sonoran Desert in Mexico and onwards to its final goal, the United States.
Juan Moreno, a 45-year-old freelance reporter, was travelling alongside the migrant caravan and gathering material for a feature piece for Der Spiegel, a prestigious German news magazine.
Moreno had been tasked with covering the caravan as it travelled through Mexico. He had spent several gruelling weeks in the desert and had already identified two young women who were willing to let him shadow them for a few days.
He was not happy to receive an email from the Spiegel editors saying that his young, successful colleague Claas Relotius would now be working on the article with him and would take editorial control over the final version.
Relotius had won more than 40 prizes in journalism and was widely regarded as a rising star in the field.
Relotius was to travel to Arizona and track down a militia, a group of volunteers who spend their time and money defending the US southern border from the perceived threat of illegal migration, while Moreno would stay in Mexico and continue to report on the migrants.
After the assignment was finished, Moreno flew back to Germany.
When Moreno received Relotius' drafts and final article, titled Jaeger’s Border (German: Jaegers Grenze), he felt that something was not right. Relotius claimed to have spent a few days in the company of a militia called the Arizona Border Recon. The members of Arizona Border Recon were armed and went by colourful nicknames such as Jaeger, Spartan and Ghost. Relotius even claimed to have witnessed Jaeger shooting at an unidentified figure in the desert. In short, the militia were portrayed as a stereotypical band of hillbillies, and some of the details seemed hard to believe.
Moreno started digging into Jaeger’s Border and Relotius' articles. He spent his savings on his own private investigation. He travelled in Relotius' footsteps to Arizona and other locations. It quickly became clear that Relotius had been fabricating stories rather than interviewing the subjects he claimed to have interviewed.
Many of Relotius' articles relied on stereotypes and the stories seemed far-fetched and too good to be true. For me, the most absurd story centres on a brother and sister from Syria who were working in a Turkish sweatshop. Relotius invented a Syrian children’s song about two orphans who grow up to be king and queen of Syria. According to the article, every Syrian child “from Raqqah to Damascus” is familiar with this traditional song. But none of the Syrians that Moreno spoke to had ever heard of it.
After much persistence on Moreno's part, the management at Der Spiegel reluctantly investigated Relotius' articles, and concluded that he had indeed fabricated the majority of his articles during his eight-year tenure.
Relotius had invented interviews that never took place, and people who never existed. He even wrote an article about rising sea levels on the Pacific island nation of Kiribati without bothering to take his connecting flight to the country.
Der Spiegel issued a mass retraction of the affected fake news articles and the ‘Relotius Affair’ became a nationwide scandal. It made news worldwide and prompted an intervention by the US ambassador to Germany, who objected to the “anti-American sentiment” of some of the articles.
The article Jaeger’s Border and Relotius' other texts can be downloaded as a PDF from Der Spiegel’s website. In total 59 articles are available for download, together with annotations by Der Spiegel indicating what content is genuine and what is pure invention or fake news.
There is a large amount of English language content available online on the Relotius scandal, including English translations of many of the articles.
I downloaded all 59 available Relotius articles and Der Spiegel’s annotations and tried a few data science experiments on them.
First of all I checked the truth/falsehood status of the articles. You can see that more than half are fictitious, although there are some articles where it was not possible for Der Spiegel to determine if the article was genuine or not. I excluded the latter from my analysis.
Of Relotius' 59 articles, 32 are definitely fake news, while the remainder are true or unclear. We will analyse the genuine and fake news with natural language processing.
The vast majority of Relotius' articles were written by him alone. Moreno later stated that it was quite unusual at Der Spiegel for a reporter to take on so many lone assignments, but Relotius was the star reporter at the publication and seemed to have acquired a certain privilege in this regard.
The fake news articles tended to be written by Relotius alone. 44 articles were under sole authorship.
Of course we know now it was easier for him to fabricate content when working alone.
There is something else interesting about the above graph. Relotius wrote only one article in a team of two. The other collaborative articles all involved larger teams of up to 14 authors.
The sole two-author article is Jaeger’s Border, the article which got Relotius caught out!
This shows that Relotius had a pattern of either writing articles alone, or in a large team. He managed to get away with this strategy for years until the Jaeger’s Border assignment. Perhaps when you are collaborating in a large group it is also easier to avoid scrutiny.
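The authorship comparison above boils down to a cross-tabulation of team size against the truth status of each article. Here is a minimal sketch of that tally using only the Python standard library. The records, the `team_bucket` helper and the three bucket names are my own assumptions for illustration; the rows are invented placeholders and are not Der Spiegel's data.

```python
from collections import Counter

# Hypothetical records: (article_id, number_of_authors, label).
# These rows are invented for illustration and are NOT Der Spiegel's data.
articles = [
    ("a1", 1, "fake"),
    ("a2", 1, "fake"),
    ("a3", 1, "genuine"),
    ("a4", 2, "fake"),
    ("a5", 14, "genuine"),
    ("a6", 6, "unclear"),
]


def team_bucket(n_authors):
    """Group author counts into solo, pair and large-team articles."""
    if n_authors == 1:
        return "solo"
    if n_authors == 2:
        return "pair"
    return "large team"


# Drop articles whose status could not be determined, then count each
# (team size, label) combination.
counts = Counter(
    (team_bucket(n), label)
    for _, n, label in articles
    if label != "unclear"
)

for (bucket, label), n in sorted(counts.items()):
    print(f"{bucket:10s} {label:8s} {n}")
```

With the real annotations in place of the placeholder rows, the same tally would give the numbers behind the authorship chart.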
I tried generating a word cloud of the genuine and fake news articles, to see if there is any discernible difference. A word cloud shows words in different font sizes according to how often they occur in a set of documents.
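The numbers behind a word cloud are simply word frequencies. As a rough sketch of that counting step (not the plotting itself), the snippet below uses only the Python standard library. The function name `word_frequencies`, the tiny stopword set and the two toy sentences are assumptions made for illustration, not the code or data used for the figures.

```python
import re
from collections import Counter


def word_frequencies(documents, stopwords=frozenset()):
    """Count how often each lowercase word occurs across a list of texts."""
    counts = Counter()
    for text in documents:
        # \w+ matches runs of letters and digits, including German umlauts.
        for word in re.findall(r"\w+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts


# Toy sentences, invented for illustration (not taken from the articles).
docs = ["Er sagt, es sei kalt.", "Sie sagt nichts."]
freqs = word_frequencies(docs, stopwords={"er", "sie", "es"})

# The most frequent words would be drawn in the largest font.
print(freqs.most_common(3))
```

A word cloud library would then scale each word's font size by its count.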
Word cloud for the genuine news articles. The largest (most common) word is sagt (says).
Word cloud for the fake news articles.
Unfortunately there is not a huge difference between the two sets.
However I can see some patterns.
I then tried a more scientific approach. I used a tool called a Naive Bayes Classifier to find the words which most strongly indicate that an article is genuine or fictitious.
The Naive Bayes approach assigns a large negative number to words that strongly indicate fake news and a smaller negative number to words that indicate genuine news.
Here are the top 15 words that indicate that an article is genuine, with English translations and the scores from the Naive Bayes classifier:
| Word | English translation | Score |
|---|---|---|
| sei | is (reported speech) | -8.64 |
And here are some of the top 15 words that indicate that an article is fictitious:
This is just a snapshot but we can see some more patterns now. The fake news seems to be quite heavy in strong, emotive or very graphic language such as corrupt or mutilated. When I took the top 100 words this effect was still noticeable.
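To give a feel for where such word scores come from, here is a minimal sketch of the per-word log probabilities that a multinomial Naive Bayes model estimates, written with the standard library only. It assumes the scores are log probabilities of a word given a class (which is why they are negative), uses Laplace smoothing with `alpha=1.0`, and ranks words by the difference between the two classes. The function name `word_log_probs` and the toy token lists are invented for illustration, so the output will not match the real scores.

```python
import math
from collections import Counter


def word_log_probs(docs_by_class, alpha=1.0):
    """Return {class: {word: log P(word | class)}} with Laplace smoothing."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    result = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        # Smoothing adds alpha to every word so unseen words keep a
        # non-zero probability.
        total = sum(counts.values()) + alpha * len(vocab)
        result[label] = {
            w: math.log((counts[w] + alpha) / total) for w in vocab
        }
    return result


# Toy tokenised documents, invented for illustration.
docs_by_class = {
    "genuine": [["sagt", "sei", "studie"], ["sagt", "sei"]],
    "fake": [["korrupt", "sagt"], ["korrupt", "blut"]],
}
log_probs = word_log_probs(docs_by_class)

# Rank words by how much more likely they are under "fake" than "genuine".
ratio = {
    w: log_probs["fake"][w] - log_probs["genuine"][w]
    for w in log_probs["fake"]
}
for word in sorted(ratio, key=ratio.get, reverse=True):
    print(f"{word:8s} {log_probs['fake'][word]:6.2f} {ratio[word]:+.2f}")
```

Every log probability is below zero because probabilities are below one. Ranking by the difference between classes, rather than by the raw score, avoids rewarding words that are simply common everywhere.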
I then tested to see if it was possible to use the Naive Bayes Classifier to predict if an unseen Relotius text was fake or genuine, but unfortunately this was not possible to any degree of accuracy.
It is not possible to build a fake news detector given that we only have 59 articles to work from, but knowing in retrospect that Relotius falsified some texts, it is definitely possible to observe patterns and significant differences between his genuine and fake articles:
Perhaps knowing these effects it may be possible to flag suspicious texts in the future. If a reporter seems overly keen on working alone, travelling abroad, and seems to interview few subjects, but writes using colourful language that would be more appropriate in a novel, then perhaps something is amiss?
Naturally Relotius' prizes were revoked and returned one by one, and he resigned from his position at Der Spiegel.
Juan Moreno, the whistleblower who discovered Relotius' fraud, wrote a tell-all book about the Relotius Affair, titled A Thousand Lines of Lies (Tausend Zeilen Lüge). The book is a fascinating exposé of the world of print journalism in the digital age as well as a first hand account of how Relotius' system unravelled.
Ironically in 2019 Relotius started legal proceedings against Moreno for alleged falsehoods in the book, which are ongoing at the time of writing.
In case you would like more details, I used a multinomial Naive Bayes classifier with tf-idf scores. I evaluated it below using a ROC curve:
This is a ROC curve showing the performance of my Naive Bayes classifier under cross validation for predicting unseen Relotius texts. A good classifier would have a line close to the top left hand corner. The fact that the line is on the diagonal shows that my predictions were no better than tossing a coin. That means that if Relotius were still at large today I would have no way of knowing if his latest article were fictitious or not.
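A ROC curve is often summarised by the area under it (AUC): 1.0 for a perfect classifier and 0.5 for a line on the diagonal. The sketch below computes the AUC from first principles, as the probability that a randomly chosen fake article receives a higher score than a randomly chosen genuine one. The function name `roc_auc` and the scores are made up for illustration; they are not my classifier's actual output.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count the positive/negative pairs that are ranked correctly.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 1 = fake, 0 = genuine. Scores are invented for illustration.
labels = [1, 1, 1, 0, 0, 0]
good_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
uninformative_scores = [0.5] * 6

print(roc_auc(labels, good_scores))           # every fake outranks every genuine
print(roc_auc(labels, uninformative_scores))  # identical scores carry no signal
```

A curve hugging the diagonal corresponds to the second case: the scores tell us nothing about which articles are fake.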
Juan Moreno, Tausend Zeilen Lüge: Das System Relotius und der deutsche Journalismus (A thousand lines of lies: the Relotius system and what it means for German journalism) (2019). Tells the story of the Relotius scandal and the growing problem of fake news in journalism.
Claas Relotius, Bürgerwehr gegen Flüchtlinge: Jaegers Grenze (Militia against refugees: Jaeger’s Border), and all other Claas Relotius texts, Der Spiegel (2018).
Philip Oltermann, The inside story of Germany’s biggest scandal since the Hitler diaries, The Guardian (2019).
Ralf Wiegand, Claas Relotius geht gegen Moreno-Buch vor (Claas Relotius takes action against Moreno’s book), Sueddeutsche Zeitung (2019).
C.D. Manning, P. Raghavan and H. Schuetze, Introduction to Information Retrieval, pp. 234-265 (2008). Covers multinomial Naive Bayes.