Fake news detection with natural language processing: the Relotius affair

· Thomas Wood
Can we detect what is fake news in 59 articles by Claas Relotius?

We used natural language processing to uncover the clues that pointed to a rogue journalist’s history of submitting fake news

Past editions of Der Spiegel, Germany’s most respected news magazine.

Is it possible to identify when somebody is not telling the truth? You may be aware of the subtle body language, tics and signals that give away a liar, but what about the written word?

How about fake news? If you are reading a news article by a famous reporter, how can you tell if it’s fake? Can natural language processing help?

I am going to tell you a little about Claas Relotius, who was once one of Germany’s most respected reporters, but was later exposed as a fraud and found to have fabricated dozens of articles over an eight-year period.

Then I will attempt some data science magic on Relotius' articles, to see what we can learn.

Background: the migrant caravan in the Sonora desert

In 2018, a caravan of several thousand Central American migrants was making its way from Honduras through the Sonora desert in Mexico and onwards to its final goal, the United States.

Juan Moreno, a 45-year-old freelance reporter, was travelling alongside the migrant caravan and gathering material for a feature piece for Der Spiegel, a prestigious German news magazine.

Moreno had been tasked with covering the caravan as they travelled through Mexico. He had spent several gruelling weeks in the desert and had already identified two young women who were willing to let him shadow them for a few days.

He was not happy to receive an email from the Spiegel editors saying that his young, successful colleague Claas Relotius would now be working on the article with him and would take editorial control over the final version.

Relotius had won more than 40 prizes in journalism and was widely regarded as a rising star in the field.

Relotius was to travel to Arizona and track down a militia, a group of volunteers who spend their time and money defending the US southern border from the perceived threat of illegal migration, while Moreno would stay in Mexico and continue to report on the migrants.

The suspicious article - was it fake news?

Juan Moreno, the freelancer who exposed Relotius (Wikipedia)

Claas Relotius, the German reporter notorious for his fake news articles (Wikipedia)

After the assignment was finished, Moreno flew back to Germany.

When Moreno received Relotius' drafts and the final article, titled Jaeger’s Border (German: Jaegers Grenze), he felt that something was not right. Relotius claimed to have spent a few days in the company of a militia called the Arizona Border Recon. The members of Arizona Border Recon were armed and went by colourful nicknames such as Jaeger, Spartan and Ghost. Relotius even claimed to have witnessed Jaeger shooting at an unidentified figure in the desert. In short, the militia were portrayed as a stereotypical band of hillbillies, and some of the details seemed hard to believe.

Moreno started digging into Jaeger’s Border and Relotius' articles. He spent his savings on his own private investigation. He travelled in Relotius' footsteps to Arizona and other locations. It quickly became clear that Relotius had been fabricating stories rather than interviewing the subjects he claimed to have interviewed.

Many of Relotius' articles relied on stereotypes and the stories seemed far-fetched and too good to be true. For me, the most absurd story centres on a brother and sister from Syria who were working in a Turkish sweatshop. Relotius invented a Syrian children’s song about two orphans who grow up to be king and queen of Syria. According to the article, every Syrian child “from Raqqah to Damascus” is familiar with this traditional song. But none of the Syrians that Moreno spoke to had ever heard of it.

Relotius exposed

After much persistence on Moreno’s part, the management at Der Spiegel reluctantly investigated Relotius' articles and concluded that he had indeed fabricated the majority of his articles during his eight-year tenure.

Relotius had invented interviews that never took place, and people who never existed. He even wrote an article about rising sea levels in the Pacific island nation of Kiribati without bothering to take his connecting flight to the country.

Der Spiegel issued a mass retraction of the affected fake news articles and the ‘Relotius Affair’ became a nationwide scandal, making news worldwide and prompting an intervention by the US ambassador to Germany, who objected to the “anti-American sentiment” of some of the articles.

The article Jaeger’s Border and Relotius' other texts can be downloaded as a PDF from Der Spiegel’s website. In total 59 articles are available for download, together with annotations by Der Spiegel indicating what content is genuine and what is pure invention or fake news.

There is a large amount of English language content available online on the Relotius scandal, including English translations of many of the articles.

Analysing Relotius' texts for fake news with NLP

I downloaded all 59 available Relotius articles and Der Spiegel’s annotations and tried a few data science experiments on them.

First of all I checked the truth/falsehood status of the articles. You can see that more than half are fictitious, although there are some articles where it was not possible for Der Spiegel to determine if the article was genuine or not. I excluded the latter from my analysis.

Of Relotius' 59 articles, 32 are definitely fake news, while the remainder are true or unclear. We will analyse the genuine and fake news with natural language processing.

The vast majority of Relotius' articles were written by him alone. Moreno later stated that it was quite unusual at Der Spiegel for a reporter to take on so many lone assignments, but Relotius was the star reporter at the publication and seemed to have acquired a certain privilege in this regard.

The fake news articles tended to be written by Relotius alone. 44 articles were under sole authorship.

Of course we know now it was easier for him to fabricate content when working alone.

There is something else interesting about the above graph. Relotius wrote only one article in a team of two. The other collaborative articles all involved larger teams of up to 14 authors.

The sole two-author article is Jaeger’s Border, the article which got Relotius caught out!

This shows that Relotius had a pattern of either writing articles alone, or in a large team. He managed to get away with this strategy for years until the Jaeger’s Border assignment. Perhaps when you are collaborating in a large group it is also easier to avoid scrutiny.

Word clouds for genuine vs fake news

I tried generating a word cloud of the genuine and fake news articles, to see if there is any discernible difference. A word cloud shows words in different font sizes according to how often they occur in a set of documents.
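As a rough illustration of the counting step behind a word cloud (not the code used for the figures in this article), the frequencies can be computed as below. The stop-word list and the two example sentences are hypothetical stand-ins, not text from the Relotius corpus.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real analysis
# would use a much fuller German stop-word list).
STOP_WORDS = {"der", "die", "das", "und", "in", "er", "sie", "es", "ein", "eine"}


def word_frequencies(documents, stop_words=STOP_WORDS):
    """Count how often each non-stop-word occurs across a list of texts."""
    counts = Counter()
    for text in documents:
        # \w+ matches runs of word characters, including German umlauts.
        tokens = re.findall(r"\w+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts


# Hypothetical sentences, not taken from the Relotius articles.
toy_documents = [
    "Er sagt, es sei zu spät.",
    "Sie sagt, das sei nicht wahr.",
]

frequencies = word_frequencies(toy_documents)
# The most frequent words would be drawn in the largest font.
print(frequencies.most_common(2))
```

A word cloud library would then map each count to a font size; only the counting is sketched here.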

Word cloud for the genuine news articles. The largest (most common) word is sagt (says).

Word cloud for the fake news articles.

Unfortunately there is not a huge difference between the two sets.

However I can see some patterns.

  • The genuine news articles make more use of sei and würde, subjunctive verb forms that are often used in reported speech. It appears that the fake news involved more description of direct action and less tentative reporting or reported speech.
  • The word deutschen (‘German’) is more common in the genuine news articles. In his book, Moreno explains that Relotius only faked articles that involved travel outside Germany, presumably because it would be harder to make up German news for a German audience and make it sound convincing.
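One way to put a number on the reported-speech observation is to count how often the marker words occur per 1,000 tokens in each set of articles. The sketch below assumes that sei and würde alone are a reasonable proxy for reported speech, which is a simplification, and the example sentences are hypothetical.

```python
import re

# Marker words named in the text; treating them as a proxy for
# reported speech is an assumption made for this sketch.
REPORTED_SPEECH_MARKERS = {"sei", "würde"}


def marker_rate(documents, markers=REPORTED_SPEECH_MARKERS):
    """Return the number of marker words per 1,000 tokens."""
    total_tokens = 0
    marker_hits = 0
    for text in documents:
        tokens = re.findall(r"\w+", text.lower())
        total_tokens += len(tokens)
        marker_hits += sum(1 for t in tokens if t in markers)
    if total_tokens == 0:
        return 0.0
    return 1000.0 * marker_hits / total_tokens


# Hypothetical sentences, not taken from the Relotius articles.
genuine_like = ["Er sagt, die Lage sei ernst.", "Sie sagt, er würde bald kommen."]
fake_like = ["Er rennt durch die Wüste.", "Sie schießt in die Nacht."]

print(round(marker_rate(genuine_like), 1))
print(round(marker_rate(fake_like), 1))
```

Normalising by token count matters because the two sets of articles differ in size, so raw counts alone would not be comparable.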

Finding the commonest words that distinguish fake news from genuine news

I then tried a more scientific approach. I used a tool called a Naive Bayes Classifier to find the words which most strongly indicate that an article is genuine or fictitious.

The classifier gives every word a negative score: the more negative the score, the more strongly the word indicates fake news, and the closer the score is to zero, the more strongly it indicates genuine news.
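To see where negative scores of this kind can come from, here is a minimal pure-Python sketch of the per-word quantity that a multinomial Naive Bayes model stores: the smoothed log-probability of each word given a class. Reading the scores in this article as log-probabilities is my assumption, and the toy documents, function names and smoothing value are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def word_log_probs(documents, vocabulary, alpha=1.0):
    """Smoothed log P(word | class) for every word in the vocabulary."""
    counts = Counter()
    for text in documents:
        counts.update(tokenize(text))
    # Add-alpha (Laplace) smoothing keeps unseen words from getting log(0).
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: math.log((counts[w] + alpha) / total) for w in vocabulary}


# Hypothetical toy documents, not taken from the Relotius articles.
genuine_docs = ["er sagt es sei spät", "sie sagt es sei gut"]
fake_docs = ["korrupte männer hackten", "sie hackten und hackten"]

vocabulary = sorted({w for d in genuine_docs + fake_docs for w in tokenize(d)})
genuine_scores = word_log_probs(genuine_docs, vocabulary)
fake_scores = word_log_probs(fake_docs, vocabulary)

# Log-likelihood ratio: positive favours "genuine", negative favours "fake".
ratio = {w: genuine_scores[w] - fake_scores[w] for w in vocabulary}

for word in sorted(vocabulary, key=ratio.get, reverse=True):
    print(f"{word:10s} {genuine_scores[word]:7.2f} {fake_scores[word]:7.2f} {ratio[word]:+7.2f}")
```

Because probabilities are below 1, every log-probability is negative. Words that never occur in a class all receive the same smoothed floor value, which would explain a run of identical scores in a ranked list.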

Here are the top 15 words that indicate that an article is genuine, with English translations and the scores from the Naive Bayes classifier:

German word   English translation     Score
sagt          says                    -8.37
sei           is (reported speech)    -8.64
mehr          more                    -8.69
immer         always                  -8.77
geht          goes                    -8.79
schon         already                 -8.84
deutschen     German                  -8.85
später        later                   -8.88
nie           never                   -8.89
sagte         said                    -8.89
seit          since                   -8.90
gibt          gives/there is          -8.92
bald          soon                    -8.92
kommen        come                    -8.92
gut           good                    -8.93

And here are 15 of the top words that indicate that an article is fictitious:

German word        English translation   Score
enthaupteten       beheaded              -9.29
verstümmelten      mutilated             -9.29
abgeladen          unloaded              -9.29
gegenwärtig        current               -9.29
abschrecken        scared off            -9.29
richteten          directed              -9.29
glieds             member                -9.29
öffentlichten      published             -9.29
umfangreiche       extensive             -9.29
preisgeben         divulge               -9.29
zurückgezogen      withdrawn             -9.29
hackten            hacked                -9.29
korrupte           corrupt               -9.29
bloggenden         blogging              -9.29
lebensbedrohlich   life threatening      -9.29

This is just a snapshot, but we can see some more patterns now. The fake news seems to be heavy in strong, emotive or very graphic language such as corrupt or mutilated. When I took the top 100 words, this effect was still noticeable.

I then tested whether the Naive Bayes classifier could predict if an unseen Relotius text was fake or genuine, but unfortunately its predictions were no more accurate than chance.

Conclusion: NLP can show features of fake news but it’s hard to detect it accurately

It is not possible to build a fake news detector given that we only have 59 articles to work from, but knowing in retrospect that Relotius falsified some texts, it is definitely possible to observe patterns and significant differences between his genuine and fake articles:

  • The fake news articles were written when Relotius was reporting as a lone wolf. Relotius was caught out the first time he was assigned to work in a team of two.
  • The fake news articles contain more emotive, graphic or strong language.
  • There is less reported speech and tentative language in the fake articles.
  • Caveat: it’s possible that some of the linguistic differences mentioned above are due to the genuine articles tending to be multi-author pieces.
  • The fake articles take place outside of Germany.

Perhaps knowing these effects it may be possible to flag suspicious texts in the future. If a reporter seems overly keen on working alone, travelling abroad, and seems to interview few subjects, but writes using colourful language that would be more appropriate in a novel, then perhaps something is amiss?

Epilogue: Relotius vs Moreno?

Naturally Relotius' prizes were revoked and returned one by one, and he resigned from his position at Der Spiegel.

Juan Moreno, the whistleblower who discovered Relotius' fraud, wrote a tell-all book about the Relotius Affair, titled A Thousand Lines of Lies (Tausend Zeilen Lüge). The book is a fascinating exposé of the world of print journalism in the digital age as well as a first hand account of how Relotius' system unravelled.

Ironically, in 2019 Relotius started legal proceedings against Moreno for alleged falsehoods in the book, which are ongoing at the time of writing.

Appendix: technical details on the Naive Bayes fake news classifier

In case you would like more details, I used a multinomial Naive Bayes classifier with tf-idf scores as features. I evaluated it using the ROC curve below:

This is a ROC curve showing the performance of my Naive Bayes classifier under cross validation for predicting unseen Relotius texts. A good classifier would have a line close to the top left hand corner. The fact that the line is on the diagonal shows that my predictions were no better than tossing a coin. That means that if Relotius were still at large today I would have no way of knowing if his latest article were fictitious or not.
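A ROC curve is often summarised by the area under it (AUC): a curve hugging the top left corner gives an AUC near 1.0, while the diagonal corresponds to 0.5. The sketch below computes the AUC directly from labels and classifier scores by comparing every fake/genuine pair. The labels and scores are hypothetical, not my actual cross-validation output.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one,
    with ties counted as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = fake, 0 = genuine) and classifier scores.
labels = [1, 1, 1, 0, 0, 0]
good_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # separates the classes perfectly
uninformative_scores = [0.5] * 6                # carries no information

print(roc_auc(labels, good_scores))            # 1.0
print(roc_auc(labels, uninformative_scores))   # 0.5
```

The pairwise loop is quadratic in the number of articles, which is fine for a corpus of a few dozen texts.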

References

Juan Moreno, Tausend Zeilen Lüge: Das System Relotius und der deutsche Journalismus (A thousand lines of lies: the Relotius system and what it means for German journalism) (2019). Tells the story of the Relotius scandal and the growing problem of fake news in journalism.

Claas Relotius, Bürgerwehr gegen Flüchtlinge: Jaegers Grenze (Militia against refugees: Jaeger’s Border), and all other Claas Relotius texts, Der Spiegel (2018).

Philip Oltermann, The inside story of Germany’s biggest scandal since the Hitler diaries, The Guardian (2019).

Ralf Wiegand, Claas Relotius geht gegen Moreno-Buch vor (Claas Relotius takes action against Moreno’s book), Sueddeutsche Zeitung (2019).

(Multinomial Naive Bayes) C.D. Manning, P. Raghavan and H. Schuetze, Introduction to Information Retrieval, pp. 234-265 (2008).
