1. Introduction
Recognizing fake news is centered on the automatic detection of intentionally misleading false news stories (Conroy, Rubin, and Chen Reference Conroy, Rubin and Chen2015; Allcott and Gentzkow Reference Allcott and Gentzkow2017). The problem has grown in importance because of changes in the dissemination of information. Traditional news publishers no longer control the distribution of news; information circulates among Internet users at a fast pace owing to the rise of social networks. Once communicated via social media, inaccurate, distorted, or false information is amplified and has a tremendous potential for large-scale, real-world consequences (Vosoughi, Roy, and Aral Reference Vosoughi, Roy and Aral2018).
Research on fake news detection has been especially lively since the 2016 US presidential campaign. This was when the New York Times defined fake news as “a made-up story with an intention to deceive”.Footnote a Some researchers and journalists hypothesized that fake news had a real influence on voter preference during that campaign (Allcott and Gentzkow Reference Allcott and Gentzkow2017). According to a recent Pew Research Center study from mid 2019, Americans rate fake news as a larger problem than racism, climate change, and terrorism.Footnote b Given the above context, establishing methods for fact-checking the veracity of information presented through social media is critical.
The most reliable approach is without a doubt human fact-checking. Unfortunately, it is also the most expensive and difficult solution in terms of labor costs and the time needed to find and process reliable sources of information, which are often unstructured. Moreover, not all information needed for checking facts is readily available. For example, verifying certain claims may require access to expert knowledge contained in scientific databases or otherwise not publicly available. For these reasons, and given the sheer amount of information that is generated and shared online, human verification is realistically applicable only to a fraction of news and claims.
Accordingly, many researchers and companies have turned towards investigating automated methods of fake news detection. Firstly, this is reflected by the rise of several start-ups dedicated to fake news detection technology, later acquired by the main players in the social network industry. Secondly, the surging interest in automated fake news recognition has resulted in an increasing amount of research on this topic.
Studies on the methods of automated fake news detection typically fall into the following categories: content-based (detection based solely on text), source-based (based on features of the source), and diffusion-based (based on patterns of spreading the news).
In this article, our main focus is on content-based methods. Namely, we are interested in techniques that predict whether a particular document—usually a news article—is fake or not, using that document as the only source of information. Predicting its veracity is carried out on the level of content, syntax, style, and other implicit features (Newman et al. Reference Newman, Pennebaker, Berry and Richards2003) and is based on selected methods available in the field of natural language processing, especially in the area of deep neural networks. We investigated multiple fake news data sets generated by using different assumptions to see if universal features can be identified and generalized to widely applicable models. We compared the content-based approach to a chosen state-of-the-art automated fact-checking system that uses Wikipedia as its knowledge base.
The definition of fake news remains problematic in the context of research. It is not sufficient to say that it is news that contains false information. The key feature is that this kind of article is also intentionally fabricated (Conroy et al. Reference Conroy, Rubin and Chen2015; Allcott and Gentzkow Reference Allcott and Gentzkow2017), and we follow this understanding in our article.
This often entails the use of slightly different language and stylistic means (Newman et al. Reference Newman, Pennebaker, Berry and Richards2003; Rashkin et al. Reference Rashkin, Choi, Jang, Volkova and Choi2017; Levi et al. Reference Levi, Hosseini, Diab and Broniatowski2019). Moreover, we limited our study to manually written content. We did not focus on the recognition of AI-generated (neural) fake news, which may soon become a serious danger due to the rapid improvements in automated content generation tools. However, recent studies have revealed that although AI-generated news might be perceived by humans as more trustworthy, automated discriminators can differentiate between human-written and AI-generated texts with an accuracy of up to 92% (Zellers et al. Reference Zellers, Holtzman, Rashkin, Bisk, Farhadi, Roesner and Choi2019).
Content-based approaches make it possible to avoid tedious fact-checking procedures and do not require collections of truthful documents against which to compare the input claim.Footnote c Our method is therefore more resource-efficient, but at the same time tackles a more difficult task. Since it does not require any additional resources beyond the text being analyzed, it has the potential to be more universally applicable than other approaches such as fact-checking.Footnote d
Our paper is organized as follows. Section 2 presents existing work on the topic of fake news detection; Section 3 introduces the data sets used in our experiments. Section 4 describes feature spaces used for representing texts as well as baseline classification methods used in two scenarios: Section 4.4 describes the performance of models trained and tested on the same data set (or in-domain), while Section 4.5 refers to a scenario in which the models were trained and tested on different data sets (or cross-domain). The high accuracies obtained when training and testing on the same data set can wrongly indicate that the problem of automated fake news detection is nearly solved. Testing the model on a different data set than the training one showed a markedly different picture. In this scenario prediction quality decreased significantly, which demonstrates that creating a universally applicable model for fake news detection is far from easy.
The core part of our paper is focused on possible methods of dealing with this quality decrease. We tested five broad categories of methods:
A leave-one-out approach described in Section 5. In these experiments, we tested the effect of training on several data sets.
Automated fact-checking described in Section 6. This method is based on two stages: retrieval of the relevant article (evidence) from Wikipedia and inference about the veracity of the input text (claim).
Feature selection, which we investigated in Section 7. Section 7.1 describes the results of finding features that are relevant not just to one but to many data sets, and therefore potentially more universal (feature intersection), and Section 7.2 presents a classifier built on such a feature space.
Methods from the area of machine learning related to the concept of data set shift in Section 8. Data set shift techniques deal with differences between the distributions of training and testing data that may lead to unreliable predictions. The approaches include instance reweighting and common representation space methods such as subspace alignment and geodesic flow kernel.
Deep learning approaches to domain adaptation are presented in Section 9. The scenario that we are referring to is also called transductive transfer learning and should be distinguished from unsupervised and inductive transfer learning. The latter two are more common in Natural Language Processing (NLP), with typical examples being pretraining and fine-tuning of transformer-based language models.Footnote e
We summarize all the results and conclude the paper in Section 10.
2. Related work
Numerous approaches have been proposed to solve the problem of fake news detection. The following review will focus solely on text-based methods: analyzing textual content to predict the veracity of an input text. Broadly speaking, the approaches can be classified either as knowledge-rich (fact-checking) or knowledge-lean.
2.1 Fact-checking
Fact-checking is based on the confirmation or rejection of claims made in a text piece with the use of explicit references to knowledge sources, ideally credible ones such as Wikipedia (Thorne et al. Reference Thorne, Vlachos, Christodoulopoulos and Mittal2018). Automated fact-checking is done in a two-step procedure: in the first step, relevant articles are retrieved from the knowledge base, and in the second step, inference (refuting or supporting the input claim) is performed by a neural network.
The approaches to fact-checking go beyond Wikipedia: Wadden et al. (Reference Wadden, Lin, Lo, Wang, van Zuylen, Cohan and Hajishirzi2020) introduce scientific claim verification and demonstrate that domain adaptation techniques improve performance compared to models trained on Wikipedia or political news. In Augenstein et al. (Reference Augenstein, Lioma, Wang, Chaves Lima, Hansen, Hansen and Simonsen2019), the authors use a data set from fact checking websites and apply evidence ranking to improve veracity prediction. The research by Leippold and Diggelmann (Reference Leippold and Diggelmann2020) introduces a data set for verification of climate change-related claims and adapts the methodology of Thorne et al. (Reference Thorne, Vlachos, Christodoulopoulos and Mittal2018). Another study (Rashkin et al. Reference Rashkin, Choi, Jang, Volkova and Choi2017) probes the feasibility of automatic political fact checking and concludes that stylistic cues can help determine the truthfulness of text.
Fact-checking and fake news detection have been the main topics of CLEF competitions since 2018. In the 2018 edition, the second task “Assessing the veracity of claims” asked to assess whether a given check-worthy claim made by a politician in the context of a debate/speech is factually true, half-true, or false (Nakov et al. Reference Nakov, Barrón-Cedeño, Elsayed, Suwaileh, Màrquez, Zaghouani, Atanasova, Kyuchukov and Da San Martino2018). The data set consists of less than 90 verified statements from a presidential debate. In 2019 (Elsayed et al. Reference Elsayed, Nakov, Barrón-Cedeño, Hasanain, Suwaileh, Da San Martino and Atanasova2019), “Task 2: Evidence and Factuality” asked to (A) rank a given set of web pages with respect to a check-worthy claim based on their usefulness for fact-checking that claim, (B) classify these same web pages according to their degree of usefulness for fact-checking the target claim, (C) identify useful passages from these pages, and (D) use the useful pages to predict the claim’s factuality. In CLEF 2020 (Barrón-Cedeño et al. Reference Barrón-Cedeño, Elsayed, Nakov, Da San Martino, Hasanain, Suwaileh, Haouari, Babulkov, Hamdan, Nikolov, Shaar and Ali2020), “Task 4: Claim Verification” asked to predict the veracity of a target tweet’s claim by using a set of web pages and potentially useful snippets contained within. The data sets for the 2019 and 2020 tasks are in Arabic. The CLEF 2021 “Task 3a: Multi-Class Fake News Detection of News Articles” (Nakov et al. Reference Nakov, Da San Martino, Elsayed, Barrón-Cedeño, Mguez, Shaar, Alam, Haouari, Hasanain, Babulkov, Nikolov, Shahi, Struß and Mandl2021b) focused on multi-class fake news detection and topical domain detection of news articles in several languages, including English. The data are not publicly available. The CLEF 2021 “Task 2: Detecting previously fact-checked claims from tweets” used data from Snopes and ClaimsKG (Tchechmedjiev et al. Reference Tchechmedjiev, Fafalios, Boland, Gasquet, Zloch, Zapilko, Dietze and Todorov2019) to rank previously fact-checked claims in order to measure their usefulness.
A fully automated fact-checking pipeline could include steps such as identifying claims worthy of fact-checking, followed by detecting relevant previously checked claims, and only then verifying the selected, check-worthy claims (Nakov et al. Reference Nakov, Da San Martino, Elsayed, Barrón-Cedeño, Mguez, Shaar, Alam, Haouari, Hasanain, Babulkov, Nikolov, Shahi, Struß and Mandl2021b). Systems like this can facilitate the work of fact-checking professionals (Nakov et al. Reference Nakov, Corney, Hasanain, Alam, Elsayed, Barrón-Cedeño, Papotti, Shaar and Da San Martino2021a).
2.2 Knowledge-lean
Methods of the second type infer veracity only from the input text and do not use any external sources of information. Such approaches are based on various linguistic and stylometric features aided by machine learning.
Features such as n-grams, the LIWC psycholinguistic lexicon (Pennebaker et al. Reference Pennebaker, Boyd, Jordan and Blackburn2015), readability measures, and syntax were all used in Pérez-Rosas et al. (Reference Pérez-Rosas, Kleinberg, Lefevre and Mihalcea2018). Then, an SVM classifier was applied to predict veracity. A similar set of features including n-grams, parts of speech, readability scores, and the General Inquirer lexicon features (Stone et al. Reference Stone, Dunphy, Smith and Ogilvie1966) was used in Potthast et al. (Reference Potthast, Kiesel, Reinartz, Bevendorff and Stein2018). Interestingly, they argue that style-based fake news classification does not generalize well across news of the two partisan orientations in the USA. A combination of features at four levels—lexicon, syntax, semantic, and discourse—was used in Zhou et al. (Reference Zhou, Jain, Phoha and Zafarani2020). They applied classification methods such as SVM (with linear kernel), random forest (RF), and XGBoost. In yet another study, Przybyla (Reference Przybyla2020) designed two classifiers: a neural network and a model based on stylometric features to capture the typical fake news style. Style analysis made it possible to extract the sensational and affective vocabulary that is typical of fake news.
Performance drops across different domains were carefully studied by Silva et al. (Reference Silva, Luo, Karunasekera and Leckie2021), who proposed a new framework that jointly preserves domain-specific and cross-domain knowledge. This allowed a new cross-domain classifier to be trained on data selected by an unsupervised technique (and manually labelled).
The problem of excessive dependence on training sets and the low robustness of text-based fake news detection was observed by Janicka et al. (Reference Janicka, Pszona and Wawer2019). The authors used four different fake news corpora to train models in both in-domain and cross-domain settings. They concluded that the results achieved by models trained and tested on the same corpus (in-domain) are unrealistically optimistic with regard to the true performance of the models on real-world data. The performance in the cross-domain setting was on average 20% worse than in the case of the in-domain setting. Compared to Janicka et al. (Reference Janicka, Pszona and Wawer2019), our work is a further step forward: our goal is to find the means to improve the robustness and cross-domain performance of fake news detection models.
3. Data sets
As our work aimed to assess a universal tool for fake news recognition, we used several publicly available data sources. The available data sets vary in text length, origin, time span, and the way of defining the truthfulness of information. In all cases, the annotation was performed at the article level. The key feature of fake news is that it is intentionally written disinformation. The classification of the collected data as fake or true was not based on the number or ratio of misleading claims or sentences; articles classified as fake were often a mixture of true and false sentences. The ways of ascribing news veracity vary among the data sets. The following strategies were used: (1) labelling based on source reliability (the content was not analyzed), (2) manual verification of texts by experts, and (3) generation of synthetic data (only fake news) by means of crowdsourcing. Hence, unintentional inaccuracies in a text did not result in it being classified as fake. A summary of the investigated data sets can be found in Table 1, while samples of texts are provided in Table 2.
In our studies we used the following data sets:
(1) Kaggle Fake News Data set Footnote f – consists of 20.8 k full-length news texts labeled as either fake or true. The fake portion of the dataFootnote g was collected from websites tagged as unreliable by the BS Detector Chrome extension created by Daniel Sieradski. The fake news mainly stems from the US and covers a one-month period around the 2016 elections. Since we found some non-English texts in this database, we used the langdetect libraryFootnote h to filter them out. As a result, we ended up with 20,165 samples: 10,384 true and 9781 fake.
(2) LIAR (Wang Reference Wang2017) – 12.8 k short statements categorized into six categories (pants-fire, false, barely-true, half-true, mostly-true, and true). The data come from politifact.com and were manually annotated by experts not only for truthfulness, but also for subject, context, and speaker. The statements were collected primarily in the period from 2007 to 2016 from various sources (e.g. news releases, radio interviews, tweets, campaign speeches) and cover diverse topics such as elections, education, healthcare, etc. It is worth noting that the statements often contain numerical values, for example “Close to 30% of our federal prison population consists of illegal immigrants.” Moreover, some of the claims require broader context for their verification. For instance, the claim “Crime is rising” does not provide detailed information about location and time period. This data set was originally divided into three parts—train, validation, and test—which we merged. In our research, we decided to use only texts annotated as pants-fire, false, or true. Finally, we ended up with 5624 samples: 3561 fake and 2063 true.
(3) AMT (Pérez-Rosas et al. Reference Pérez-Rosas, Kleinberg, Lefevre and Mihalcea2018) – consists of 240 true and 240 false news articles. The true portion of the data was collected from US news web sites such as ABCNews, CNN, USAToday, NewYorkTimes, FoxNews, etc. and covers six domains: sports, business, entertainment, politics, technology, and education. The fake part was created based on real news via Amazon Mechanical Turk. Each fake news story was written with the aim of imitating the topic and style of a true article while presenting fake information.
(4) FakeNewsNet (Shu et al. Reference Shu, Mahudeswaran, Wang, Lee and Liu2020) – a data set created by using FakeNewsNet resources and tools with help from politifact.com as a source of information about the veracity of individual texts. It contains the most popular fake news stories that were spreading on Facebook in 2016 and 2017, balanced with some reliable texts. These fall into several categories such as politics, medicine, and business.
(5) ISOT Fake News Data set (Ahmed, Traore, and Saad Reference Ahmed, Traore and Saad2017) – the data set consists of 21,417 real news items collected from reuters.com and 23,481 news items obtained from sites flagged as unreliable by Politifact and Wikipedia. The collected articles are from 2016 to 2017 and mainly deal with politics and world news.
We can see that the length and style of news differ between data sets. As the LIAR data set comprises data from a fact-checking portal, it contains short claims that often express a single fact that needs to be verified. Figure 1 illustrates the mean and standard deviation of the number of sentences for each data set. As we can see, LIAR texts are on average the shortest among the data from all the above data sets. We can also observe that the AMT data have a relatively small standard deviation in text length, which can be ascribed to the data preparation process, where only article fragments of a given size were considered. The Kaggle data set shows the biggest spread in text length, and at the same time its texts are on average the longest among all data. In most data sets, there is some overlap when it comes to the time span of the origin of the collected texts. We do not have information about the AMT time span, but every other data set covers at least some part of the year 2016, with the Kaggle time window being the narrowest at 1 month and LIAR the broadest at 10 years. That was the year of the presidential election in the United States, which brought greater social awareness of issues connected to the spread of fake news and encouraged researchers to devote more attention to this problem. However, the topics covered in each data set are not limited to politics, but also include other social issues, with some differences in the proposed categorization of the topics.
To identify patterns in the data sets that the models might exploit, we computed the mutual information between bigrams in the text and the labels according to the method described by Schuster et al. (Reference Schuster, Shah, Yeo, Roberto Filizzola Ortiz, Santus and Barzilay2019). Popular bigrams associated with each of the two labels are the names of politicians from both parties. However, there is no clear attribution. Moreover, even in the same data set, different variants may be associated with opposite classes. For example, Trump-related bigrams are associated with true texts in Kaggle (except “Donald Trump”, as this one is linked to fake texts) and AMT, but with the fake class in ISOT (“President Trump”). Clinton-related bigrams are associated with the fake class in Kaggle (except “Mrs Clinton”, which is linked to the true class), FNN, and ISOT. Other bigrams with high scores include social issues (LIAR) or month names (Kaggle). This indicates that even the bigrams most correlated with the true and false classes have limited utility for cross-domain classification.
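To make this analysis concrete, the sketch below shows one simple way of scoring bigrams by their association with the class labels. It follows the spirit of the local mutual information used by Schuster et al. (Reference Schuster, Shah, Yeo, Roberto Filizzola Ortiz, Santus and Barzilay2019) rather than reproducing their exact procedure; the function and variable names are illustrative.

```python
import math
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

def bigram_label_lmi(texts, labels, top_k=20):
    """Rank bigrams by local mutual information with each class label."""
    vec = CountVectorizer(ngram_range=(2, 2), binary=True)
    X = vec.fit_transform(texts)                 # documents x bigrams, 0/1 occurrence
    vocab = vec.get_feature_names_out()
    n_docs = X.shape[0]
    doc_freq = X.sum(axis=0).A1                  # number of documents containing each bigram
    label_counts = Counter(labels)
    ranking = {}
    for lbl, lbl_count in label_counts.items():
        rows = [i for i, y in enumerate(labels) if y == lbl]
        freq_in_label = X[rows].sum(axis=0).A1   # documents of this label containing the bigram
        p_label = lbl_count / n_docs
        scores = []
        for j, bigram in enumerate(vocab):
            p_joint = freq_in_label[j] / n_docs
            p_bigram = doc_freq[j] / n_docs
            if p_joint > 0:
                scores.append((p_joint * math.log(p_joint / (p_bigram * p_label)), bigram))
        ranking[lbl] = sorted(scores, reverse=True)[:top_k]
    return ranking
```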
The highest classification accuracies previously reported for selected fake news data sets are listed in Table 3. The results presented there are not directly comparable to ours as we used cross validation instead of splitting the data into train and test sets.
4. Fake news detection methods
In our work, we define fake news recognition as a text classification task, which reduces the scope of investigation to methods connected with news texts only. Images, sources, authorship, and other features of the news are outside the scope of our analysis. In this section, we introduce various sets of features used to represent texts in further experiments. Then, we describe the application of several methods from the fields of machine learning and deep learning that we used to predict text veracity. The methods applied and described in this section are generic, meaning that no adaptation was introduced to increase their robustness in the cross-domain scenario. Such adaptations are the subject of the subsequent Sections 8 and 9.
4.1 Feature space
(1) Bag-of-words (BOW) – in the first approach, for each data set we calculated the term frequency–inverse document frequency (TF-IDF) of uni-, bi-, and tri-grams.
(2) Linguistic features – using various sources of information, we created a set of 271 features containing semantic, syntactic, and psycholinguistic information. More than 180 features stem from the General Inquirer (Stone et al. Reference Stone, Dunphy, Smith and Ogilvie1966), a tool providing a wide range of psycholinguistic categories for analyzing text content in terms of sentiment and emotions, as well as multiple other social and cognitive categories. We enriched the data with hand-crafted dictionaries of linguistic hedges and exclusion terms, which are presented in Table 4. Each feature was normalized by the number of all words in the text. Similarly, using the spaCy library,Footnote i we added part-of-speech information for each text. We also added the sentence-level subjectivity measure from an LSTM-based modelFootnote j that was trained on a data set released by Pang and Lee (Reference Pang and Lee2004). Finally, various readability indicators were used: Flesch-Kincaid (Flesch Reference Flesch1948; Kincaid et al. Reference Kincaid, Fishburne, Rogers and Chissom1975), ARI (Smith and Senter Reference Smith and Senter1967), Coleman-Liau (Coleman and Liau Reference Coleman and Liau1975), etc. A minimal sketch of this feature space is given after the list.
(3) GloVe (Pennington, Socher, and Manning Reference Pennington, Socher and Manning2014) – we used 100-dimensional word embeddings that were pre-trained using aggregated global word co-occurrence statistics found on Wikipedia and the Gigaword corpora.
(4) Universal Sentence Encoder (Cer et al. Reference Cer, Yang, Kong, Hua, Limtiaco, St. John, Constant, Guajardo-Cespedes, Yuan, Tar, Strope and Kurzweil2018) – a versatile sentence embedding model based on the Transformer architecture trained in a transfer learning manner. It converts text into 512-dimensional vector representations, which can capture rich semantic information. They can be used in various downstream tasks and have proven to yield good results in many text classification problems.
(5) ELMo embeddings (Peters et al. Reference Peters, Neumann, Iyyer, Gardner, Clark, Lee and Zettlemoyer2018) – we used this model to represent each text by a 1024-dimensional vector. These features were used exclusively for visualization, to compare whether they convey more information than the embeddings obtained from the Universal Sentence Encoder.
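The following sketch illustrates how a small part of the linguistic feature space (item 2 above) can be computed. It covers only POS ratios and a few readability indices, uses the textstat package as a stand-in for the readability formulas cited above, and omits the General Inquirer categories, the hedge/exclusion dictionaries, and the subjectivity model.

```python
import spacy
import textstat

nlp = spacy.load("en_core_web_sm")   # any English spaCy pipeline with a POS tagger

def linguistic_features(text):
    doc = nlp(text)
    n_tokens = max(len(doc), 1)
    features = {}
    # Part-of-speech ratios, normalized by the number of tokens in the text.
    for token in doc:
        key = f"pos_{token.pos_}"
        features[key] = features.get(key, 0.0) + 1.0 / n_tokens
    # Readability indicators (textstat stands in for the formulas cited above).
    features["flesch_kincaid"] = textstat.flesch_kincaid_grade(text)
    features["ari"] = textstat.automated_readability_index(text)
    features["coleman_liau"] = textstat.coleman_liau_index(text)
    return features
```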
4.2 Models
As the baseline approach, we used a support vector machine classifier with linear kernel trained on bag-of-words features. The default parameters were used: C=1 and squared hinge loss as the loss function.
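A minimal sketch of this baseline, assuming scikit-learn's default LinearSVC settings and the word-level 1-3 gram TF-IDF features described in Section 4.1:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

bow_svm = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),    # uni-, bi-, and tri-grams
    ("svm", LinearSVC(C=1.0, loss="squared_hinge")),    # default linear-kernel SVM
])

# texts: list of documents, y: binary labels (e.g., 0 = true, 1 = fake)
# scores = cross_val_score(bow_svm, texts, y, cv=10, scoring="accuracy")
```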
To create a classifier based on linguistic features, we tested various machine learning algorithms from the scikit-learn (Pedregosa et al. Reference Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot and Duchesnay2011) library, such as support vector machines, stochastic gradient descent, extra trees, and gradient-boosted trees (XGBoost). The default hyperparameter values were used in each case. The best test set performance for all five data sets was obtained by XGBoost, which was used in further experiments.
Next, we tested a bidirectional-LSTM architecture initialized with 100-dimensional GloVe embeddings. The scheme is shown in Figure 2 on the right. According to a benchmark study on fake news detection methods by Khan et al. (Reference Khan, Khondaker, Afroz, Uddin and Iqbal2021), this classifier achieved the best results among the tested approaches. Both the output dimension of the LSTM and the number of time steps were set to 100. The outputs of the last LSTM units (one for each direction) were concatenated and followed by dropout and a dense classification layer with the sigmoid activation function. We used the ADAM optimizer to minimize the binary cross-entropy loss. The model was implemented using the PyTorch library.
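A minimal PyTorch sketch of this architecture follows; the dropout rate is an assumption, and the GloVe embedding matrix and the 100-step padded token sequences are assumed to be prepared elsewhere.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, embedding_matrix, hidden_dim=100, dropout=0.5):
        super().__init__()
        # embedding_matrix: (vocab_size, 100) array of GloVe vectors
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix, dtype=torch.float), freeze=False)
        self.lstm = nn.LSTM(input_size=embedding_matrix.shape[1],
                            hidden_size=hidden_dim,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids):                    # (batch, 100) padded token ids
        embedded = self.embedding(token_ids)         # (batch, 100, 100)
        _, (h_n, _) = self.lstm(embedded)            # h_n: (2, batch, hidden_dim)
        last = torch.cat([h_n[0], h_n[1]], dim=1)    # concatenate both directions
        return torch.sigmoid(self.classifier(self.dropout(last)))

# model = BiLSTMClassifier(glove_matrix)
# criterion, optimizer = nn.BCELoss(), torch.optim.Adam(model.parameters())
```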
As a basic model utilizing USE embeddings, we used the LinearSVC classifier which takes as input a vector representation of whole texts. This approach proved to be very successful in some text classification tasks with limited data thanks to SVM’s robustness and regularization (Xu, Caramanis, and Mannor Reference Xu, Caramanis and Mannor2009). Since one of our goals was to compare different deep domain adaptation methods, we also investigated the performance of a simple neural model utilizing USE embeddings as a baseline. The architecture is presented in Figure 2 (left).
The embeddings were used as the input and were followed by three blocks, each constituted of a dense layer with residual connection and a normalization layer. Finally, the obtained representation was projected to two classes by two dense layers. Hyperparameters such as dropout and dense layer dimensions were fine-tuned using the Kaggle data set, while the epoch count was set to 200 based on the mean scores obtained for all data sets.
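The sketch below approximates this architecture; the hidden size of the classification head, the dropout rate, and the use of layer normalization are assumptions, as the exact tuned hyperparameter values are not listed here.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim, dropout=0.3):
        super().__init__()
        self.dense = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # dense layer with a residual connection followed by normalization
        return self.norm(x + self.dropout(torch.relu(self.dense(x))))

class USEClassifier(nn.Module):
    def __init__(self, input_dim=512, n_blocks=3, hidden_dim=128):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(input_dim) for _ in range(n_blocks)])
        self.head = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, 2))   # projection to two classes

    def forward(self, use_embeddings):               # (batch, 512) USE vectors
        return self.head(self.blocks(use_embeddings))
```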
Moreover, we tested the BERT language model (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019), which is based on the Transformer architecture and has achieved state-of-the-art results in many NLP tasks. We used the TensorFlow implementation and tested sequence lengths of 32, 128, and 512. The number of training epochs was limited to 5, as there was no further improvement in the classification performance. Typically, one or two epochs provided maximum accuracy. Therefore, we finally used a sequence length of 128 and trained the model for 2 epochs.
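For illustration, the sketch below fine-tunes BERT for binary veracity classification with the same key settings (sequence length 128, 2 training epochs); it uses the Hugging Face transformers API rather than the TensorFlow implementation mentioned above, and the data loader is assumed to exist.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def encode(texts):
    return tokenizer(texts, truncation=True, padding="max_length",
                     max_length=128, return_tensors="pt")

# Typical fine-tuning loop (2 epochs):
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# for epoch in range(2):
#     for batch_texts, batch_labels in loader:
#         loss = model(**encode(batch_texts), labels=batch_labels).loss
#         loss.backward(); optimizer.step(); optimizer.zero_grad()
```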
4.3 Evaluation of classifiers
Comparison of classifier performance is a difficult task and cannot be limited to the analysis of a single main score, often calculated using k-fold cross validation. Although the model with the best mean performance is expected to be the better one, this is not always true. Such a simple approach can be misleading, as the difference in performance might be caused by chance. Therefore, to be sure that one model provides significantly higher accuracy than the others, statistical tests should be performed. Fortunately, there is a variety of statistical methods enabling the selection of the best-performing machine learning model.
As the first step, we decided to use Cochran’s Q test (Cochran Reference Cochran1950), which determines whether there are differences on a dichotomous dependent variable (one that can take only one of two possible values) between more than two related groups. It tests the null hypothesis that the proportion of “successes” is the same in all groups. We applied this test to the predictions of the investigated models. If the null hypothesis was rejected, multiple pairwise comparisons between groups were performed as the second step.
We applied the Wilcoxon signed-rank test (Wilcoxon Reference Wilcoxon1945; Demšar Reference Demšar2006) to verify whether two samples differ significantly from each other, testing the null hypothesis that the related paired samples are drawn from the same distribution. It is a non-parametric version of the paired Student’s t-test and, contrary to Student’s t-test, does not require normally distributed values.
Figure 3 presents how the pair-wise evaluation and comparison of models under different settings were performed. Instead of using raw model predictions as in Cochran’s Q test, we compared the accuracies obtained during k-fold cross validation. In our experiments, we used 10-fold cross validation and set the significance level to 0.05. In the results tables, we marked the groups of models for which Cochran’s Q test revealed no differences with a gray background. Moreover, within the remaining groups, the pairs of models for which the Wilcoxon signed-rank test indicated a p-value $> 0.05$ (non-significant differences) are marked with the same letters in the superscript.
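A sketch of this two-step procedure, assuming statsmodels' cochrans_q for the omnibus test and SciPy's wilcoxon for the pairwise comparisons; the input shapes and helper name are illustrative.

```python
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import cochrans_q

def compare_models(correct_matrix, fold_accuracies, alpha=0.05):
    # correct_matrix: (n_samples, n_models) binary indicators of correct predictions
    # fold_accuracies: (n_folds, n_models) accuracies from 10-fold cross validation
    omnibus = cochrans_q(correct_matrix)
    if omnibus.pvalue >= alpha:
        return "no significant differences between models"
    pairwise = {}
    n_models = fold_accuracies.shape[1]
    for i in range(n_models):
        for j in range(i + 1, n_models):
            _, p = wilcoxon(fold_accuracies[:, i], fold_accuracies[:, j])
            pairwise[(i, j)] = p        # p > alpha: difference not significant
    return pairwise
```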
4.4 In-domain results
We compared the performance of the above classifiers in data set-specific or in-domain settings using 10-fold cross validation. The obtained results are presented in Table 5. In general, all classifiers performed significantly better on the Kaggle and ISOT data sets when compared with the others. This fact does not seem surprising considering that those collections are several times larger than the others.
a We could not replicate the results of Perez-Rosas et al., who reported an accuracy of 0.62 using similar settings: a linear SVM classifier based on TF-IDF values for unigrams and bigrams with 5-fold cross-validation; with these settings we achieved an accuracy of 0.43. The accuracy is significantly higher for char-level settings - 0.66. The value reported here is for 10-fold cross-validation of an SVM classifier based on TF-IDF values for unigrams, bigrams, and trigrams.
Taking into account the results on all data sets, the best-performing classifier is XGBoost trained on linguistic features. It achieved the highest accuracy on two out of five data sets: Kaggle and AMT. Moreover, its accuracy on the three remaining data sets—ISOT, LIAR, and FNN—was lower than the accuracy of the best-performing classifiers by only 0.01, 0.05, and 0.05, respectively. Finally, its advantage over other models on the AMT data set is enormous: the obtained accuracy is higher by 0.13 when compared to the runner-up. It seems that this data set is the most challenging for the other methods. The accuracies of models using USE embeddings are very low: 0.57 obtained by LinearSVC and 0.51 achieved by a neural network based on dense layers. The poor performance of classifiers based on embeddings in the case of the AMT data set is discussed further.

Even worse results were provided by LinearSVC using bag-of-words features (0.40). The reason behind such poor results is strong overfitting to the training data. In the 10-fold cross validation setting, the accuracy on the training sets regularly reached a value of almost 1.0, while on the test parts it fluctuated around 0.40. In the case of char-level settings (using character-level n-grams), the values are 0.82 and 0.66, respectively, but we decided to focus on word-level analysis.

The comparison of our results with the best existing ones (as in Table 3) revealed that we managed to outperform the accuracy reported for the ISOT data set in Ahmed et al. (Reference Ahmed, Traore and Saad2017) (our 0.99 vs. the reported 0.92). Our results are similar in the case of Kaggle; however, we did not obtain a higher accuracy than Zhou et al. (Reference Zhou, Jain, Phoha and Zafarani2020) for the FNN data set. As for the LIAR data set, we obtained significantly higher accuracy than the best results reported by Wang (Reference Wang2017) and Popat et al. (Reference Popat, Mukherjee, Yates and Weikum2018). However, these values cannot be directly compared: six classes were used in the original analysis, while we narrowed it down to two. It should be noted that we used the same method for all data sets (without hyperparameter tuning for a specific data set). Hence, in our opinion, the results are satisfactory.
We also investigated which linguistic features are the most important for classifying fake news. The results are shown in Table 6 and indicate that among the linguistic features, part-of-speech tags are the most informative. The addition of other linguistic features (the first row of Table 6) did not significantly improve the accuracy—only by 0.03 for AMT and 0.04 for FNN. In the case of LIAR data set, it even led to a slight decrease in performance.
The obtained results indicate that designing a high-accuracy classifier is more challenging for some data sets than for others. To get insight into the arrangement of data in both high-dimensional feature spaces—USE embeddings (512) and linguistic features (271)—we applied t-Distributed Stochastic Neighbor Embedding (t-SNE). The vectors containing linguistic features were standardized before dimensionality reduction. Figure 4 presents the obtained results for three selected data sets.
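A minimal sketch of this visualization step, assuming precomputed feature matrices (USE embeddings, or linguistic feature vectors that are standardized before projection) and NumPy arrays of binary labels:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def plot_tsne(features, labels, standardize=False, title=""):
    if standardize:                           # applied to the 271-dim linguistic vectors
        features = StandardScaler().fit_transform(features)
    coords = TSNE(n_components=2, random_state=0).fit_transform(features)
    for cls, color, name in [(0, "tab:blue", "true"), (1, "tab:red", "fake")]:
        mask = labels == cls
        plt.scatter(coords[mask, 0], coords[mask, 1], s=5, c=color, label=name)
    plt.legend()
    plt.title(title)
    plt.show()
```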
This visualization can explain why the classification based on USE embeddings failed in the case of AMT. For a large number of points related to true texts, points related to fake texts occurred at a very close distance. This might be caused by the way in which this data set was created. Initially, legitimate news pieces were collected from a variety of websites (their veracity was confirmed by manual fact-checking). However, the fake news collection did not originate in any existing source, but was written specifically for this data set. Amazon Mechanical Turk workers were asked to write a fake version for every legitimate news story. At the same time, they were requested to imitate the journalistic style and preserve the names mentioned in the original news. As a result, both the originals and their fake counterparts were closely related. Therefore, their USE embeddings were often similar (USE embeddings have proven successful at capturing semantic similarity).
For comparison, we also visualized more expressive embeddings—ELMo (Figure 4b). The t-SNE plots for both USE and ELMo embeddings appear quite similar. Still, the ELMo representations of paired true and fake texts were often located close to each other. In the case of the linguistic representation, however, the samples in the AMT data set were much more scattered, that is, the vectors of related original and fabricated texts were more separated. This indicates that USE and ELMo embeddings are more related to the content of the text (its semantics), while the linguistic features are not strongly related to its meaning but instead capture information about part-of-speech tags, readability indices, etc. Even if both texts cover the same topic, they are written differently, which results in a larger change in linguistic features than in their embeddings. Therefore, the linguistic feature vectors of the fake and true texts are better separated, which allows for more accurate classification.
In the case of LIAR, linguistic representations of the data tended to form many well-separated clusters, while USE embeddings seemed more dispersed. This might be due to the length of the texts in the LIAR data set. As they contain single short sentences, their linguistic representations contain many zeros. Unfortunately, true and fake texts do not form separate clusters, which makes classification very difficult. The visualization of the data arrangement suggests that USE embeddings might work better for this data set.
The distributions of USE embeddings and linguistic features obtained for the Kaggle data differed significantly. Most strikingly, the linguistic representations of fake news formed one large cluster and a few smaller ones. As for USE embeddings, they formed smaller clusters. The points related to both classes rarely appeared in the same clusters. The classifiers utilizing USE embeddings reached an accuracy of 0.93. An even higher accuracy of 0.99 was achieved when linguistic representation was used.
To conclude, the use of linguistic and psycholinguistic features proved to be the most versatile approach for the detection of fake news. However, it only works for longer texts and not for very short ones, like in the LIAR data set.
As the t-SNE plots revealed large differences between the studied data sets, we decided to see how the classification accuracy relates to the number of training samples.
The goal of this analysis was to show that variability in classification accuracy was not solely a result of different data set sizes. We also hoped to observe how quickly the classification models learn and to approximate how accuracy increases when adding training samples. Lastly, the behavior of classifiers with limited ground truth access was examined.
The classifiers were trained on randomly selected samples that maintained class representations. The remaining samples were used as the test set. Due to large variance in the results, especially for a small number of training samples, the whole procedure was repeated 10 times and the obtained accuracy scores were averaged.
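The sketch below outlines this procedure for an arbitrary scikit-learn estimator; the helper name and the way random seeds are drawn are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split

def learning_curve_scores(model, X, y, train_sizes, n_repeats=10, seed=0):
    rng = np.random.RandomState(seed)
    mean_accuracies = []
    for size in train_sizes:
        accuracies = []
        for _ in range(n_repeats):
            # stratified split: `size` training samples, the rest used for testing
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=size, stratify=y,
                random_state=rng.randint(10**6))
            clf = clone(model).fit(X_tr, y_tr)
            accuracies.append(clf.score(X_te, y_te))
        mean_accuracies.append(np.mean(accuracies))
    return mean_accuracies
```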
Figure 5 shows the results obtained for two classifiers: LinearSVC and XGBoost. We compared two representations: USE embeddings and linguistic features. Interestingly, the obtained plots reveal significant differences. The most expected are the plots for Kaggle and ISOT: the larger the training set, the higher the accuracy accompanied by lower variance. As it turned out, LinearSVC performed slightly better than XGBoost when trained on USE embeddings. The behavior of models trained on FakeNewsNet was very similar to the classifiers trained on Kaggle and ISOT. Unfortunately, due to the much smaller size of FakeNewsNet, it could not be compared in larger ranges. Nevertheless, the observed similarity suggests that the addition of new labeled samples would improve classification accuracy.
The plots obtained for the AMT data set look strikingly different. The most surprising is the behavior of the XGBoost classifier trained on USE embeddings. Its accuracy did not exceed 0.5 and decreased with the increasing size of the training set. Such unusual behavior occurs exclusively for the AMT data set, which indicates that it is related to the nature of the USE embeddings of texts contained in that data set (as previously discussed). The t-SNE visualization of these embeddings (Figure 4) shows that they often occur in highly similar pairs: the embeddings of the original (true) text and the fake one. As a reminder, the USE embeddings of original and fabricated texts are often close to each other, as they cover the same topic. Interestingly, LinearSVC dealt significantly better with such specific data.Footnote k The situation was different in the case of classifiers trained on linguistic features: the accuracy of both LinearSVC and XGBoost increased upon expansion of the training set for the AMT data set. Just as for ISOT and Kaggle, LinearSVC also performed slightly better than XGBoost on a small training set. When its size exceeded 50 samples, XGBoost gained a large advantage.
Models trained on LIAR exhibited a still different behavior. The increase of the training set size had little impact on classification accuracy. Moreover, for both types of representations, the XGBoost model provided better performance.
The obtained results clearly show that the investigated fake news corpora differ significantly. Only Kaggle and ISOT are similar to each other. It also seems that FakeNewsNet has a lot in common with these data sets, but its size is two orders of magnitude lower. Most likely this is the reason why models trained on this corpus did not achieve high accuracy.
Moreover, for really small data sets, ranging from 5 to 50 samples, the representation by linguistic and psycholinguistic features provided better performance. It indicates that such representations are more universal. The advantage of linguistic representation over USE embeddings on small data sets might result from differences in their dimensionalities: 271 and 512, respectively. Furthermore, in the case of AMT, models trained on USE embeddings proved ultimately unsuccessful.
4.5 Cross-domain results
We also tested cross-domain performance: the models were trained on one data set and tested on other data sets. The results of 10-fold cross validation are presented in Table 7, where we can observe a substantial performance decrease of the classifiers within this test setting.
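The cross-domain protocol can be summarized by the sketch below, where each model is assumed to be a full pipeline (featurization plus classifier) so that it can be fitted on one data set and applied directly to another; the cross-validation averaging reported in Table 7 is omitted for brevity, and the names are illustrative.

```python
from sklearn.base import clone

def cross_domain_accuracies(model, datasets):
    # datasets: dict mapping data set name -> (X, y)
    results = {}
    for source, (X_src, y_src) in datasets.items():
        clf = clone(model).fit(X_src, y_src)          # train on the source data set
        for target, (X_tgt, y_tgt) in datasets.items():
            if target != source:                      # evaluate on every other data set
                results[(source, target)] = clf.score(X_tgt, y_tgt)
    return results
```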
The data set-specific accuracy of classifiers trained on Kaggle and ISOT exceeded 0.9. However, when the same classifiers were tested on data sets different from the one used for training, the accuracy dropped, in some cases to slightly below 0.5. This shows that those models are good at recognizing fake news within one data set, but not fake news in general, that is, independently of the training data. The most universal method, achieving the highest accuracy averaged over all train–test pairs, turned out to be LinearSVC on BOW. In general, the highest scores in the cross-domain setting were obtained for the ISOT-Kaggle pair due to the similarity of both data sets. The lowest accuracy of models trained on Kaggle and ISOT was achieved by XGBoost with linguistic features tested on the AMT data set: while Kaggle and ISOT are similar in terms of size and structure, they differ significantly from AMT. The relatively good cross-domain performance of LinearSVC on BOW does not mean that the topics were the same across data sets. Most likely, similar stylistic means were used in fake news across various data sets, which this classifier is able to recognize.
Models trained on AMT showed the most unpredictable results. LinearSVC with BOW features, which obtained very low accuracy in data set-specific settings, turned out to perform very well on ISOT data. This result is surprising, with 0.4 accuracy on its test data and 0.78 on the ISOT test set. We examined these results, looking for the words that have the greatest impact on the predicted label. The AMT’s most significant n-gram features were neutral words such as “their”, “be”, “this”, “they”, and “are”. This was due to the way the data set was created. For each true text, there is a corresponding false one on the same topic, as we explained in Section 3. These influential words are not as abundant in the AMT test set, but they are consistent with the fake news style in other data sets, which explains why the results vary so much.
As far as the linguistic features are concerned, the classifier trained and tested on AMT obtained good results (0.80). At the same time, when tested on Kaggle and ISOT, its accuracy was 0.42 and 0.38, respectively, which were the lowest cross-domain results. We can also observe that some classifiers trained on different data sets managed to perform better on AMT than classifiers trained on AMT. For instance, LinearSVC using bag-of-words achieved an accuracy of 0.57 on AMT when trained on FakeNewsNet, and 0.61 when trained on Kaggle, which was higher by 0.21 than its result in the data set-specific setting. This shows that the small size of the AMT data set hinders effective training.
On the LIAR data set, the highest cross-domain accuracy was 0.63, achieved by several classifiers.
In general, the decline in the quality of the model predictions in this scenario compared to the in-domain setting turned out to be very large. This phenomenon is at least partly due to the heterogeneous content of the data sets, containing texts of different nature, topics, and structure. Different ways of defining and creating false and true texts, as well as the time of their writing, may also explain this decline in quality.
5. Leave-one-out settings
Another interesting variation of the domain adaptation setting we decided to investigate is the leave-one-out scenario. In this strategy, a classifier is trained on all but one selected data set, which is further used for validation of the system. Such an approach might help smooth out some of the differences between the data sets as well as provide more training data. As the number of samples in the investigated data sets differs substantially, we decided to consider two approaches:
(1) All: all samples from the remaining data sets are used for training. Therefore, the size of the training corpus differs depending on which data set is left out for validation.
(2) Balanced: the training corpus is composed of the same number of samples from each data set. As the smallest data set—FakeNewsNet—contains only 182 samples, this number of randomly selected samples from each data set was used to create the training data. Therefore, it always contains $4\times182 = 728$ samples.
The results of this approach are presented in Table 8. Moreover, this table contains two columns—Average and Best—which refer to the results presented in Table 7: the best and averaged results obtained for corresponding models and test sets in the default cross-domain setting. In general, the results of the balanced setting are better than for models trained on all samples from data sets selected for training. It is worth noting that the amount of training data is relatively small when using this configuration. Typically the accuracies of models trained on a balanced mixture of data sets are higher than the average accuracies obtained in cross-domain settings reported in Table 7. The leave-one-out settings resulted in the highest accuracy in only two cases: XGBoost classifier using linguistic features tested on FNN, and LinearSVC classifier utilizing USE embeddings validated on ISOT.
6. Fact-checking
A completely different approach to detecting fake news is automated fact-checking. A discussion of possible approaches is presented in Section 2.1. Here, we discuss the method we applied to fact-check the data sets described in Section 3. The system consists of a knowledge base that is considered to be true, an information retrieval module to look up relevant articles, and a neural network trained on the FEVER data to perform inference (Thorne et al. Reference Thorne, Vlachos, Christodoulopoulos and Mittal2018).
The selected system is the one presented by Team Domlin (Stammbach and Neumann Reference Stammbach and Neumann2019) during the second edition of the FEVER competition (Thorne et al. Reference Thorne, Vlachos, Cocarascu, Christodoulopoulos and Mittal2019). The system was built using the FEVER data set (Thorne et al. Reference Thorne, Vlachos, Christodoulopoulos and Mittal2018) for the FEVER competition, which requires, for a given claim, either finding evidence that supports or refutes the claim or, if insufficient evidence is found, classifying it as ‘Not Enough Info’ (NEI). The database for evidence searching is Wikipedia. The Domlin system uses BERT representations (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019) to select from Wikipedia the sentences that are most relevant to the given claim, and then another BERT model to decide whether the gathered evidence supports or refutes the claim. This model achieved one of the best results in the FEVER competition and is able to find evidence for claims outside of the FEVER data set (i.e., it is able to generalize), although the usage of Wikipedia as a source of ground truth data has severe limitations.
Fact-checking systems often process sentence-level information. The data sets in Section 3 (except LIAR) typically contain many sentences per text and the veracity is attributed to the whole text, not individual sentences. In order to compute text-level accuracy of fact-checking, we implemented the following principle: if a text contains at least one sentence labeled as ‘Refuted’ (presumably a false sentence), classify the whole text as false. If a text contains at least one sentence labeled as ‘Supported’ (presumably a true sentence) and there is no sentence labeled as ‘Refuted’, classify the text as true. Texts with contradictory labels were classified as fake due to the fact that fake pieces of news are often mixtures of true and fake information.
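A minimal sketch of this text-level aggregation rule over per-sentence verdicts:

```python
def aggregate_verdicts(sentence_labels):
    # sentence_labels: per-sentence verdicts: 'Supported', 'Refuted', or 'Not Enough Info'
    if "Refuted" in sentence_labels:
        return "fake"    # any refuted sentence, including mixed/contradictory cases
    if "Supported" in sentence_labels:
        return "true"
    return None          # all sentences NEI: excluded from the accuracy computation
```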
Due to long processing times, we had to limit the size of the large data sets (ISOT, Kaggle, and LIAR). Each was randomly sampled for 1000 true and 1000 fake texts. Smaller data sets (FakeNewsNet and AMT) were used in full. Table 9 reports the numbers of texts with at least one sentence labeled other than ‘Not Enough Info’ (NEI) by the Team Domlin system (Texts w/o NEI-only), as well as the accuracy achieved on those texts (Accuracy w/o NEI-only). In other words, for computing accuracy we removed all texts where all sentences obtained NEI labels.
The results achieved on these data sets are far from satisfactory. The biggest problem is that most of the sentences were assigned ‘Not Enough Info’ (NEI) labels by the Team Domlin system. The share of sentences with labels other than NEI, suitable for veracity inference, was between 1% and 2% depending on the data set. This translates to a low number of texts with at least one non-NEI sentence, far too low to consider this form of veracity verification satisfactory for real-world application. One can hypothesize that the biggest obstacle is that the tested fact-checking system is limited to the content of Wikipedia, which is neither sufficient nor suitable for verifying real claims found on the internet.
7. Feature selection
Feature selection is often used to design more accurate models and seek explainable architectures. It also can serve as a regularization technique since it decreases model dimensionality. In this section, we describe feature selection algorithms that were applied to improve cross-domain performance. First, we analyzed a subset of features that were important across all data sets, and next we designed a classifier based on the intersection of the top features.
We compared two feature selection algorithms: Mutual Information (MI) (Kraskov, Stögbauer, and Grassberger Reference Kraskov, Stögbauer and Grassberger2004) and Minimum Redundancy Maximum Relevance (MRMR) (Peng, Long, and Ding Reference Peng, Long and Ding2005). The former is a univariate filter method: each feature is evaluated individually according to a specific criterion, and no interactions between features are considered. Then, the most relevant features are selected, which is called a maximum-dependency or maximum-relevance scheme. Correlation or mutual information is typically used as the measure of a feature’s importance, based on which features are ranked. We decided to use Mutual Information, which measures the dependency between variables (in our case, between each feature and the target) based on entropy estimates computed from the distances to the k nearest neighbors. We used the implementation provided by the scikit-learn library (Pedregosa et al. Reference Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot and Duchesnay2011).
However, certain studies (Peng et al. Reference Peng, Long and Ding2005) have shown that combining individually good features does not always result in good classification performance. MRMR is a heuristic algorithm that aims at finding a subset of features that is close to the optimal one. Contrary to MI, it also takes into account the correlations between variables. Features are selected in such a way that they correlate very strongly with the target (maximum-relevance), but are mutually as dissimilar to each other as possible (minimum-redundancy condition). The introduction of the latter condition often enriches the information provided by MI.
In the first experiment, we selected n top features for each combination of feature space, selection method, and data set. Next, we checked which features were common among all data sets. We tested different values of n on both linguistic and bag-of-words features. The whole process along with the achieved results is presented in Subsection 7.1.
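A sketch of this first experiment for the Mutual Information case, assuming each data set is already represented as a feature matrix with associated feature names (shared names for linguistic features, corpus-specific n-grams for BOW); the MRMR variant would replace the scoring step.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_n_intersection(datasets, n=50):
    # datasets: dict mapping data set name -> (X, y, feature_names)
    top_sets = []
    for name, (X, y, feature_names) in datasets.items():
        mi = mutual_info_classif(X, y)               # relevance of each feature to the label
        top_idx = np.argsort(mi)[::-1][:n]           # indices of the n highest-scoring features
        top_sets.append({feature_names[i] for i in top_idx})
    return set.intersection(*top_sets)               # features common to all data sets
```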
The second experiment utilized the cross-domain feature selection mechanism described above to enhance the performance of the model. The model was trained on one data set and tested on another, but only features from the intersection of the top features from the remaining three data sets were used for training. For the rationale behind such an approach as well as a detailed description of the procedure, see Subsection 7.2.
7.1 Intersection of top features
During the search for features that are important for fake news detection across all data sets, we investigated the intersection of top n (n = 20, 50, 100, 150, 200) linguistic features from each data set. As Table 10 shows, the Mutual Information method—compared with MRMR—resulted in a considerably lower number of important common features across all five data sets. The intersection of the top 20 features turned out to be empty and there was only one feature in the top-50 intersection: TIME. This category comes from the General Inquirer dictionary and describes words indicating time consciousness. Selected features are shown in Table 11.
When the MRMR method was used, two POS features turned out to be present in the intersection of the top 20 features: the percentage of nouns and adjectives in the text. The third feature found (complex_words) was the ratio of complex words to all words in the document. This is one of the indicators used to measure text readability.
Figure 6 shows the frequencies of three relevant parts of speech: nouns, adjectives, and numerals. The importance of the percentage of nouns in fake news classification is in line with the research by Horne and Adali (Reference Horne and Adali2017). Their study revealed that the number of nouns is lower in fake news than in real news, and close to that found in satire. Figure 6a illustrates that all data sets except LIAR showed this trait. The LIAR data set consists of claims collected from transcripts, speeches, news stories, press releases, and campaign brochuresFootnote l by PolitiFact journalists, which makes the LIAR texts different in length and style from the other data sets.
When it comes to the average percentage of adjectives in a sentence (Figure 6b), we can observe differences in feature distribution between classes for the Kaggle and ISOT data sets. However, in the Kaggle data set fake news had more adjectives than real news, while in the ISOT data set it was the opposite—real news had on average more adjectives than fake news. It is therefore hard to draw any conclusions about the role of adjectives in distinguishing fake and real news based on these facts alone. One possible continuation of this research may be to perform expert analysis of the journalistic style traits present in different data sets, which may explain some discrepancies in feature distribution. However, our aim is to explore the possibility of a universal classification model in the area of fake news detection, so we put aside differences in news writing styles to be examined in more depth in future research.
Aside from the number of complex words, other readability indicators also proved important for fake news classification. The intersection of the top 50 features from each data set (30 features) contained 23% readability-related features, whereas readability indicators made up only 7% of all features. The well-known readability measures shown in Table 11 include the Gunning–Fog index (Gunning Reference Gunning1969), the number of long words, and the type-token ratio (lexical diversity), among others.
We can hypothesize that, to fulfill their role as misinformation, fake messages need to be easily understood, whereas real news seeks to express the complexity of the reported events and thus may require more sophisticated language. The importance of readability in fake news detection is also confirmed by Pérez-Rosas et al. (Reference Pérez-Rosas, Kleinberg, Lefevre and Mihalcea2018) and Horne and Adali (Reference Horne and Adali2017). Moreover, it has also been shown to hold for other languages (Santos et al. Reference Santos, Pedro, Leal, Vale, Pardo, Bontcheva and Scarton2020). However, the hypothesis about the source of differences in text readability between fake and real news should be further investigated via sociolinguistic studies.
Analogously, we investigated the intersection of bag-of-words features. We checked how many of the top n features (n = 20, 50, 100, 150, 200, 300, 500) were common to all data sets. The results are presented in Table 10. As we can see, the two feature selection methods yielded distinctly different results. Mutual Information produced intersections covering between 39% and 57% of the top features. For example, of the top 20 features selected for each data set, half (10) were common to all data sets. This may be surprising given that the BOW feature space is vast, containing thousands of n-grams. For MRMR, we took into consideration only the 5000 most frequent features in a given corpus to prevent the algorithm from selecting rare ones. Nevertheless, the set of common features selected with the MRMR algorithm was empty for every value of n.
In Figure 6c, we can see the general tendency of fake news to contain fewer cardinal numerals, which holds for every data set. In Horne and Adali (Reference Horne and Adali2017), the authors observed that fake news contains fewer technical and analytical words. Our findings are in line with their study showing that fake news is less likely to present concrete numerical data, which are rather common in technical or analytical writing.
The results of the MI feature selection on BOW data are consistent with the observations made for linguistic features (see Table 11), where the percentages of pronouns, prepositions, and conjunctions per text were selected as important linguistic features common to every data set.
Furthermore, we checked the distribution of these features within each class and observed a tendency for fake news to contain on average more pronouns than real news, which was especially visible in the AMT, ISOT, and FNN data sets. A more in-depth analysis of pronoun distributions in fake versus real news conducted by Yang et al. (Reference Yang, Zheng, Zhang, Cui, Li and Yu2018) showed fewer first- and second-person pronouns in fake news, but a higher number of third-person pronouns. Considering the average number of prepositions, we observed that they were dominant within the real news class in the LIAR, Kaggle, and ISOT data sets; only AMT showed the opposite pattern. According to Santos et al. (Reference Santos, Pedro, Leal, Vale, Pardo, Bontcheva and Scarton2020), prepositions are indicators of text cohesion, which that work studies further for Brazilian Portuguese. Our result therefore motivates similar research for English as well.
The purpose of this experiment was to investigate, using five different data sources, whether a classifier can be built that shows similar accuracy regardless of the origin of the test set. We used feature selection methods to find out which features are good indicators of fake or real news across all five data sets. Further, we compared selected feature distributions between classes within the studied data sets. In some cases (nouns, pronouns, cardinal numbers, prepositions) we found traits common to most of the data sets, whereas in other cases (e.g. adjectives) a feature played an important role for most data sets but the class it indicated differed between data sets. Plausible continuations of this research include investigating the differences between journalistic styles and comparing them to the style of fake news, or clustering fake news based on linguistic features to uncover differences within the style of fake news itself.
7.2 Intersection-based classifier
In this subsection, we present the results of classifiers based on the intersection of the top features. We trained a model on one data set using the selected features and tested it on another. The feature selection procedure was similar to the one presented in the previous subsection; however, only three data sets were used to create the top feature intersection instead of five. The test set must not take part in the creation of the model, so it could not be involved in feature selection. The training set was also excluded, since selecting top features shared by several other data sets is meant to reduce the bias towards the training data and yield more universal features.
Moreover, using fewer data sets in the feature selection process resulted in a larger number of selected features. For example, the five-data set intersection of the top 50 linguistic features selected with the MI algorithm contained only one feature (see Table 10), whereas for three data sets the number of features varied between 1 and 7 (depending on the selected data sets). Similarly, with the same settings for bag-of-words features, the intersection over all five data sets contained 30 items (see Table 10), while the number of common features for three data sets varied between 31 and 42.
For the intersection-based classifiers, we chose the same algorithms as those described in Subsection 4.2: XGBoost on linguistic features and LinearSVC on BOW. We tested them in a cross-domain manner, as in Subsection 4.5, but using only selected features. For each train-test pair, a set of features was selected based on the intersection of the top features of the three remaining data sets. For linguistic classifiers, we considered two methods of selecting top features: Mutual Information and Minimum Redundancy Maximum Relevance. With the bag-of-words feature space the MRMR method selected very different features for each data set, so the intersection was empty (even for only three data sets). Therefore, we could not create an intersection-based classifier for BOW using MRMR and we used only the Mutual Information algorithm.
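The resulting evaluation loop can be summarized by the sketch below, reusing the hypothetical top_n_features helper from Subsection 7.1; the data set dictionary and feature_names list are again placeholders rather than our actual pipeline:

    from itertools import permutations
    from xgboost import XGBClassifier

    # datasets = {"LIAR": (X, y), "Kaggle": (X, y), "ISOT": (X, y), "AMT": (X, y), "FNN": (X, y)}  # placeholders
    results = {}
    for train_name, test_name in permutations(datasets, 2):
        rest = [d for d in datasets if d not in (train_name, test_name)]
        # intersect the top features of the three data sets not involved in this pair
        feats = set.intersection(*(top_n_features(*datasets[d], feature_names, n=100) for d in rest))
        cols = [feature_names.index(f) for f in sorted(feats)]
        X_train, y_train = datasets[train_name]
        X_test, y_test = datasets[test_name]
        clf = XGBClassifier().fit(X_train[:, cols], y_train)
        results[(train_name, test_name)] = clf.score(X_test[:, cols], y_test)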
We adopted average gain as the measure of the overall performance of a given feature selection setting. To calculate the overall performance, we first subtracted full-feature model cross-domain results (Table 7) from the results of the intersection-based classifiers for each train–test pair. Then, we added up the individual gains and divided the result by the total number of pairs (20) to obtain the average gain.
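In code, this amounts to a simple average over the 20 train–test pairs (both accuracy tables below are hypothetical dictionaries keyed by pair):

    # full_results and intersection_results map (train, test) pairs to accuracy   # placeholders
    gains = [intersection_results[pair] - full_results[pair] for pair in full_results]
    average_gain = sum(gains) / len(gains)   # 20 pairs for five data sets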
Figure 7 shows the average gains for varying numbers of top features (n). We can observe that for the MRMR method, none of the investigated values of n provided any improvement in the results. As for the Mutual Information method, n = 100 provided the best results for both linguistic and BOW features. On average, the accuracy improved by 4% and 1%, respectively.
Table 12 presents the cross-domain results of classifiers trained on the intersection of the top 100 features selected with the Mutual Information method. In the last column, we present the average gain for all pairs with a given data set used for training, while the last row presents the average gain for all pairs when the data set was used for testing.
Linguistic classifiers showed an overall greater improvement (4% averaged over all train–test pairs) than the ones based on BOW (average gain of 1%). The accuracy improved for almost every training set, with the most significant gains for Kaggle (6%) and AMT (11%). This demonstrates that a carefully designed feature space, taking into account feature importance computed on several different data sets, can improve the generalization of a model trained on a particular data set. Although the BOW classifier had a lower average gain than the linguistic one, it showed higher accuracy, especially when tested on Kaggle, ISOT, and AMT. As the BOW classifier performed better from the beginning, there was less room for improvement, which may explain why this method was less effective in improving cross-domain performance.
8. Cross-domain performance enhancements
We also investigated more complex methods designed to compensate for the different distributions of training and test data. A whole group of domain adaptation techniques, closely associated with machine learning and transfer learning, specifically addresses this problem. The goal of such algorithms is to train a model on labeled data from a source domain so that it performs well on different (but related) target data.
In supervised learning (without domain adaptation), we usually assume that both training and test examples are drawn from an identical or very similar distribution. In the domain adaptation scenario, we consider two different (but related) distributions: one for the source data and one for the target data. The domain adaptation task then consists of transferring knowledge from the source domain to the target domain while minimizing the error on the target domain.
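Stated a little more formally (the notation below is ours, added for illustration rather than taken from the original experiments), the learner observes labeled samples from a source distribution $D_S$ and only unlabeled samples from a related target distribution $D_T$, yet aims to minimize the target risk:
\begin{equation*} \epsilon_T(h) = \mathbb{E}_{(x,y)\sim D_T}\,\ell\big(h(x), y\big), \qquad D_S \neq D_T, \end{equation*}
where $h$ is the learned classifier and $\ell$ is a classification loss; only labels drawn from $D_S$ are available during training.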
In the machine learning community, issues similar to domain adaptation have been studied under the term “dataset shift” (Quionero-Candela et al. Reference Quionero-Candela, Sugiyama, Schwaighofer and Lawrence2009). This well-known problem of predictive modeling occurs when the joint distribution of inputs and outputs differs between the training and test stages. A data set shift may have multiple causes, ranging from a bias introduced by the experimental design to the irreproducibility of the testing conditions at the training stage. In our case, the data set shift was related to the different data sources and time periods of the data sets as well as the varying topics and structure of the texts. Domain adaptation (i.e., counteracting data set shift) often involves matching distributions so that the training (source) data distribution more closely matches that of the test (target) data. As our aim was to improve cross-domain performance, we used one data set as the source domain (train set) and another as the target domain (test set). The results of our experiments are described in Sections 8 and 9. We used 10-fold cross-validation and performed statistical significance tests.
The selected machine learning approaches that pursue this goal are grouped into the following types and described in depth below:
Instance re-weighting – The goal of instance re-weighting (IRW) is to use the source data for training while optimizing performance on the target data (Jiang and Zhai Reference Jiang and Zhai2007). Such optimization can be based on assigning higher weights to training instances that are close to the target instances, the assumption being that they are of greater importance for classifying the target data. Re-weighting each source instance in this way approximates risk minimization under the target distribution. We applied this approach using an XGBoost classifier on linguistic features and LinearSVC on USE embeddings. We used the implementation from the libTLDA libraryFootnote m and selected the optimal loss functions. A minimal sketch of the re-weighting idea is given after this list.
Common representation space – Another approach to domain adaptation is based on finding a domain-invariant feature space. This can be achieved by creating subspaces for both domains and aligning the source one with the target one. We tested two methods belonging to this class:
Subspace alignment (SA) (Fernando et al. Reference Fernando, Habrard, Sebban and Tuytelaars2013): one of the most straightforward methods of finding common subspaces. For each domain, the first d principal components, $C_{src}$ and $C_{trg}$, are computed. A linear transformation matrix that aligns the source components to the target components is defined as $M=C_{src}^{T}C_{trg}$. The adaptive classifier first projects the data from each domain onto its corresponding components and is then trained on the projected source data transformed using the matrix M. Again, we used the implementation from the libTLDA library and tested various loss functions and numbers of components. A sketch of the alignment step is also given after this list.
Geodesic flow kernel (GFK) (Gong et al. Reference Gong, Shi, Sha and Grauman2012): a more sophisticated method based on the assumption that a manifold of transformations exists between the source and the target domains. This path is defined by the projection matrices, $\Phi(t)$ , where $t\in[0,1]$ . At $t=0$ the projection consists purely of the source components $C_{src}$ , while at $t=1$ it consists exclusively of the target components $C_{trg}$ . Geodesic flow kernel incorporates the entire path of these transformations forming a kernel:
\begin{equation*} G(x_i, x_j) = \int_0^1 x_i\Phi(t)\Phi(t)^T x_j^T \,\mathrm{d} t, \end{equation*}
where $x_i$ and $x_j$ are two feature vectors. The resulting kernel can be used to construct any kernelized classifier, such as a support vector machine.
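The two sketches below illustrate the ideas described above; they are simplified stand-ins for the libTLDA implementations we actually used, and all data arrays (X_source, y_source, X_target) are hypothetical placeholders. The first approximates instance re-weighting by training a probabilistic domain discriminator and using its odds as importance weights for the source instances:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from xgboost import XGBClassifier

    def importance_weights(X_source, X_target):
        """Estimate p(target | x) / p(source | x) with a domain discriminator."""
        X_dom = np.vstack([X_source, X_target])
        y_dom = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
        disc = LogisticRegression(max_iter=1000).fit(X_dom, y_dom)
        p_target = disc.predict_proba(X_source)[:, 1]
        return p_target / np.clip(1.0 - p_target, 1e-6, None)

    weights = importance_weights(X_source, X_target)           # emphasize target-like source documents
    irw_clf = XGBClassifier().fit(X_source, y_source, sample_weight=weights)

The second reproduces the subspace alignment step with PCA components and a linear SVM trained on the aligned source data:

    from sklearn.decomposition import PCA
    from sklearn.svm import LinearSVC

    d = 50                                                     # number of components (assumed)
    C_src = PCA(n_components=d).fit(X_source).components_.T   # shape: (n_features, d)
    C_trg = PCA(n_components=d).fit(X_target).components_.T
    M = C_src.T @ C_trg                                        # alignment matrix

    Z_source = X_source @ C_src @ M                            # source projected and aligned to the target subspace
    Z_target = X_target @ C_trg
    sa_clf = LinearSVC().fit(Z_source, y_source)
    predictions = sa_clf.predict(Z_target)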
8.1 Results of domain adaptation methods
The results of the domain adaptation techniques described above are presented in Tables 13 and 14. They were applied to two feature spaces: linguistic features and embeddings obtained from the Universal Sentence Encoder. Figure 8 visualizes the change in classification accuracy for each train–test pair upon application of the domain adaptation methods. Positive values are marked in red, while negative values are marked in blue. As a reference, XGBoost results were used for classifiers trained on linguistic features, and LinearSVC results for models utilizing USE embeddings.
Unfortunately, there seems to be no universal method that would guarantee an improvement for all configurations of the train–test pairs. While there was a significant increase in accuracy for some pairs, others showed a decrease in performance. In general, the domain adaptation methods provided a higher improvement for linguistic classifiers than for the ones based on USE embeddings. The highest overall improvement (averaged over all train–test pairs), equal to 3%, was achieved by subspace alignment applied to a classifier based on linguistic features.
9. Deep domain adaptation
We also explored selected deep domain adaptation methods on the task of fake news recognition. Neural networks are powerful tools due to their ability to learn and recognize patterns. Thanks to their high capacity, they have gained immense popularity and achieved state-of-the-art results in numerous NLP tasks such as machine translation, named entity recognition, language modeling, and text classification. However, the performance of deep learning models can also suffer from domain shift. Therefore, much research has been devoted to methods that adapt neural networks trained on a large amount of labeled source data to a target domain for which no labels are available. Such an approach avoids labeling data from the target domain, which often requires considerable resources. Most deep domain adaptation methods were originally designed and tested for computer vision tasks such as object detection and classification (digits, traffic signs, etc.). These methods aim to adapt models trained on one type of image so that they perform well on other types, for instance synthetic images versus real photographs taken under different conditions (presence of background, different lighting, etc.). An extensive review of deep domain adaptation methods was conducted by Wilson and Cook (Reference Wilson and Cook2018). Domain adaptation algorithms have also been tested in the field of natural language processing. They have been applied mainly to text classification (Liu, Qiu, and Huang Reference Liu, Qiu and Huang2017), including sentiment analysis, but also to relation extraction (Fu et al. Reference Fu, Nguyen, Min and Grishman2017) and machine translation (Chu and Wang Reference Chu and Wang2018).
We investigated the following deep domain adaptation techniques:
Deep correlation alignment (CORAL, Sun and Saenko Reference Sun and Saenko2016): an unsupervised domain adaptation method that aligns the second-order statistics of the source and target distributions. The nonlinear transformation is learned by a deep neural network through an additional loss term (the CORAL loss), which measures the distance between the covariances of the learned source and target features; a sketch of this loss is given after this list.
Adversarial dropout regularization (ADR, Saito et al. Reference Saito, Ushiku, Harada and Saenko2018): an extension of the idea of reducing the discrepancy between the source and target feature distributions by adversarial training, proposed independently by Tzeng et al. (Reference Tzeng, Hoffman, Zhang, Saenko and Darrell2014) and Ganin and Lempitsky (Reference Ganin and Lempitsky2015). The originally proposed model consists not only of a feature extractor (a neural network) and a label predictor, but also of a domain classifier (critic) connected to the feature extractor. The network parameters are optimized to minimize the loss of the label classifier while maximizing the loss of the domain classifier, which forces the feature extractor to generate domain-invariant features. Saito et al. (Reference Saito, Ushiku, Harada and Saenko2018) replaced the domain critic with dropout on the label predictor network, so the label predictor acts as both the main classifier and the domain classifier. Since dropout is random, two instances of the label classifier with different nodes zeroed out may produce different predictions, and the difference between these predictions can be viewed as a critic. During adversarial training, the domain classifier tries to maximize this difference, while the feature generator tries to minimize it. Hence, the feature extractor is encouraged to generate more discriminative features for the target domain that lie away from the decision boundary.
Virtual adversarial domain adaptation (VADA, Shu et al. Reference Shu, Bui, Narui and Ermon2018): a technique based on the cluster assumption, which states that data tend to form clusters and that samples from the same cluster are likely to share the same label. This implies that decision boundaries should not cross high-density regions. The proposed model combines domain adversarial training with a penalty term that punishes violations of the cluster assumption.
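For illustration, the CORAL loss mentioned above can be written in a few lines; the numpy sketch below computes it on two batches of feature activations (in the deep setting it would be applied to hidden-layer outputs inside an autodiff framework, which this simplified version does not show):

    import numpy as np

    def coral_loss(H_source, H_target):
        """Squared Frobenius distance between feature covariances,
        normalized by 4*d^2 as in Sun and Saenko (2016)."""
        d = H_source.shape[1]
        C_s = np.cov(H_source, rowvar=False)
        C_t = np.cov(H_target, rowvar=False)
        return np.sum((C_s - C_t) ** 2) / (4 * d * d)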
9.1 Results of deep domain adaptation methods
The selected deep domain adaptation methods were applied to two models whose architectures are depicted in Figure 2. One of them utilized USE text embeddings, while the other was based on a bi-directional LSTM using GloVe embeddings. In the context of domain adaptation, these models (except the part devoted to the final classification) serve as feature extractors. The performance of the deep domain adaptation methods was evaluated for each source–target (train–test) pair of data sets. The entirety of one data set was used as the source, while the other was used as the target.
The adaptation methods were implemented in such a way that one batch of the source data and one batch of the target data were processed simultaneously, and the model was updated after each pair of batches. Therefore, the batch sizes were adjusted to make the number of batches equal for both collections. The batch sizes were calculated as $\left\lceil {0.9\,dataset\_size/num\_of\_batches} \right\rceil$ for the source and $\left\lceil {dataset\_size/num\_of\_batches} \right\rceil$ for the target. The factor 0.9 originates from the 10-fold cross-validation procedure. In our experiments, we set the parameter $num\_of\_batches$ to 30. For the domain adaptation methods applied to the neural network based on USE embeddings, the number of epochs was set to 30. For the bi-LSTM utilizing GloVe embeddings, we used 10 epochs due to the long training time.
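A trivial sketch of this batch-size computation (the data set sizes are placeholders) is:

    from math import ceil

    num_of_batches = 30
    source_size, target_size = 10_000, 4_000                  # hypothetical data set sizes
    source_batch = ceil(0.9 * source_size / num_of_batches)   # 0.9: the training folds of 10-fold CV
    target_batch = ceil(target_size / num_of_batches)
    # here: source_batch == 300 and target_batch == 134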
The results of the deep domain adaptation methods are presented in Tables 15 and 16. Figure 9 illustrates the improvement achieved by applying different deep domain adaptation methods for each source–target pair of data sets. As a reference, we used the results that were obtained by the same classifiers, but without implementing any deep domain adaptation techniques.
On the whole, the domain adaptation methods did not provide any consistent improvement in classification accuracy for all pairs of data sets. In some cases, the results turned out to be lower than before domain adaptation. Such instances are marked in blue.
In the case of the model built on bidirectional LSTM architecture that takes GloVe embeddings as the input, the domain adaptation techniques provided some improvement for less than half of the source–target pairs. The overall gain for all the tested methods (ADR, CORAL, VADA) was close to zero.
The maps visualizing the gain in accuracy vary depending on the adaptation method and the model that was used; still, some common features can be observed.
First, all the investigated deep domain adaptation methods improved classification accuracy for the two smallest data sets (AMT and FakeNewsNet) when Kaggle was used as the source. At the same time, the highest decrease in accuracy was observed for the ISOT–Kaggle pair, the two largest data sets. Second, when LIAR was used as the source, hardly any change was observed.
In the case of the model utilizing USE embeddings, the results were slightly better. They could be further improved by selecting the number of epochs depending on the source–target pair and the method used.
Both VADA and adversarial dropout regularization contributed to improved accuracy when large data sets such as Kaggle and ISOT were used as a source.
To conclude, domain adaptation methods achieved higher improvement when applied to models utilizing USE embeddings.
10. Conclusions
10.1 Method ranking
Comparing all the results presented in our article is difficult: there are many combinations of training and test sets, and the situation is further complicated by the multiplicity of classification algorithms and feature spaces. To enable comparison, this section presents an aggregated view of our results in the form of a ranking that summarizes the performance of the methods presented in this article. We constructed the ranking as follows. For every train–test pair, we checked which three methods and feature space types worked best. We awarded three points for the best method or feature space, two points for the second-best score, and one point for the third-best score. The sum of these points is our general ranking score.
To account for differing numbers of occurrences, we also counted how many times a given method and feature space was used and, for normalization, divided the general score by that count. This normalized score is arguably the best view of our ranking results.
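The scoring procedure can be sketched as follows (the pair_results structure holding per-pair accuracies of each method is a hypothetical placeholder):

    from collections import Counter

    points, used = Counter(), Counter()
    # pair_results[(train, test)] is a list of (method, accuracy) tuples   # placeholder
    for pair, entries in pair_results.items():
        for method, _ in entries:
            used[method] += 1                                  # how many times the method was applied
        top_three = sorted(entries, key=lambda e: e[1], reverse=True)[:3]
        for award, (method, _) in zip((3, 2, 1), top_three):
            points[method] += award                            # 3/2/1 points for the best three

    normalized = {m: points[m] / used[m] for m in used}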
We evaluated only three feature types: GloVe, USE, and linguistic features, as these were the most important feature space types across the methods used.
Tables 17, 18, and 19 present rankings for methods applied to three feature types. Feature Intersection represents the classifier based on the intersection of the top features as in Section 7.2.
Table 20 compares all three feature types according to the ranking procedure. The results reveal linguistic features as the best performing feature type. Combined with the Feature Intersection classifier, it is likely the most versatile cross-domain approach for fake news detection.
10.2 General discussion
This paper investigated the possibility of designing a universal system for fake news detection purely from text that does not use fact-checking or knowledge bases. Promising preliminary results obtained for classifiers trained and tested on five publicly available fake news data sets (in-domain) turned out to be misleading. Simple cross-domain experiments revealed a significant decrease in accuracy: models trained on texts from one data set provided significantly lower results when evaluated on other data sets.
Therefore, we investigated a variety of methods to suppress this unwanted behavior. We compared domain adaptation techniques from three categories: feature selection, machine learning methods, and deep learning methods. The feature selection approach is based on training models on a common set of features identified as important for several data sets. The cross-domain machine learning approaches tested in this paper address the phenomenon of data set shift. The deep learning methods are the newest; they were designed for cross-domain scenarios and have been applied mostly in computer vision and graphics. To our knowledge, this paper is the first to compare all of these approaches on a fake news detection task.
We trained models on four types of features: bag-of-words vectors, 271 linguistic and psycholinguistic features, GloVe embeddings, and text embeddings obtained from the Universal Sentence Encoder (USE).
In the in-domain scenario, the best accuracy varied between 0.68 (LIAR) and 0.99 (Kaggle, ISOT). The accuracy depended on the data set size and text length. To illustrate the differences between the data sets, we compared the relationship between classification accuracy and the number of training samples. This analysis revealed that models learn differently not only on different data sets, but also when different text representations are used. This is most pronounced for AMT: only models trained on linguistic representations achieved reasonable results. Furthermore, the accuracy of models trained and evaluated on short single claims, such as those in LIAR, did not exceed 0.68. It seems that the veracity of such short texts can be assessed only by fact-checking systems. This can be explained by the fact that the extraction of stylometric and psycholinguistic features from short statements is prone to large errors.
The cross-domain scenario, in which models were trained on one data set and tested on another, revealed sharp drops in accuracy: an accuracy of over 0.9 dropped in many cases to 0.6 or even 0.5. The best-performing solutions were LinearSVC on BOW, followed by Bi-LSTM on GloVe and Dense on USE. Contrary to our initial expectations, linguistic features generalized worst to other data sets.
Our feature selection results emphasized the role of adjectives and numbers in detecting fake news. Perhaps the most optimistic observation was that by using Mutual Information and the intersection of the top 100 linguistic features, the cross-domain performance of XGBoost models can be raised by 4%. Linguistic classifiers showed an overall greater improvement (4% averaged over all train–test pairs) than BOW classifiers (average gain of 1%).
The machine learning adaptation methods provided a higher improvement for linguistic classifiers than for the ones based on USE embeddings. However, the results varied significantly between pairs of data sets. The highest overall improvement (averaged over all train–test pairs), equal to 3%, was achieved by subspace alignment applied to a classifier based on linguistic features. The success of adaptation methods on classifiers based on linguistic features may be ascribed to their initially good in-domain performance followed by a considerable drop in cross-domain settings. The good data set-specific results of these classifiers suggest that linguistic features capture text characteristics relevant to the fake news detection task; hence, the domain adaptation methods were able to adjust these features to fit different data sets.
Deep domain adaptation techniques turned out to be successful only for some source–target pairs. Unfortunately, none of the tested deep domain adaptation techniques provided consistent improvement for the model built upon a bidirectional LSTM architecture using GloVe embeddings as input; the average gain was close to zero.
The main goal of our paper was to explore the limits of cross-domain fake news detection. In summary, the conclusion from the cross-domain scenarios is that the best versatility can be achieved by training models on the Kaggle data. This data set is the second largest in terms of the number of documents and the largest in terms of document size. Thanks to these factors, the accuracy of models trained on it is the highest even when tested on other data sets, including data that differ greatly in topics, length, and structure, and even in the definition of “fakeness”.
The most promising techniques are based on linguistic features and rely on either training models on the intersection of the top 100 features (4% average gain and the best performance according to the ranking approach) or subspace alignment (3% average gain).
From a broader perspective, the lessons learned on the path towards versatile fake news detection are as follows. Due to its extremely low recall, automated fact-checking (here meaning systems similar to those submitted to the FEVERFootnote n competition: based on Wikipedia and neural models for natural language inference) should not be considered a feasible option; even as one of several modules, its impact would be marginal. As a universally applicable system based on a single model, one can envision a model trained on the intersection of linguistic features.Footnote o
One can extend this further by utilizing data set-specific results and designing a multi-model system. For instance, on the AMT data set linguistic features achieved an in-domain accuracy of 0.80. Articles similar to AMT (in terms of topics and length) could be handled by the AMT-trained model using linguistic features only, applying one of the two cross-domain adaptation methods mentioned previously. A similar procedure could lead to many dedicated models handling various types of texts, perhaps keeping the Kaggle-trained model as a default or back-off option. These are hypothetical research directions that can be taken up in future studies but are beyond the direct scope of our paper. Testing these ideas requires a new data set that includes samples of many possible forms of fake news.
It is worth noting that models without a fact-checking component may be vulnerable to attacks by adversarial generative models trained to modify fake or deceptive articles so that they appear genuine.
Overall, overcoming the problem of differences between fake news data sets poses a major challenge that is much less studied than other areas in natural language processing, which makes our paper an important contribution.
Acknowledgements
This work was supported by the Poznan Supercomputing and Networking Center grant number 442.