1. Introduction
Research in areas that require reasoning over and understanding unstructured natural language text is advancing at an unprecedented rate. Novel neural architectures (Vaswani et al. 2017) enable efficient unsupervised training on large corpora to obtain expressive contextualised word and sentence representations as a basis for a multitude of downstream NLP tasks (Devlin et al. 2019). They are further fine-tuned on task-specific, large-scale datasets (Bowman et al. 2015; Rajpurkar et al. 2016; Williams, Nangia and Bowman 2018), which provide sufficient examples to optimise large neural models that are capable of outperforming human-established baselines on multiple Natural Language Understanding (NLU) benchmarks (Raffel et al. 2020; Lan et al. 2020). This seemingly superb performance is used as a justification to credit those models with various NLU capabilities, such as numeric reasoning (Dua et al. 2019b), understanding the temporality of events (Zhou et al. 2019) or integrating information from multiple sources (Yang et al. 2018).
Recent work, however, casts doubts on the capabilities obtained by models optimised on these data. Specifically, the data may contain exploitable superficial cues: for example, the most frequent answer to questions of the type ‘How many…’ is ‘2’ in a popular numeric reasoning dataset (Gardner et al. 2020), and the occurrence of the word ‘no’ is correlated with non-entailment in large Recognising Textual Entailment (RTE) datasets (Gururangan et al. 2018). Models are evaluated following the usual machine learning protocol, where a random subset of the dataset is withheld for evaluation under a performance metric. Because the subset is drawn randomly, these correlations exist in the evaluation data as well and models that learn to rely on them obtain a high score. While exploiting correlations is in itself not a problem, it becomes an issue when they are spurious, that is, they are artefacts of the collected data rather than representative of the underlying task. As an example, answering ‘2’ to the question ‘How many…’ is evidently not representative of the task of numeric reasoning.
A number of publications identify weaknesses of training and evaluation data and investigate whether optimised models inherit them. Meanwhile, others design novel evaluation methodologies that are less prone to the limitations discussed earlier and therefore establish more realistic estimates of various NLU capabilities of state-of-the-art models. Yet others propose improved model optimisation practices which aim to ignore flaws in training data. The work by McCoy, Pavlick and Linzen (2019) serves as an example for the coherence of these research directions: first, they show that in crowdsourced RTE datasets, specific syntactic constructs are correlated with an expected class. They then show that optimised models rely on this correlation, by evaluating them on valid counterexamples where this correlation does not hold. Later, they show that increasing the syntactic diversity of training data helps to alleviate these limitations (Min et al. 2020).
In this paper, we present a structured survey of this growing body of literature. We survey 121 papers for methods that reveal and overcome weaknesses in data and models and categorise them accordingly. We draw connections between different categories, report the main findings, discuss arising trends and cross-cutting themes, and outline open research questions and possible future directions. Specifically, we aim to answer the following questions:
(1) Which NLU tasks and corresponding datasets have been investigated for weaknesses?
(2) Which types of weaknesses have been reported in models and their training and evaluation data?
(3) What types of methods have been proposed to detect and quantify those weaknesses and measure their impact on model performance, and what methods have been proposed to overcome them?
(4) How have the proposed methods impacted the creation and publication of novel datasets?
The paper is organised as follows: we first describe the data collection methodology and characterise the collected body of literature. We then synthesise the weaknesses that have been identified in this body and categorise the methods used to reveal them. We highlight the impact of those methods on the creation of new resources and conclude with a discussion of open research questions as well as possible future research directions for evaluating and improving the NLU capabilities of NLP models.
2. Methodology
To answer the first three questions, we collect a literature body using the ‘snowballing’ technique. Specifically, we initialise the set of surveyed papers with Tsuchiya (2018), Gururangan et al. (2018), Poliak et al. (2018) and Jia and Liang (2017), because their impact helped to motivate further studies and shape the research field. For each paper in the set, we follow its citations and any work that has cited it according to Google Scholar and include papers that describe methods and/or their applications to report any of: (1) qualitative and quantitative investigation of flaws in training and/or test data and the impact on models optimised/evaluated thereon; (2) systematic issues with task formulations and/or data collection methods; (3) analysis of specific linguistic and reasoning phenomena in data and/or models’ performance on them or (4) proposed improvements in order to overcome data-specific or model-specific issues, related to the phenomena and flaws described earlier. We exclude a paper if its target task does not concern NLU, if it was published before the year 2014, or if the language of the investigated data is not English. We set 2014 as the lower boundary, because it precedes the publication of most large-scale crowdsourced datasets that require NLU.
With this approach, we obtain a total of 121 papers (as of 17 October 2020) from the years 2014 to 2017 (8), 2018 (18), 2019 (42) and 2020 (53). Almost two-thirds (76) of the papers were published in venues hosted by the Association for Computational Linguistics. The remaining papers were published in other venues (eight in AAAI, four in LREC, three in ICLR, two each in ICML and COLING, and five in others) or are available as an arXiv preprint (21). The papers were examined by the first author; for each paper, the target task and dataset(s), the method applied and the result of the application were extracted and categorised.
To answer the fourth question, we selected those publications introducing any of the datasets that were mentioned by at least one paper in the pool of surveyed papers and extended that collection by additional state-of-the-art NLU dataset resource papers (for detailed inclusion and exclusion criteria, see Appendix A). This approach yielded a corpus of 91 papers that introduce 95 distinct datasets. For those papers, we examine whether any of the previously collected methods were applied to report spurious correlations or whether the dataset was adversarially pruned against some model.
Although related, we deliberately do not include work that introduces adversarial attacks on NLP systems or discusses their fairness, as these are out of scope of this survey. For an overview thereof, we refer the interested reader to respective surveys conducted by Zhang et al. (2020b) or Xu et al. (2020) for the former, and by Mehrabi et al. (2021) for the latter. Furthermore, we do not include works that concern wider technical issues, such as performance variance due to different software environments (Crane and Cheriton 2018) or stochastic instability (Dodge et al. 2020).
3. Investigated tasks and datasets
We report the tasks and the corresponding datasets that we have investigated. We supply a full list of these investigated datasets and the type(s) of method(s) applied in Appendix B. Figure 1 depicts all investigated datasets as a bar chart. The distribution roughly follows the popularity of the investigated resources themselves, with the papers presenting MNLI, SQuAD and SNLI being the three most cited among the investigated datasets.
Almost half of the surveyed papers (57) are focused on the RTE task, where the goal is to decide, for a pair of natural language sentences (premise and hypothesis), whether given the premise the hypothesis is true (Entailment), certainly false (Contradiction), or whether the hypothesis might be true, but there is not enough information to determine that (Neutral) (Dagan et al. 2013).
Many of the papers analyse the Machine Reading Comprehension (MRC) task (50 papers), a special case of Question Answering (QA) which concerns finding the correct answer to a question over a passage of text. Note that the tasks are related: answering a question can be framed as finding an answer that is entailed by the question and the provided context (Demszky, Guu and Liang 2018). Inversely, determining whether a hypothesis is true given a premise can be framed as question answering.
Other tasks (eight papers) involve finding the most plausible cause or effect for a short prompt among two alternatives (Roemmele, Bejan and Gordon 2011), fact verification (Thorne et al. 2018) and argument reasoning (Habernal et al. 2018). Seven papers investigated multiple tasks. Note that weaknesses similar to those reported in the surveyed papers were also reported on data and models representing tasks typically not associated with NLU, such as sentiment analysis (Ko et al. 2020a). To keep this paper within the scope we set out, we do not include them in the following discussions.
In general, 18 RTE and 37 MRC datasets were analysed or used at least once; we attribute this difference in number to the existence of various MRC datasets and the tendency of performing multi-dataset analyses in papers that investigate MRC datasets (Kaushik and Lipton 2018; Si et al. 2019; Sugawara et al. 2020). SQuAD (Rajpurkar et al. 2016) for MRC and MNLI (Williams et al. 2018) and SNLI (Bowman et al. 2015) for RTE are the most utilised datasets in the surveyed literature (with 32, 43 and 29 papers investigating or using them, respectively).
4. Identified weaknesses in NLU data and models
In this section, we present the types of weaknesses that have been reported in the surveyed literature. State-of-the-art approaches to solve the investigated tasks are predominantly data-driven. We distinguish between issues identified in their training and evaluation data on the one hand, and the extent to which these issues affect the trained models on the other hand.
4.1 Weaknesses in data
Spurious correlations: Correlations between input data and the expected prediction are ‘spurious’ if there exists no causal relation between them with regard to the underlying task but rather they are an artefact of a specific dataset. They are also referred to as ‘(annotation) artefacts’ (Gururangan et al. 2018) or ‘(dataset) biases’ (He et al. 2019) in the literature.
In span extraction tasks, where the task is to predict a contiguous span of tokens in text, as is the case with MRC, question and passage wording, as well as the position of the answer span in the passage, are indicative of the expected answer for various datasets (Rychalska et al. 2018a; Kaushik and Lipton 2018), such that models can solve examples correctly even without being exposed to either the question or the passage. In the ROCStories dataset (Mostafazadeh et al. 2016), where the task is to choose the most plausible ending to a story, the writing style of the expected ending differs from the alternatives (Schwartz et al. 2017). This difference is noticeable even by humans (Cai, Tu and Gimpel 2017).
For sentence pair classification tasks, such as RTE, Poliak et al. (2018) and Gururangan et al. (2018) showed that certain n-grams, lexical and grammatical constructs in the hypothesis as well as its length correlate with the expected label for a multitude of RTE datasets. For example, the word ‘no’ in the hypothesis occurs more often with the label Contradiction than with the label Entailment in the SNLI and MNLI datasets. This correlation is spurious because, although true for the datasets, the appearance of the word ‘no’ in the hypothesis is not indicative of contradiction. For example, ‘No cats are in the room.’ entails ‘No cats are under the bed.’. Similarly, McCoy et al. (2019) showed that lexical features like word overlap and common subsequences between the hypothesis and premise are highly predictive of the entailment label in the MNLI dataset. These correlations, too, are artefacts of the data collection method rather than robust indicators of entailment – for example ‘I almost went to Vienna.’ does not entail ‘I went to Vienna’ despite a high lexical overlap.
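The lexical overlap cue can be made concrete with a small sketch. The code below is not taken from any of the cited papers; the function names, the naive tokeniser and the threshold are illustrative assumptions. It implements a heuristic that predicts Entailment whenever every hypothesis token also occurs in the premise, and applies it to the Vienna example.

```python
import re

def tokens(sentence):
    """Lowercase word tokens; a deliberately naive, assumed tokeniser."""
    return re.findall(r"[a-z']+", sentence.lower())

def overlap_ratio(premise, hypothesis):
    """Fraction of hypothesis tokens that also occur in the premise."""
    premise_tokens = set(tokens(premise))
    hypothesis_tokens = tokens(hypothesis)
    if not hypothesis_tokens:
        return 0.0
    shared = sum(1 for t in hypothesis_tokens if t in premise_tokens)
    return shared / len(hypothesis_tokens)

def overlap_heuristic(premise, hypothesis, threshold=1.0):
    """Predict the label from lexical overlap alone (a spurious cue)."""
    if overlap_ratio(premise, hypothesis) >= threshold:
        return "Entailment"
    return "Non-entailment"

premise = "I almost went to Vienna."
hypothesis = "I went to Vienna."
# Every hypothesis token occurs in the premise, so the heuristic
# predicts Entailment although the premise does not entail the hypothesis.
print(overlap_ratio(premise, hypothesis))
print(overlap_heuristic(premise, hypothesis))
```

A model that has learned a decision rule of this kind would score well on data where overlap and entailment happen to coincide, and fail on counterexamples such as the one above.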
Beyond RTE, the choices in the COPA dataset (Roemmele et al. 2011), where the task is to finish a given passage (similar to ROCStories), and in ARCT (Habernal et al. 2018), where the task is to select whether a statement warrants a claim, contain words that correlate with the expected prediction (Kavumba et al. 2019; Niven and Kao 2019).
Other data quality issues: Pavlick and Kwiatkowski (2019) argue that when training data are annotated using crowdsourcing, a fixed label representing the ground truth, usually obtained by majority vote between annotators, is not representative of the uncertainty, which can be important to indicate the complexity of an example. Sometimes, data annotations are factually wrong (Pugaliya et al. 2019; Schlegel et al. 2020), for example, due to limitations of the annotation protocol, such as when only one ground truth label is expected but the data contain multiple plausible alternatives. In ‘multi-hop’ datasets, such as HotPotQA and WikiHop, where the task is to find an answer after aggregating evidence across multiple documents, this process can be circumvented in the case of examples where the location of the final answer is cued by the question (Min et al. 2019). For example, consider Figure 2: while initially this looks like a complex question that requires spatial reasoning over multiple documents, the keyword combination ‘2010’ and ‘population’ in the question is unique to the answer sentence across all 10 documents, allowing the answer to be found without ever reading Passage 1. The initially complex question can be substituted by the much easier question ‘What is the 2010 population?’ which does not require any reasoning and has a unique answer that coincides with the expected answer to the original question. This is especially true for the multiple-choice task formulation, as the correct answer can often be ‘guessed’ by excluding implausible alternatives (Chen and Durrett 2019), for example, by matching the interrogative pronoun with the corresponding lexical answer type. Sugawara et al. (2018) show that multiple MRC benchmarks contain numerous questions that are easy to answer, as they require little comprehension or inference skills and can be solved by looking at the first few tokens of the question. This property appears ubiquitous among multiple datasets (Longpre, Lu and DuBois 2021). Finally, Rudinger, May and Van Durme (2017) show the presence of gender and racial stereotypes in crowdsourced RTE datasets.
There are multiple reasons for data quality issues, among them the carelessness of the annotators, usually hired via a crowdsourcing platform (Brühlmann et al. 2020). This also provides a possible explanation for the existence of dataset artefacts: as annotators are paid per annotation, they often adopt simple strategies to maximise their output. For example, deriving a contradicting hypothesis by simply negating the premise might lead to the spurious correlation discussed before. Thus, it is important to establish the quality of data during and after collection, by filtering low-quality examples, an increasingly employed practice, as we discuss in Section 5.1.
In any case, these data quality issues diminish the explanatory power of observations about models evaluated on these data: the presence of cues casts doubt on whether various NLU capabilities are required at all, if a simpler model can perform reasonably well by exploiting these cues. The situation is similar when expected answers are factually wrong.
4.2 Model weaknesses
In this section, we discuss whether data-driven approaches to NLU are in fact affected by these data quality issues. We discuss multiple works that reveal their dependence on dataset-specific artefacts. This is further evidenced by their poor generalisation on data that stem from a different distribution than their training data, suggesting that they in fact overfit to the patterns inherent to different datasets rather than reliably learning the underlying task. A suspected reason for this is that state-of-the-art NLP approaches rely increasingly on no-assumption architectures – little to no expert knowledge is encoded into the models a priori and all necessary information is expected to be derived from the data.
Dependence on dataset-specific artefacts: Given the data-related issues discussed earlier, it is worth knowing whether models optimised on these data actually inherit them. In fact, multiple studies confirm this hypothesis (McCoy et al. 2019; Niven and Kao 2019; Kavumba et al. 2019). This is usually evidenced by the poor performance of models evaluated on a version of the data in which the spurious correlations have been balanced. To illustrate this by means of the previous example, a balanced set would have equally many instances of the word ‘no’ appearing with the label Contradiction and with the label Entailment. Note that the act of balancing the data alters the underlying generative process; this type of evaluation is thus performed on data outside of the training data distribution, and the same considerations regarding poor out-of-distribution generalisation, discussed in more detail in the following section, apply.
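One simple way to obtain such a balanced set is to downsample the examples that contain the cue. The following is a minimal sketch under that assumption; the surveyed papers may construct balanced sets differently (for example, by writing new counterexamples), and the function name and toy data are hypothetical.

```python
import random
from collections import defaultdict

def balance_by_cue(examples, cue, seed=0):
    """Downsample cue-bearing examples so the cue occurs equally often with every label.

    `examples` is a list of (hypothesis, label) pairs.
    Examples that do not contain the cue are kept unchanged.
    """
    rng = random.Random(seed)
    with_cue = defaultdict(list)
    without_cue = []
    for hypothesis, label in examples:
        if cue in hypothesis.lower().split():
            with_cue[label].append((hypothesis, label))
        else:
            without_cue.append((hypothesis, label))
    labels = {label for _, label in examples}
    # The rarest label among cue-bearing examples bounds the balanced count.
    per_label = min(len(with_cue[label]) for label in labels)
    balanced = list(without_cue)
    for label in sorted(labels):
        balanced.extend(rng.sample(with_cue[label], per_label))
    return balanced

# Toy evaluation data, invented for illustration only.
EVAL = [
    ("no dog is running", "Contradiction"),
    ("there is no rain", "Contradiction"),
    ("no child is awake", "Contradiction"),
    ("no cats are under the bed", "Entailment"),
    ("a person is outside", "Entailment"),
]
balanced = balance_by_cue(EVAL, "no")
for hypothesis, label in balanced:
    print(label, "|", hypothesis)
```

On the balanced set, a model that predicts Contradiction whenever it sees ‘no’ is correct on only half of the cue-bearing examples, which is what exposes its reliance on the cue.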
Neural models tend to disregard syntactic structure (Rychalska et al. 2018a,b) and important words (Mudrakarta et al. 2018), making them insensitive towards small but potentially meaningful perturbations in inputs. This results in MRC models that are negatively impacted by the presence of lexically similar but semantically irrelevant ‘distractor sentences’ (Jia and Liang 2017; Jiang and Bansal 2019), give inconsistent answers to semantically equivalent input (Ribeiro, Singh and Guestrin 2018) or fail to distinguish between semantically different inputs with similar surface form (Gardner et al. 2020; Welbl et al. 2020). For RTE, they may disregard the composition of the sentence pairs (Nie, Wang and Bansal 2019).
Poor generalisation outside of training distribution: Mediocre performance when evaluated on RTE (Glockner, Shwartz and Goldberg 2018; Naik et al. 2018; Yanaka et al. 2019b) and MRC data (Talmor and Berant 2019; Dua et al. 2019a) that stem from a different generative process than the training data (leading to out-of-distribution examples) reinforces the observation that models pick up spurious correlations that do not hold between different datasets, as outlined earlier. Limited out-of-distribution generalisation capabilities of state-of-the-art models suggest that they are ‘lazy learners’: when possible, they infer simple decision strategies from training data that are not representative of the corresponding task, instead of learning the necessary capabilities to perform inference. Nonetheless, recent work shows that the self-supervised pre-training of transformer-based language models allows them to adapt to the new distribution from few examples (Brown et al. 2020; Schick and Schütze 2021).
No-assumption architectures: Note that these weaknesses arise because state-of-the-art end-to-end architectures (Bahdanau, Cho and Bengio 2015), such as the transformer (Vaswani et al. 2017), are designed with minimal assumptions. As little prior knowledge as possible is encoded into the model architecture – all necessary information is expected to be inferred from the (pre-)training data. The optimisation objectives reflect this assumption as well: beyond the loss function accounting for the error in prediction, hardly any regularisation is used. As a consequence, there is no incentive for models to distinguish between spurious and reliable correlations, so they follow the strongest signal present in data. In fact, one of the main themes discussed in Section 5.3 is to inject additional knowledge, for example, in the form of more training data or heavier regularisation, as a countermeasure, in order to make the optimised model rely less on potentially biased data. For example, models that operate over syntax trees rather than sequences tend to be less prone to syntactic biases (McCoy et al. 2019).
5. Categorisation of methods that reveal and overcome weaknesses in NLU
In the following section, we categorise the methodologies collected from the surveyed papers, briefly describe the categories and exemplify them by referring to respective papers. On a high level, we distinguish between methods that: (a) reveal systematic issues with existing training and evaluation data, such as the spurious correlations mentioned earlier, (b) investigate whether they translate to models optimised on these data with regard to acquired inference and reasoning capabilities and (c) propose architectural and training procedure improvements in order to alleviate the issues and improve the robustness of the investigated models. A schematic overview of the taxonomy of the categories is shown in Figure 3. The quantitative results of the categorisation are shown in Figure 4.
5.1 Data-investigating methods
Methods in this category analyse flaws in data such as cues in input that are predictive of the output (Gururangan et al. 2018). As training and evaluation data from state-of-the-art NLU datasets are assumed to be drawn from the same distribution, models that were fitted on those cues achieve high performance in the evaluation set, without being tested on the required inference capabilities. Furthermore, methods that investigate the evaluation data in order to better understand the assessed capabilities (Chen, Bolton and Manning 2016) fall under this category as well. In the analysed body of work, we identified the types of methods discussed in the following paragraphs. In Table 1, we summarise them with their corresponding investigation goal.
Partial Baselines are employed in order to verify that all input provided by the task is actually required to make the right prediction (e.g., both question and passage for MRC, and premise and hypothesis for RTE). If a classifier trained on partial input performs significantly better than a random guessing baseline, it stands to reason that the omitted parts of the input are not required to solve the task. On the one hand, this implies that the input used to optimise the classifier might exhibit cues that simplify the task. On the other hand, if the omitted data represent a specific capability, the conclusion is that this capability is not evaluated by the dataset, a practice we refer to as Data Ablation. Examples for the former include training classifiers that perform much better than the random guess baseline on hypotheses only for the task of RTE (Gururangan et al. 2018; Poliak et al. 2018) and on passages only for MRC (Kaushik and Lipton 2018). For the latter, Sugawara et al. (2020) drop words that are required to perform certain comprehension abilities (e.g., dropping pronouns to evaluate pronominal coreference resolution capabilities) and reach performance comparable to that of a model that is trained on the full input on a variety of MRC datasets. Nie et al. (2019) reach near state-of-the-art performance on RTE tasks when shuffling words in premise and hypothesis, showing that understanding the compositional nature of language is not required by these datasets. A large share of work in this area concentrates on evaluating datasets with regard to the requirement to perform ‘multi-hop’ reasoning (Min et al. 2019; Chen and Durrett 2019; Jiang and Bansal 2019; Trivedi et al. 2020) by measuring the performance of a partial baseline that exhibits an Architectural Constraint to perform single-hop reasoning (e.g., by processing input sentences independently).
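The idea of a hypothesis-only partial baseline can be sketched as follows. This is a toy illustration rather than the setup of any cited paper: the data are invented, and the word-count voting classifier is an assumed stand-in for the trained classifiers used in the literature. The premise is never read, so any accuracy above chance must come from cues in the hypothesis.

```python
from collections import Counter, defaultdict

# Toy RTE examples (premise, hypothesis, label); invented for illustration only.
TRAIN = [
    ("A man plays guitar.", "Nobody plays music.", "Contradiction"),
    ("A dog runs outside.", "No animal is outside.", "Contradiction"),
    ("Kids swim in a pool.", "Nobody is in the water.", "Contradiction"),
    ("A woman reads a book.", "A person is reading.", "Entailment"),
    ("Two boys kick a ball.", "Children are playing.", "Entailment"),
    ("A chef cooks pasta.", "A person is cooking.", "Entailment"),
]
TEST = [
    ("A girl sings on stage.", "Nobody is singing.", "Contradiction"),
    ("A cat sleeps on a sofa.", "A pet is resting.", "Entailment"),
    ("Men repair a roof.", "No people work.", "Contradiction"),
    ("A boy rides a bike.", "A person is outside.", "Entailment"),
]

def words(sentence):
    """Set of lowercased whitespace-separated words, without the final full stop."""
    return set(sentence.lower().strip(".").split())

def train_hypothesis_only(examples):
    """Count how often each hypothesis word occurs with each label; premises are ignored."""
    word_label_counts = defaultdict(Counter)
    label_counts = Counter()
    for _premise, hypothesis, label in examples:
        label_counts[label] += 1
        for word in words(hypothesis):
            word_label_counts[word][label] += 1
    return word_label_counts, label_counts

def predict(model, hypothesis):
    """Vote with the per-word label counts; fall back to the majority training label."""
    word_label_counts, label_counts = model
    votes = Counter()
    for word in words(hypothesis):
        votes.update(word_label_counts.get(word, {}))
    if not votes:
        return label_counts.most_common(1)[0][0]
    return votes.most_common(1)[0][0]

def accuracy(model, examples):
    """Share of examples whose label is predicted from the hypothesis alone."""
    return sum(predict(model, h) == y for _p, h, y in examples) / len(examples)

model = train_hypothesis_only(TRAIN)
chance = 1 / len({label for _p, _h, label in TEST})
print(f"hypothesis-only accuracy: {accuracy(model, TEST):.2f} (chance: {chance:.2f})")
```

In this toy data the negation words in the hypotheses carry the label, so the partial baseline clearly beats chance; on a dataset without such cues it should not.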
Insights from partial baseline methods bear negative predictive power only – their failure does not necessarily entail that the data are free of cues, as they can exist in different parts of the input. As an example, consider an MRC dataset, where the three words before and after the answer span are appended to the question. Partial baselines would not be able to pick up this cue, because it can only be exploited by considering both question and passage. Feng, Wallace and Boyd-Graber (2019) show realistic examples of this phenomenon in published datasets. Furthermore, above-chance performance of partial baselines merely hints at spurious correlations in the data and suggests that models learn to exploit them; it does not reveal their precise nature.
Heuristics and Correlations are used to unveil the nature of cues and spurious correlations between input and expected output. For sentence pair classification tasks, modelling the co-occurrence of words or n-grams with the expected prediction label by means of Pointwise Mutual Information (Gururangan et al. 2018) or conditional probability (Poliak et al. 2018; Tan et al. 2019) shows the likelihood of an expression being predictive of a label. Measuring coverage (Niven and Kao 2019) further indicates what proportion of the dataset is affected by this correlation. Manually inspecting these correlations can help to identify whether they can simplify the task. For example, if an expression perfectly correlates with a label and covers some subset of the data, then this subset can be solved correctly by relying on the appearance of this expression alone. Such correlations are, furthermore, spurious if they are not indicative of the underlying task but rather artefacts of the dataset.
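These three quantities can be computed from simple co-occurrence counts. The sketch below uses the textbook, unsmoothed definition of PMI, PMI(w, l) = log2 [p(w, l) / (p(w) p(l))], estimated over examples, and treats coverage as the share of examples that contain the cue. Both are simplifying assumptions: the cited papers may apply smoothing or define coverage differently, and the toy data and function name are invented for illustration.

```python
import math
from collections import Counter

# Toy hypothesis/label pairs; invented for illustration, not drawn from SNLI or MNLI.
DATA = [
    ("no dog is running", "Contradiction"),
    ("there is no rain", "Contradiction"),
    ("a man is not sleeping", "Contradiction"),
    ("a dog is running", "Entailment"),
    ("a person is outside", "Entailment"),
    ("no one is sad", "Entailment"),
]

def cue_statistics(data, cue):
    """Map each label to (PMI, P(label | cue), coverage) for a single-word cue."""
    n = len(data)
    label_counts = Counter(label for _, label in data)
    with_cue = [(text, label) for text, label in data if cue in text.split()]
    cue_label_counts = Counter(label for _, label in with_cue)
    coverage = len(with_cue) / n  # also the estimate of p(cue)
    stats = {}
    for label, label_count in label_counts.items():
        joint = cue_label_counts[label] / n
        if joint == 0:
            continue  # PMI is undefined without smoothing when cue and label never co-occur
        pmi = math.log2(joint / (coverage * (label_count / n)))
        conditional = cue_label_counts[label] / len(with_cue)
        stats[label] = (pmi, conditional, coverage)
    return stats

stats = cue_statistics(DATA, "no")
for label, (pmi, conditional, coverage) in stats.items():
    print(f"{label}: PMI={pmi:.3f}, P(label|cue)={conditional:.2f}, coverage={coverage:.2f}")
```

A positive PMI together with a high conditional probability marks the cue as predictive of the label, while coverage shows how much of the dataset could be affected.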
These exploratory methods require no a priori assumptions about the kind of bias they can reveal. Other methods require more input, such as qualitative data analysis and identification of syntactic (McCoy et al. 2019) and lexical (Liu et al. 2020b) patterns that correlate with the expected label. Furthermore, Nie et al. (2019) use the confidence of a logistic regression model optimised on lexical features to predict the wrong label to rank data by their requirements to perform comprehension beyond lexical matching.
It is worth highlighting that there is comparatively little work analysing MRC data (4 out of 18 surveyed methods) with regard to spurious correlations. We attribute this to the fact that it is hard to conceptualise the correlations of input and expected output for MRC beyond very coarse heuristics such as sentence position (Si et al. 2020) or lexical overlap (Sugawara et al. 2018), as the input is a whole paragraph and a question and the expected output is typically a span anywhere in the paragraph. Furthermore, the prediction labels (paragraph indices for answer spans or the number of the chosen alternative for multiple-choice type of questions) do not bear any semantic meaning, so correlation between input and predicted raw output such as those discussed earlier can only unveil positional bias. For RTE, in contrast, the input consists of two sentences and the expected output is one of three fixed class labels that carry the same semantics regardless of the input; therefore, possible correlations are easier to unveil.
Manual Analyses are performed to qualitatively analyse the data if automated approaches such as those mentioned earlier are unsuitable due to the complexity of the phenomena of interest or the output space discussed earlier. We posit that this is the reason why most methods in this category concern analysing MRC data (seven out of nine surveyed methods). Qualitative annotation frameworks were proposed to investigate the presence of linguistic features (Schlegel et al. Reference Schlegel, Valentino, Freitas, Nenadic and Batista-Navarro2020) and cognitive skills required for reading comprehension (Sugawara, Yokono and Aizawa Reference Sugawara, Kido, Yokono and Aizawa2017a).
5.2 Model-investigating methods
Rather than analysing data, approaches described in this section directly evaluate models in terms of their inference capabilities with respect to various phenomena of interest. Released evaluation resources are summarised in Table 2.
Challenge Sets make for an increasingly popular way to assess various capabilities of optimised models. Challenge sets feature a collection of (typically synthetically generated) examples that exhibit a specific phenomenon of interest. Bad performance on the challenge set indicates that the model has failed to obtain the capability to process the phenomenon correctly. Similar to partial baselines, a good result does not necessarily warrant the opposite, unless guarantees can be made that the challenge set is perfectly representative of the investigated phenomenon. Naik et al. (Reference Naik, Ravichander, Sadeh, Rose and Neubig2018) automatically generate RTE evaluation data based on an analysis of observed state-of-the-art model error patterns, introducing the term ‘stress-test’. Challenge sets have since been proposed to evaluate RTE models with regard to the acquisition of linguistic capabilities such as monotonicity (Yanaka et al. Reference Yanaka, Mineshima, Bekki, Inui, Sekine, Abzianidze and Bos2019a), lexical inference (Glockner et al. Reference Glockner, Shwartz and Goldberg2018), logic (Richardson et al. Reference Richardson, Hu, Moss and Sabharwal2019) and understanding language compositionality (Nie et al. Reference Nie, Wang and Bansal2019). With respect to MRC, we note that there are few (11) challenge sets concerning rather broad categories such as prediction consistency (Ribeiro et al. Reference Ribeiro, Guestrin and Singh2019; Gardner et al. Reference Gardner, Artzi, Basmov, Berant, Bogin, Chen, Dasigi, Dua, Elazar, Gottumukkala, Gupta, Hajishirzi, Ilharco, Khashabi, Lin, Liu, Liu, Mulcaire, Ning, Singh, Smith, Subramanian, Tsarfaty, Wallace, Zhang and Zhou2020), acquired knowledge (Richardson and Sabharwal Reference Richardson and Sabharwal2020) or transfer to different datasets (Dua et al. Reference Dua, Gottumukkala, Talmor, Gardner and Singh2019a; Miller et al. Reference Miller, Krauth, Recht and Schmidt2020).
Notably, these challenge sets are well suited to evaluate the investigated capabilities, because they perform a form of out-of-distribution evaluation. Since the evaluation data stem from a different (artificial) generative process than the crowdsourced training data, possible decision rules based on cues are more likely to fail. The drawback of this, however, is that in this way the challenge sets evaluate both the investigated capability and the performance under distribution shift. Liu, Schwartz and Smith (Reference Liu, Schwartz and Smith2019a) show that for some of the challenge sets, after fine-tuning (‘inoculating’) on small portions of it, the challenge set performance increases, without sacrificing the performance on the original data. However, Rozen et al. (Reference Rozen, Shwartz, Aharoni and Dagan2019) show that good performance after fine-tuning cannot be taken as evidence of the model learning the phenomenon of interest – rather the model adapts to the challenge-set-specific distribution and fails to capture the general notion of interest. This is indicated by low performance when evaluating on challenge sets that stem from a different generative process but focus on the same phenomenon. These results suggest that the ‘inoculation’ methodology is of limited suitability to disentangle the effects of domain shift from evaluating the capability to process the investigated phenomenon.
Furthermore, a line of work proposes to evaluate the systematic generalisation capabilities of RTE models (Geiger et al. Reference Geiger, Cases, Karttunen and Potts2019; Geiger, Richardson and Potts Reference Geiger, Richardson and Potts2020; Goodwin et al. Reference Goodwin, Sinha and O’Donnell2020), concretely the capability to infer and understand compositional rules that underlie natural language. These studies concern mostly artificial languages, however.
Adversarial Evaluation introduces evaluation data that were generated with the aim to ‘fool’ models. Szegedy et al. (Reference Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow and Fergus2014) define ‘adversarial examples’ as (humanly) imperceptible perturbations to images that cause a significant drop in the prediction performance of neural models. Similarly for NLP, we refer to data as ‘adversarial’ if it is designed to minimise prediction performance for automated approaches, while not impacting the human baseline. Thus, patterns in the behaviour of a certain model or a range of models of certain architectures serve as a starting point for the development of adversarial evaluation methods. This is different from challenge sets, which focus on phenomena of interest without assumptions about the evaluated model. Notably, these two terms are sometimes used interchangeably in literature.
Adversarial methods are used to show that models rely on superficial, dataset-specific cues, as discussed in Section 4.2. This is typically done by creating a balanced version of the evaluation data, where the previously identified spurious correlations present in training data do not hold anymore (McCoy et al. Reference McCoy, Pavlick and Linzen2019; Kavumba et al. Reference Kavumba, Inoue, Heinzerling, Singh, Reisert and Inui2019; Niven and Kao Reference Niven and Kao2019), or by applying semantics-preserving perturbations to the input (Jia and Liang Reference Jia and Liang2017; Ribeiro et al. Reference Ribeiro, Singh and Guestrin2018). Note that this is yet another method that alters the distribution of the evaluation data with respect to the training data.
Adversarial techniques are further used to understand model behaviour (Sanchez, Mitchell and Riedel Reference Sanchez, Mitchell and Riedel2018), such as identifying training examples (Han, Wallace and Tsvetkov Reference Han, Wallace and Tsvetkov2020) or neuron activations (Mu and Andreas Reference Mu and Andreas2020) that contribute to a certain prediction. Among those we highlight the work by Wallace et al. (Reference Wallace, Feng, Kandpal, Gardner and Singh2019) who showed that malicious adversaries generated against a target model tend to be universal for a whole range of neural architectures.
5.3 Model-improving methods
Here, we report methods that improve the robustness of models against adversarial and out-of-distribution evaluation, by either modifying the training data or making adjustments to model architecture or the training procedure. We group the methods by their conceptual approach and present them together with their applications in Table 3. In line with the literature (Wang and Bansal Reference Wang and Bansal2018; Jia et al. Reference Jia, Raghunathan, Göksel and Liang2019), we call a model ‘robust’ against a method that alters the underlying distribution of the evaluation data (hence making it substantially different from the training data) through for example, adversarial or challenge sets, if the out-of-distribution performance of the model is similar to that on the original evaluation set. They have become increasingly popular: 30%, 35% and 51% of the surveyed methods published in the years 2018, 2019 and 2020, respectively, fall into this category (and none before 2018). We attribute this to the public availability of evaluation resources discussed in Section 5.2 as they facilitate the rapid prototyping and testing of these methods.
Data Augmentation and Pruning combat the issues arising from low-bias architectures by injecting the required knowledge, in the form of (usually synthetically generated) data, during training. There is ample evidence that augmenting training data with examples featuring a specific phenomenon increases the performance on a challenge set evaluating that phenomenon (Wang et al. Reference Wang, Singh, Michael, Hill, Levy and Bowman2018; Jiang and Bansal Reference Jiang and Bansal2019; Zhou and Bansal Reference Zhou and Bansal2020); for example, Yanaka et al. (Reference Yanaka, Mineshima, Bekki, Inui, Sekine, Abzianidze and Bos2019b) propose an automatically constructed dataset as an additional training resource to improve monotonicity reasoning capabilities in RTE. As these augmentations come at the cost of lower performance on the original evaluation data, Maharana and Bansal (Reference Maharana and Bansal2020) propose a framework to combine different augmentation techniques such that the performance on both is optimised.
More interesting are approaches that augment data without focussing on a specific phenomenon. By increasing data diversity, better performance under adversarial evaluation can be achieved (Talmor and Berant Reference Talmor and Berant2019; Tu et al. Reference Tu, Lalwani, Gella and He2020). Similarly, augmenting training data in a meaningful way, for example, with counterexamples, by asking crowdworkers to apply perturbations that change the expected label (Kaushik et al. Reference Kaushik, Hovy and Lipton2020; Khashabi et al. Reference Khashabi, Khot and Sabharwal2020), helps models to achieve better robustness beyond the training set distribution.
An alternative direction is to increase data quality, by removing data points that exhibit spurious correlations. After measuring the correlations with methods discussed in Section 5.1, those training examples exhibiting strong correlations can be removed. The AFLite algorithm (Sakaguchi et al. Reference Sakaguchi, Bras, Bhagavatula and Choi2020) combines both of these steps by assuming that a linear correlation between embeddings of inputs and prediction labels is indicative of biased data points. This is an extension of the Adversarial Filtering algorithm (Zellers et al. Reference Zellers, Bisk, Schwartz and Choi2018), whereby multiple-choice alternatives are automatically generated until a target model can no longer distinguish between human-written (correct) and automatically generated (wrong) options.
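The two-step idea of first measuring correlations and then removing the affected training examples can be sketched as follows. This is a deliberately simple stand-in based on the conditional probability of a label given a word. It is not the AFLite or Adversarial Filtering algorithm, which operate on learned embeddings and trained target models. The function names `cue_strength` and `prune` and the thresholds `min_prob` and `min_count` are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def cue_strength(examples):
    """Map each word to (most frequent label, p(label | word), word count).

    `examples` is a list of (tokens, label) pairs (assumed layout).
    A word is counted at most once per example.
    """
    word_label = defaultdict(Counter)
    for tokens, label in examples:
        for word in set(tokens):
            word_label[word][label] += 1

    strength = {}
    for word, counts in word_label.items():
        label, hits = counts.most_common(1)[0]
        total = sum(counts.values())
        strength[word] = (label, hits / total, total)
    return strength


def prune(examples, min_prob=0.9, min_count=2):
    """Drop examples containing a word that strongly predicts their label.

    A word counts as a cue if it occurs in at least `min_count` examples
    and its most frequent label has conditional probability >= `min_prob`.
    Both thresholds are illustrative assumptions.
    """
    strength = cue_strength(examples)
    kept = []
    for tokens, label in examples:
        has_cue = any(
            strength[w][0] == label
            and strength[w][1] >= min_prob
            and strength[w][2] >= min_count
            for w in set(tokens)
        )
        if not has_cue:
            kept.append((tokens, label))
    return kept
```

Pruning of this kind reduces the amount of training data, so in practice the thresholds would need to be balanced against the size of the remaining dataset.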
A noteworthy trend is the application of adversarial data generation against a target model that is employed during the construction of a new dataset. In crowdsourcing, humans act as adversary generators and an entry is accepted only if it triggers a wrong prediction by a trained target model (Dua et al. Reference Dua, Wang, Dasigi, Stanovsky, Singh and Gardner2019b; Nie et al. Reference Nie, Williams, Dinan, Bansal, Weston and Kiela2020). Mishra et al. (Reference Mishra, Arunkumar, Sachdeva, Bryan and Baral2020) combine both directions in an interface which aims to assist researchers who publish new datasets with different visualisation, filtering, and pruning techniques.
Architecture and Training Procedure Improvements deviate from the idea of data augmentation and seek to train robust models from potentially biased data. Adversarial techniques (Goodfellow et al. Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville and Bengio2014), in which a generator of adversarial training examples (such as those discussed in Section 5.2, e.g., perturbing the input) is trained jointly with the discriminative model that is later used for inference, have been applied to different NLU tasks (Stacey et al. Reference Stacey, Minervini, Dubossarsky, Riedel and Rocktäschel2020; Welbl et al. Reference Welbl, Minervini, Bartolo, Stenetorp and Riedel2020).
Specific knowledge about the type of bias present in data can be used to discourage a model from learning from it. For example, good performance (as indicated by a small loss) of a partial input classifier is interpreted as an indication that data points exhibit spurious correlations. This information can be used to train an ‘unbiased’ classifier jointly (Clark et al. Reference Clark, Yatskar and Zettlemoyer2019b; He et al. Reference He, Zha and Wang2019; Belinkov et al. Reference Belinkov, Poliak, Shieber, Van Durme and Rush2019). Alternatively, their contribution to the overall optimisation objective can be rescaled (Schuster et al. Reference Schuster, Shah, Yeo, Roberto Filizzola Ortiz, Santus and Barzilay2019; Zhang et al. Reference Zhang, Sheng, Alhazmi and Li2020b; Mehrabi et al. Reference Mehrabi, Morstatter, Saxena, Lerman and Galstyan2021). The intuition behind these approaches is similar to Adversarial Filtering which is mentioned earlier: the contribution of biased data to the overall training is reduced. For lexical biases, such as cue words, Utama et al. (Reference Utama, Moosavi and Gurevych2020) show that a biased classifier can be approximated by overfitting a regular model on a small portion of the training set. For RTE, Zhang et al. (Reference Zhang, Bai, Liang, Bai, Zhu and Zhao2020a) compare the effects of different proposed debiasing variants discussed in this paragraph. They find that these approaches yield moderate improvements in out-of-distribution performance (up to 7% using the method by He et al. Reference He, Zha and Wang2019).
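To make the two families of approaches in this paragraph more concrete, the following minimal sketch shows one simple form of each, assuming that a bias-only model (e.g., a partial input classifier) already provides a probability distribution over labels for every example. The first function rescales each example's loss by how poorly the bias-only model predicts the gold label; the second combines the main and the bias-only model as a product of experts. The function names and the specific weighting scheme `1 - p_bias` are illustrative assumptions and do not necessarily match the exact formulation of any of the cited works.

```python
import math


def reweighted_loss(main_probs, bias_probs, labels):
    """Mean cross-entropy of the main model, down-weighting biased examples.

    main_probs / bias_probs: per-example lists of class probabilities.
    labels: per-example gold class indices.
    Weight = 1 - p_bias(gold label): examples the bias-only model
    already solves confidently contribute little to the objective.
    """
    total = 0.0
    for p_main, p_bias, y in zip(main_probs, bias_probs, labels):
        weight = 1.0 - p_bias[y]
        total += -weight * math.log(p_main[y])
    return total / len(labels)


def product_of_experts(p_main, p_bias):
    """Combine two distributions multiplicatively and renormalise.

    During training, the loss would be computed on this combined
    distribution; at inference time only the main model is used.
    """
    combined = [m * b for m, b in zip(p_main, p_bias)]
    z = sum(combined)
    return [c / z for c in combined]
```

In both variants the effect is that examples which are easy for the bias-only model exert less pressure on the parameters of the main model.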
In an effort to incorporate external knowledge into the model to increase its robustness, multi-task training frameworks with semantic role labelling (SRL) (Cengiz and Yuret Reference Cengiz and Yuret2020) and explanation reconstruction (Rajagopal et al. Reference Rajagopal, Tandon, Clark, Dalvi and Hovy2020) have been proposed. It is interesting to note that SRL is a popular choice for incorporating additional linguistic information (Wu et al. Reference Wu, Huang, Wang, Feng, Yu and Wang2019; Chen and Durrett Reference Chen and Durrett2020) due to the fact that it exhibits syntactic and semantic information independent of the specific dataset. Additional external resources encoded into the models during training are named entities (Mitra et al. Reference Mitra, Shrivastava and Baral2020), information from knowledge bases (Wu and Xu Reference Wu and Xu2020) or logic constraints (Minervini and Riedel Reference Minervini and Riedel2018).
Interestingly, inconsistency on counterexamples, such as those used for training data augmentation, can be explicitly utilised as a regularisation penalty, to encourage models to detect meaningful differences in input data (Teney et al. Reference Teney, Abbasnedjad and van den Hengel2020a; Asai and Hajishirzi Reference Asai and Hajishirzi2020). Countermeasures against models circumventing multi-hop reasoning include providing labels as a strong supervision signal for spans that bridge the information between multiple sentences (Jiang and Bansal Reference Jiang and Bansal2019) or decomposing and sequentially processing compositional questions (Tang et al. Reference Tang, Ng and Tung2021).
6. Impact on the creation of new datasets
Finally, we report whether the existence of spurious correlations is considered when publishing new resources, by applying any quantitative methods such as those discussed in Section 5.1, or whether some kind of adversarial pruning discussed in Section 5.3 was employed. The results are shown in Figure 5. We observe that the publications we use as our seed papers for the survey (cf. Section 2) in fact seem to impact how novel datasets are presented, as after their publication (in years 2017 and 2018), a growing number of papers report partial baseline results and existing correlations in their data (four in 2018 and five in 2019). Furthermore, newly proposed resources are increasingly pruned against state-of-the-art approaches (nine in 2018 and 2019 cumulatively). However, for nearly half (46 out of 96) of the datasets under investigation, there is no information about potential spurious correlations yet. The scientific community would benefit from an application of the quantitative methods that have been presented in this survey to those NLU datasets.
7. Discussion and conclusion
We present a structured survey of methods that reveal flaws in NLU datasets, methods that show that neural models inherit those correlations or assess their capabilities otherwise, and methods that mitigate those weaknesses. Due to the prevalence of simple, low-bias architectures, the lack of data diversity and existence of data specific artefacts result in models that fail to discriminate between spurious and reliable correlation signals in training data. This, in turn, confounds the hypotheses about the capabilities they acquire when trained and evaluated on these data. More realistic, lower estimates of their capabilities are reported when evaluated on data drawn from a different distribution and with focus on specific capabilities. Efforts towards more robust models include injecting additional knowledge by augmenting training data or introducing constraints into the model architecture, heavier regularisation and training on auxiliary tasks, or encoding more knowledge-intensive input representations.
Based on these insights, we formulate the following recommendations for possible future research directions:
Most methods discussed in this survey bear negative predictive power only, but the absence of negative results cannot be interpreted as positive evidence. This can be taken as a motivation to put more effort into research that verifies robustness (Shi et al. Reference Shi, Zhang, Chang, Huang and Hsieh2020), develops model ‘test suites’ inspired by good software engineering practices (Ribeiro et al. Reference Ribeiro, Wu, Guestrin and Singh2020) or provides worst-case performance bounds (Raghunathan, Steinhardt, and Liang Reference Raghunathan, Steinhardt and Liang2018; Jia et al. Reference Jia, Raghunathan, Göksel and Liang2019). Similar endeavours are pursued by researchers who propose to rethink the empirical risk minimisation (ERM) principle, where the assumption is that the performance on the evaluation data can be approximated by the performance on training data, in favour of approaches that relax this assumption. Examples include optimising worst-case performance on a group of training sets (Sagawa et al. Reference Sagawa, Koh, Hashimoto and Liang2020) or learning features that are invariant in multiple training environments (Teney, Abbasnejad and Hengel Reference Teney, Abbasnedjad and van den Hengel2020b).
While one of the main themes for combatting reliance on spurious correlations is injecting additional knowledge, there is a need for a systematic investigation of how the type and amount of prior knowledge affect neural models’ out-of-distribution, adversarial and challenge set evaluation performance.
Partial input baselines are conceptually simple and cheap to employ for any task, so researchers should be encouraged to apply and report their performance when introducing a novel dataset. While not a guarantee for the absence of spurious correlations (Feng et al. Reference Feng, Wallace and Boyd-Graber2019), they can hint at their presence and provide more context to quantitative evaluation scores. The same holds true for methods that report existing correlations in data.
Training set-free, expert-curated evaluation benchmarks that focus on specific phenomena (Linzen Reference Linzen2020) are an obvious way to evaluate capabilities of NLP models without the confounding effects of spurious correlations between training and test data. Challenge sets discussed in this work, however, measure the performance on the investigated phenomenon on out-of-distribution data and provide informal arguments on why the distribution shift is negligible. How to formally disentangle this effect from the actual capability to process the investigated phenomenon remains an open question.
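The difference between the ERM objective and the worst-case objective over groups of training data, mentioned in the first recommendation above, can be illustrated with a minimal sketch. It assumes that per-example losses and a group identifier for each example are already available; the function names `erm_objective` and `worst_group_objective` are hypothetical, and the sketch only evaluates the two objectives rather than implementing a full training procedure.

```python
from collections import defaultdict


def erm_objective(losses):
    """Average loss over all training examples (standard ERM)."""
    return sum(losses) / len(losses)


def worst_group_objective(losses, groups):
    """Largest average loss over any group of training examples.

    losses: per-example loss values.
    groups: per-example group identifiers (assumed to be given,
            e.g. the training set or environment an example stems from).
    """
    per_group = defaultdict(list)
    for loss, group in zip(losses, groups):
        per_group[group].append(loss)
    return max(sum(v) / len(v) for v in per_group.values())
```

A model can obtain a low average loss while performing poorly on a small group; minimising the worst-group value instead targets exactly that group.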
Specifically for the area of NLU as discussed in this paper, we additionally outline the following recommendations:
Adapting methods applied to RTE datasets or developing novel methodologies to reveal cues and spurious correlations in MRC data is a possible future research direction.
The growing number of MRC datasets provides a natural test bed for the evaluation of out-of-distribution generalisation. However, studies concerning this (Talmor and Berant Reference Talmor and Berant2019; Fisch et al. Reference Fisch, Talmor, Jia, Seo, Choi and Chen2019; Miller et al. Reference Miller, Krauth, Recht and Schmidt2020) mostly focus on empirical experiments. Theoretical contributions, for example, by using the causal inference framework (Magliacane et al. Reference Magliacane, van Ommen, Claassen, Bongers, Versteeg and Mooij2017), could help to explain their results.
Additionally, due to its flexibility, the MRC task allows for the formulation of problems that are inherently hard for the state of the art, such as systematic generalisation (Lake and Baroni Reference Lake and Baroni2017). Experiments with synthetic data, such as those discussed in this paper, need to be complemented with natural datasets, such as evaluating the understanding of and appropriate reactions to new situations presented in the context. Talmor et al. (Reference Talmor, Tafjord, Clark, Goldberg and Berant2020) make a step in this direction.
While RTE is increasingly becoming a popular task to attribute various reading and reasoning capabilities to neural models, the transfer of those capabilities to different tasks, such as MRC, remains to be seen. Additionally, the MRC task requires further capabilities that cannot be tested in an RTE setting conceptually, such as selecting the relevant answer sentence from distracting context or integrating information from multiple sentences, both shown to be inadequately tested by current state-of-the-art gold standards (Jia and Liang Reference Jia and Liang2017; Jiang and Bansal Reference Jiang and Bansal2019). Therefore, it is important to develop those challenge sets for MRC models as well in order to gain a more focused understanding of their capabilities and limitations.
It is worth mentioning that – perhaps unsurprisingly – neural models’ notion of complexity does not necessarily correlate with that of humans. In fact, after creating a ‘hard’ subset of their evaluation data that is clean of spurious correlations, Yu et al. (Reference Yu, Jiang, Dong and Feng2020) report an increase in human performance, directly contrary to the neural models they evaluate. Partial baseline methods suggest a similar conclusion: without the help of statistics, humans will arguably not be able to infer whether a sentence is entailed by another sentence they never see, whereas neural networks excel at it (Poliak et al. Reference Poliak, Haldar, Rudinger, Hu, Pavlick, White and Van Durme2018; Gururangan et al. Reference Gururangan, Swayamdipta, Levy, Schwartz, Bowman and Smith2018).
We want to highlight that the availability of multiple large-scale datasets, albeit exhibiting flaws or spurious correlations, together with methods such as those discussed in this survey, is a necessary prerequisite to gain an empirically grounded understanding of what the current state-of-the-art NLU models are learning and where they still fail. This gives targeted suggestions when building the next iteration of datasets and model architectures, and thereby advances the research in NLP. While necessary, it remains to be seen whether this iterative process is sufficient to yield systems that are robust enough to perform any given natural language understanding task, the so-called ‘general linguistic intelligence’ (Yogatama et al. Reference Yogatama, D’Autume, Connor, Kocisky, Chrzanowski, Kong, Lazaridou, Ling, Yu, Dyer and Blunsom2019).
Appendix A. Inclusion Criteria for the Dataset Corpus
We expand the collection of papers introducing datasets that were investigated or used by any publication in the original survey corpus (e.g., those shown in Figure 1) by a Google Scholar search using the queries shown in Table A1. We include a paper if it introduces a dataset for an NLI task according to our definition and the language of that dataset is English; otherwise, we exclude it.
Appendix B. Detailed Survey Results
Table B1 shows those 45 datasets from Figure 5 broken down by year, where no quantitative methods to describe possible spurious correlations have been applied yet.
The following table shows the full list of surveyed papers, grouped by dataset and method applied. As papers potentially report the application of multiple methods on multiple datasets, they can appear in the table more than once: