1. Introduction
An important aspect of hypothesis-driven research is preregistration, an open science practice in which the research question(s), method(s) and analysis plan(s) are specified before data collection. Preregistration is a relatively simple yet powerful tool for improving transparency in bilingualism research, and we suggest that, in keeping with current best practices in the experimental sciences, bilingualism researchers include preregistration as an essential component of hypothesis-driven research, along with other open science practices such as releasing materials, data and code alongside publications (Chambers, Feredoes, Muthukumaraswamy & Etchells, 2014; Nosek, Ebersole, DeHaven & Mellor, 2018b; Nosek & Lakens, 2014; Nosek, Ebersole, DeHaven & Mellor, 2018a; Open Science Collaboration, 2015).
There are several positions regarding the goals of preregistration. Many researchers view it as a tool specific to confirmatory research because it can help assess the falsifiability of an experimental study's predictions, control the false positive error probability in null hypothesis significance testing (NHST), and mitigate researcher biases (e.g., Lakens, 2019; Chambers, 2019; Nosek, Beck, Campbell, Flake, Hardwicke, Mellor, van 't Veer & Vazire, 2019). Under this view, preregistration helps implement the distinction between confirmatory analyses (used for hypothesis testing) and exploratory analyses (used for hypothesis generation) (e.g., de Groot, 1956/2014; Chambers, 2019; Nosek et al., 2018b; Nosek et al., 2019; Wagenmakers, Wetzels, Borsboom, van der Maas & Kievit, 2012). More recently, preregistration has also been considered for qualitative research with the aim of making the documentation of research plans more transparent (Haven & Grootel, 2019). Other research groups acknowledge the contribution of preregistration to scientific transparency, but call into question the validity of the distinction between confirmatory and exploratory research, and the usefulness of preregistration for implementing this distinction (e.g., Devezer, Navarro, Vandekerckhove & Buzbas, 2020; Szollosi et al., 2020; Szollosi & Donkin, 2019, cf. Wagenmakers, 2019). From this point of view, a shift to the development of more explicit theories would make preregistration unnecessary.
In this paper, we take the position that preregistration is crucial to separate confirmatory from exploratory analyses. In our view, the preregistration of confirmatory hypotheses can counter questionable research practices and unconscious biases (Box 1). Consequently, it can enhance research transparency in confirmatory bilingualism (L2) research. Concerns about (non-)transparency and researcher biases are well known in psychological science (Wicherts, Borsboom, Kats & Molenaar, 2006; Simmons, Nelson & Simonsohn, 2011). L2 research is similarly affected by a lack of clarity about pre-data collection hypotheses and analysis plan choices. This problem is compounded by the fact that L2 studies rarely release their research materials (Derrick, 2016; Marsden, Thompson & Plonsky, 2018c) or their data (Larson-Hall & Plonsky, 2015; Bolibaugh, Vanek & Marsden, 2020).
To address these issues, two journals in the field of bilingualism, Language Learning and Bilingualism: Language and Cognition, have introduced a new type of article, Registered Reports, which allows researchers to submit their hypotheses, methods, and analysis protocols for peer review prior to data collection (Marsden, Morgan-Short, Trofimovich & Ellis, 2018b).
• The garden of forking paths
In hypothesis-driven research, there are many possible data analysis paths, and one of several potential paths can be selectively chosen and reported (Gelman & Loken, 2013, 2014). For example, one could choose a particular measure, region of interest or time-window that was not originally selected for analysis, or delete outliers based on an arbitrary criterion. Such multiple analysis paths cumulatively create so many researcher degrees of freedom that one can describe them using a decision tree. This bias is often an unconscious one (Gelman & Loken, 2013, pp. 9-10):
It's not that the researchers performed hundreds of different comparisons and picked ones that were statistically significant. Rather, they start with a somewhat-formed idea in their mind of what comparison to perform, and they refine that idea in light of the data. (...) they are using their scientific common sense to formulate their hypotheses in a reasonable way, given the data they have. The mistake is in thinking that, if the particular path that was chosen yields statistical significance, that this is strong evidence in favor of the hypothesis.
• Multiple testing
For purely statistical reasons, if one conducts enough statistical tests, some test will eventually come out significant. For example, in psycholinguistic eye-tracking reading research, one can easily end up conducting dozens of statistical tests to evaluate a single hypothesis. Simulations in von der Malsburg and Angele (2017) demonstrate that multiple analyses in eye-tracking dramatically inflate Type I error, leading to a large proportion of false positive rejections of the null hypothesis.
• Post-hoc hypothesizing
When data are analyzed without the predictions having been stated explicitly, one may easily convince oneself that an unforeseen result was expected all along, and subsequently report this unexpected finding as a confirmatory one. This bias is commonly referred to as ‘hypothesizing after the results are known’ (HARKing) (Simmons et al., 2011; Kerr, 1998). This can skew the scientific record with less well-grounded theories, cherry-picked after the fact (Chambers, 2019).
Here, we discuss a different approach: non-peer reviewed preregistration using open science platforms such as the Open Science Framework (OSF, https://osf.io/ ) or AsPredicted (https://aspredicted.org/ ). On these platforms, researchers can create a public or private, time-stamped, non-modifiable record of a planned study prior to data inspection, either before or during data collection. We argue that non-peer reviewed preregistration can counteract the questionable research practices described above (Box 1). We first illustrate them with an example from our own work on native (L1) sentence processing. Then, we discuss correlates in the L2 literature and explain how non-peer reviewed preregistrations can improve L2 research.
2. Possible pitfalls of hypothesis-driven research: An example from L1 sentence processing
We briefly introduce our study, which attempted to replicate the findings of an eye-tracking reading study that compared the processing of two different syntactic dependencies (Dillon, Mishler, Sloggett & Phillips, 2013; Jäger, Mertzen, Van Dyke & Vasishth, 2020). This example can be easily translated to bilingualism settings where, similar to our example, processing patterns are investigated for different syntactic constructions, but also for different speaker groups, such as native vs. non-native speakers (Felser & Cunnings, 2012; Grüter, Lew-Williams & Fernald, 2012), or successive vs. simultaneous learners (Lemmerth & Hopp, 2019; Sabourin & Vīnerte, 2015).
Our example concerns a phenomenon called agreement attraction. For subject-verb agreement dependencies, previous work has shown that a processing disruption elicited by an ungrammatical plural verb can be weakened if a plural noun (an “attractor”) intervenes between the subject and the verb (as in 1a vs. 1b; Wagers, Lau & Phillips, 2009; Pearlmutter, Garnsey & Bock, 1999; Dillon et al., 2013). Dillon and colleagues used a within-subjects design to examine whether the attraction effect extended to ungrammatical antecedent-reflexive dependencies, where an attractor matched the reflexive in number (1c vs. 1d).
(1)
a. Subject-verb agreement; attraction
*The amateur bodybuilder who worked with the personal trainers amazingly were competitive for the gold medal.
b. Subject-verb agreement; no attraction
*The amateur bodybuilder who worked with the personal trainer amazingly were competitive for the gold medal.
c. Reflexive; attraction
*The amateur bodybuilder who worked with the personal trainers amazingly injured themselves on the lightest weights.
d. Reflexive; no attraction
*The amateur bodybuilder who worked with the personal trainer amazingly injured themselves on the lightest weights.
Building on work by Sturt (2003), they argued that, unlike subject-verb agreement configurations, the processing of antecedent-reflexive dependencies should be syntactically constrained (Chomsky, 1981). If so, attraction effects were expected in subject-verb dependencies but not in antecedent-reflexive dependencies, yielding an interaction between dependency type and attraction.
Dillon et al. (2013) analyzed multiple reading measures and observed the predicted interaction only in total reading time. This result was taken as support for the hypothesis that subject-verb agreement and reflexives show different susceptibility to agreement attraction, and thus are differentially constrained by syntactic principles. In our large-sample replication study (Jäger et al., 2020), the goal was to replicate the statistically significant interaction in total reading time from the original study. Our confirmatory analysis of total reading time showed no effect, whereas the exploratory analyses of first-pass regressions and regression-path durations did (Table 1).
The study by Dillon and colleagues and our attempted replication serve to illustrate the potential issues of the garden of forking paths, multiple testing and post-hoc hypothesizing. First, even for a confirmatory replication study, where one analyzes the same region and reading measure that showed the interaction in the original study, garden of forking paths scenarios arise if an analysis path is not defined prior to data inspection. For example, different decisions regarding statistical tests and outlier treatment could still be made after data inspection.
Second, for the analyses of the Dillon et al. study and our replication study, six statistical tests were conducted. Testing six eye-tracking measures increases the Type I error probability from 5% to 26.5% if the tests are treated as independent (i.e., 1 − 0.95^6 ≈ 0.265) (Bonferroni, 1936). It is possible to correct for multiple testing. For example, a Bonferroni correction would require an adjusted Type I error of 0.05/6 ≈ 0.0083 for each of the six statistical tests we conducted, which implies that the absolute critical t-/z-value would be 2.64. If this criterion were used, there would be no significant effects in either the original study or the replication attempt (see observed z/t-values in Table 1). A better solution to the multiple testing problem may be to avoid it altogether by having precise predictions about the dependent measure(s), and to focus on (Bayesian) estimation of effects rather than NHST (e.g., Norouzian, 2020; Gelman & Carlin, 2014; Gelman et al., 2014; Kruschke, 2014).
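The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch: it assumes that the six tests are independent and uses a standard normal approximation for the critical value; the variable names are ours and do not come from the original analyses.

```python
from statistics import NormalDist

ALPHA = 0.05   # nominal per-test Type I error probability
N_TESTS = 6    # number of reading measures tested

# Probability of at least one false positive across N_TESTS tests,
# assuming (for simplicity) that the tests are independent.
familywise_error = 1 - (1 - ALPHA) ** N_TESTS

# Bonferroni-adjusted per-test alpha and the matching two-sided
# critical value under a standard normal approximation.
alpha_adjusted = ALPHA / N_TESTS
critical_z = NormalDist().inv_cdf(1 - alpha_adjusted / 2)

print(f"familywise error: {familywise_error:.3f}")  # 0.265
print(f"adjusted alpha:   {alpha_adjusted:.4f}")    # 0.0083
print(f"critical |z|:     {critical_z:.2f}")        # 2.64
```

Eye-tracking measures computed from the same fixations are typically correlated, so the independence assumption should be read as a simplification rather than as a description of real reading data.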
Third, suppose that the effect that was expected a priori at the critical auxiliary verb or the reflexive had been found further downstream in the sentence or even before the critical region. Without specifying the critical region in advance, one could easily have found a post-hoc theory for the effect showing up in another region and reported this as if it had been predicted all along.
Finally, both the original and the replication study show some evidence of the effect of interest. However, the effect occurs in different measures across the two studies. Because of the exploratory nature of the first-pass regression and regression-path duration results in the replication attempt, we cannot treat these hypothesis tests as confirmatory ones. Exploratory analyses per se are an important part of doing science, but they should be presented as such (e.g., Bishop, 2020; de Groot, 1956/2014; Nosek et al., 2018b).
3. Problematic research practices in L2 research
The issues above can also arise in L2 research. Two common examples of forks in the analysis path are outlier treatment and the selection of interest regions in reading studies. For example, a synthesis of methodological decisions in L2 self-paced reading (SPR) research showed a variety of outlier removal criteria across 64 studies, such as standard deviations around the mean, reading time cutoffs, or both (Marsden et al., 2018c; see Nicklin & Plonsky, 2020, for discussion of outlier treatment). Moreover, L2 reading studies on the same grammatical phenomena can vary substantially in their selection of interest regions. For a subset of the L2 SPR studies on local ambiguity processing synthesized in Marsden et al. (2018c), some studies reported statistical analyses for the ambiguous sentence region, and other studies for some, or all, of the subsequent regions. In addition, the critical regions varied between studies, consisting of a single word or several words combined.
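To make the outlier-treatment fork concrete, the following sketch simulates data with no true effect and compares two analysts: one who always applies a single exclusion rule fixed in advance, and one who tries several rules and reports the smallest p-value. Everything here is an illustrative assumption: the simulated standardized difference scores, the thresholds (2.5 SD, an absolute cutoff of 2.0), the function names, and the use of a normal approximation to the t-test. It is not the analysis of any study cited above.

```python
import random
from statistics import NormalDist, mean, stdev

def p_value(diffs):
    """Two-sided p-value for mean(diffs) == 0 (normal approximation to the t-test)."""
    z = mean(diffs) / (stdev(diffs) / len(diffs) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def no_exclusion(diffs):
    return diffs

def sd_rule(diffs, k=2.5):
    """Drop values more than k standard deviations from the mean."""
    m, s = mean(diffs), stdev(diffs)
    return [d for d in diffs if abs(d - m) <= k * s]

def cutoff_rule(diffs, limit=2.0):
    """Drop values beyond a fixed absolute cutoff."""
    return [d for d in diffs if abs(d) <= limit]

def both_rules(diffs):
    return sd_rule(cutoff_rule(diffs))

# The first path plays the role of the preregistered analysis.
PATHS = [no_exclusion, sd_rule, cutoff_rule, both_rules]

def simulate(n_sims=1000, n_participants=100, alpha=0.05, seed=1):
    """Return false positive rates for the fixed and the flexible analyst."""
    rng = random.Random(seed)
    fixed_hits = flexible_hits = 0
    for _ in range(n_sims):
        # Null data: per-participant condition differences with true mean 0.
        diffs = [rng.gauss(0.0, 1.0) for _ in range(n_participants)]
        p_by_path = [p_value(path(diffs)) for path in PATHS]
        fixed_hits += p_by_path[0] < alpha        # path chosen in advance
        flexible_hits += min(p_by_path) < alpha   # best path chosen post hoc
    return fixed_hits / n_sims, flexible_hits / n_sims

fixed_rate, flexible_rate = simulate()
print(f"fixed path:    {fixed_rate:.3f}")
print(f"flexible path: {flexible_rate:.3f}")
```

Because the fixed path is one of the candidate paths, the flexible analyst's false positive rate can never fall below the fixed analyst's; how far it rises depends on how many rules are tried and how different they are.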
A problem closely related to the selective reporting of interest regions is conducting statistical tests for many different regions and/or eye-tracking measures. Godfroid (2020) reported that an average of 3.4 eye-tracking measures per study are analyzed in the L2 eye-tracking literature, which further inflates Type I error probability. The Type I error issue might be particularly prevalent in L2 studies because many of them use frequentist NHST and only report binary decisions about the presence or absence of an effect without also reporting effect estimates (Marsden et al., 2018c). One unfortunate consequence is that other researchers cannot gain knowledge about the magnitude of an effect across studies, or conduct meta-analyses, due to the lack of information from previous studies (Plonsky, 2013; Larson-Hall & Plonsky, 2015; Plonsky & Oswald, 2014; Al-Hoorie & Vitta, 2019; for an introduction to meta-analyses in bilingualism research, see Plonsky & Oswald, 2015; Plonsky, Sudina & Hu, 2020).
Finally, as in our example on L1 processing, post-hoc hypothesizing, i.e., changing a hypothesis to match the findings, may reduce the reproducibility of L2 research (Marsden, Morgan-Short, Thompson & Abugaber, 2018a; Marsden, Morgan-Short, Trofimovich & Ellis, 2018b; Chambers, 2019). Possibly owing in part to the issues raised above and to low statistical power (Cohen, 1962, 1988; Brysbaert, 2020), inconsistent findings also occur in L2 research. Some examples include the role of crosslinguistic influence in syntactic processing (Dussias, Dietrich & Villegas, 2015; Lago, Mosca & Stutter Garcia, 2020), the existence of a bilingual advantage in attentional systems (Bialystok, 2017; Paap, Anders-Jefferson, Mason, Alvarado & Zimiga, 2018), and the role of morphological decomposition in inflected vs. derived forms during word recognition in native vs. non-native speakers (Clahsen & Veríssimo, 2016; Feldman & Kroll, 2019). Next, we discuss how a non-peer reviewed preregistration can be implemented to improve L2 research.
4. Non-peer reviewed preregistration in psycholinguistic research
For preregistration to counter questionable research practices and biases, it is not sufficient to specify the dependent measure(s) a priori, because many researcher degrees of freedom remain. A complete preregistration requires a full description of the research questions and hypotheses, study design, methods, speaker group selection criteria, data collection procedure, participant sample size or stopping rule, outcome variable(s), as well as an analysis plan including statistical models, information on data exclusion and statistical inference criteria. This not only ensures greater transparency but can also keep one's biases in check, because analysis decisions are made public prior to data analysis, preventing selective reporting of effects. For example, assume that for a planned study we preregister no outlier exclusions, but later find an effect only when removing certain data points. This could be reported as an exploratory finding. Without preregistration, it may be tempting to report the most ‘interesting’ result as confirmatory, preventing other researchers from evaluating the findings in light of the analysis choices. In addition, if our published preregistration committed to a predicted effect for a particular region and measure, based on theory or previous findings, we can no longer convince ourselves that a surprising result was originally predicted and restate the hypotheses post-hoc.
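The components listed above can be turned into a simple self-check before a preregistration is frozen. The sketch below is hypothetical: the class, its field names and the example entries are our own shorthand for the components named in the text, not an OSF or AsPredicted template.

```python
from dataclasses import dataclass, fields

@dataclass
class PreregistrationDraft:
    """Hypothetical checklist: one free-text field per component named in the text."""
    research_questions_and_hypotheses: str = ""
    study_design: str = ""
    methods: str = ""
    speaker_group_selection_criteria: str = ""
    data_collection_procedure: str = ""
    sample_size_or_stopping_rule: str = ""
    outcome_variables: str = ""
    statistical_models: str = ""
    data_exclusion: str = ""
    statistical_inference_criteria: str = ""

def missing_components(draft):
    """Return the names of all components that are still empty."""
    return [f.name for f in fields(draft) if not getattr(draft, f.name).strip()]

# A draft that specifies only the dependent measure and the exclusion policy.
draft = PreregistrationDraft(
    outcome_variables="total reading time at the critical region",
    data_exclusion="no outlier exclusions",
)
for name in missing_components(draft):
    print("still unspecified:", name)
```

A draft like this one, which fixes the dependent measure but leaves the statistical model and inference criteria open, is exactly the case in which many researcher degrees of freedom remain.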
One may argue that if one has strong theoretical predictions, preregistration is redundant because the analysis choices are predetermined by the theory. However, Silberzahn et al. (2018) convincingly illustrated that different analysis choices can be made even under highly constraining conditions. Their study recruited 29 research groups in the psychological sciences to answer the same research question for one particular dataset. Of the 29 groups, 20 observed a significant and nine a non-significant result. Strikingly, the range of effect estimates reported by the different research groups allowed for different conclusions.
Although we take the view that preregistration without peer review can be an effective way to reduce unconscious biases in one's work, the lack of peer review means that the preregistration of a study can be as thorough or as vague as the researcher deems appropriate. Vaguely specified research plans still allow for many possible analysis paths, and selective reporting of effects. Consequently, it is up to the scientific community to make non-peer reviewed preregistration a success or a failure: only a thoroughly implemented preregistration and a precisely followed research plan can reduce unconscious biases and help to separate confirmatory hypothesis tests from exploratory ones.
4.1 Selecting dependent measures for a preregistration
If one wants to preregister a study, but lacks prior knowledge of a particular phenomenon, an experiment could be piloted and exploratory analyses conducted to identify which measure(s) show the predicted effect. One could then generate hypotheses from this and test them in a confirmatory study (e.g., Nicenboim, Vasishth, Engelmann & Suckow, 2018; Nicenboim, Vasishth & Rösler, 2020). If, on the other hand, there are previous findings on a phenomenon, these could serve as the basis for a preregistration. However, when the literature shows equivocal results as discussed above, what steps could be taken to consolidate the support in favor of or against a theory? This is not straightforward. For example, in the Dillon et al. (2013) study and our replication study, the effect of interest was observed in different reading measures. If, based on linguistic theory, we believe that the effect of interest should be found in earlier reading measures (first-pass regression and regression-path duration as in our replication study), the only way to test this is by conducting a replication study. This replication should aim for a sufficiently large participant sample and a sufficiently precise effect estimate, and specify the dependent measure(s) and critical region(s) in advance. Otherwise, in a future study we may find some other dependent measure showing the effect, which may again tempt us to draw a bullseye around the arrow that happened to land where it did.
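One rough way to think about "a sufficiently precise effect estimate" is to ask how many participants are needed for the confidence interval around a mean difference to reach a target width. The sketch below uses the textbook normal-approximation formula n = (z × SD / half-width)^2. It assumes one difference score per participant and a known standard deviation of those differences; the function name and the numbers (an SD of 100 ms, half-widths of 20 ms and 10 ms) are hypothetical and not taken from the studies discussed here.

```python
from math import ceil
from statistics import NormalDist

def participants_needed(sd, half_width, confidence=0.95):
    """Smallest n for which the confidence interval of a mean difference
    has at most the given half-width (normal approximation, known SD)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sd / half_width) ** 2)

# Hypothetical values: SD of the by-participant differences = 100 ms.
print(participants_needed(sd=100, half_width=20))  # 97
print(participants_needed(sd=100, half_width=10))  # 385
```

Halving the target half-width roughly quadruples the required sample, which is one reason a planned sample size or stopping rule belongs in the preregistration itself.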
4.2 How to get started with a non-peer reviewed preregistration
Preregistration templates are available on OSF and AsPredicted for novel studies as well as for replication studies (e.g., https://bit.ly/OSFtemplates ; https://bit.ly/AsPredtemplate ). If one prefers to create a Registered Report-type preregistration (i.e., in manuscript format), it is possible to upload a preregistration manuscript on OSF. It is not enough to upload this document to the project's public repository, because the preregistration could be removed or replaced at any point. Rather, one needs to create a time-stamped, non-editable version, which can either be made public immediately or be embargoed until, for example, the associated paper is submitted or published. If the preregistration is withdrawn at any stage after creating a “frozen” version of it, some metadata (title, authors, description, reason for withdrawing preregistration) will remain publicly available. A new version of the preregistration can be made available before the data are inspected. We have previously made attempts at such manuscript-style preregistrations, e.g., for Vasishth, Mertzen, Jäger and Gelman (2018) (see https://osf.io/dgewb for the non-editable preregistration).
5. Conclusion
We have used examples from L1 sentence processing and the L2 literature to illustrate some of the problems that can arise during the research process. We then discussed how preregistration allows researchers to better separate confirmatory and exploratory analyses, which can help them counter questionable research practices and unconscious biases. Our view is that, if done thoroughly, non-peer reviewed preregistration would greatly benefit the bilingualism community. We suggest that the hypothesis-driven L2 research process should standardly include preregistration, in addition to the release of materials, data and code upon publication to increase research transparency and reproducibility.
Acknowledgements
We thank João Veríssimo, Laura de Ruiter, Cylcia Bolibaugh, and Luke Plonsky for their valuable feedback on the earlier version of this paper. This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 317633480 – SFB 1287, Projects B03 and Q (PIs: Shravan Vasishth and Ralf Engbert).
Competing interests
The authors declare none.
Supplementary materials
For data and code accompanying this paper, visit https://osf.io/5ab7d/.