Introduction
A recent commentary on the bilingual advantage in executive function (Duñabeitia & Carreiras, 2015) optimistically concludes that veritas est temporis filia, truth is the daughter of time. The phrase captures the notion that the scientific enterprise is cumulative, and though false pistes might be taken, these are ultimately corrected. Nonetheless, there are reasons to hold a more sober view (Ioannidis, 2012). As Duñabeitia and Carreiras highlight, one precondition for progress is an unbiased publishing system in which the robustness of research is the primary criterion for publication. Another is the complete disclosure of all steps and processes underlying published outputs. Unfortunately, complete disclosure has been the exception rather than the norm (Young, Ioannidis & Al-Ubaydli, 2008).
Bilingualism research, and some areas within bilingualism research in particular, have not made the progress that one might expect, given ‘a global research effort of unprecedented magnitude’ (Hartsuiker, 2015, p. 336). In the present piece, we discuss ways in which minimum standards of methodological transparency, necessary for both reproducibility and replicability, can overcome the crisis of confidence in bilingualism research. We argue that these minimum standards are not only necessary to distinguish between ‘helpful’ and ‘unhelpful’ replication attempts (National Academies of Sciences & Medicine, 2019) and thus build a cumulative scientific enterprise, but that they also enable a series of methodological innovations that have the potential to accelerate the research cycle. To briefly preview our argument, full disclosure of data and code is necessary not only to assess the reproducibility of original findings, but also to test the robustness of these findings to different analytic specifications. Similarly, full provision of experimental materials and protocols underpins assessment of both the replicability of original findings and their generalisability to different contexts and samples. We illustrate each section of the review with recent impactful examples and follow with pointers for those looking to share their data and code, and materials and protocols.
Open data and analytic code
Sharing of data and code (such as R scripts, or SPSS syntax that can be generated through the graphical user interface) underpins computational reproducibility, and is necessary for the verification of individual studies, but also confers other benefits which we elaborate below.
Computational reproducibility
In many cases, exact replication of a study can be prohibitive or difficult. The reasons underlying this difficulty may be related to the characteristics of a particular sample of participants (e.g., Kindertransport survivors in Schmid, 2002; adult international adoptees in Pallier, Dehaene, Poline, LeBihan, Argenti, Dupoux & Mehler, 2003), or the design of the study itself (e.g., the Barcelona Age Factor which exploited a change in curricular language provision; Muñoz, 2006), among other factors. Longitudinal and panel studies (e.g., Xavier Vila, Ubalde, Bretxa & Comajoan-Colomé, 2018) may be particularly difficult to replicate. In these cases, an “attainable minimum standard” (Peng, 2011) for verifying scientific claims is via an assessment of the computational reproducibility of the analyses.
Providing the data and computer code necessary to re-run analyses and re-create the results in published outputs can be key to catching potentially harmful errors at an early stage. Surveys of statistical errors at the reporting stage (Nuijten, Hartgerink, van Assen, Epskamp & Wicherts, 2016), as well as the coding stage (Ziemann, Eren & El-Osta, 2016), have found that these appear in up to half of sampled articles, and frequently have implications for the substantive conclusions drawn (see Herndon, Ash & Pollin, 2014, for a notable coding error).
The extent of computational reproducibility within bilingualism research is currently unknown, but efforts from adjoining disciplines may be indicative of general trends. Plonsky, Egbert and Laflair (2015) solicited datasets from 255 candidate studies published between 2002 and 2012 in Language Learning and Studies in Second Language Acquisition, and received 37 (approximately 15%). Two similar studies reported only slightly higher figures in journals with mandatory data sharing policies: Stodden, Seiler and Ma (2018) estimated that 44% of the 204 articles they sampled from Science had at least some recoverable data and code, and that 26% of the sample were potentially reproducible. Hardwicke, Mathur, MacDonald, Nilsonne, Banks, Kidwell, Hofelich Mohr, Clayton, Yoon, Henry Tessler, Lenne, Altman, Long and Frank (2018) found that nearly half of articles sampled from Cognition (85/174) had datasets which were likely to be reusable. The authors were able to reproduce published values in 63% of a subset of these articles, though author assistance was needed for half the cases. Thus, despite growing numbers of calls for sharing of data as a matter of course, the realities of data sharing in related disciplines suggest that it is still relatively uncommon, and that the actual reproducibility of results is likely to be low.
Though reanalyses of existing studies in bilingualism are relatively few to date, they have the potential to make significant impact. One early example is Vanhove's (2013) reanalysis of data from DeKeyser, Alfi-Shabtay and Ravid (2010), using piecewise regression to test the long-contested relationship between age of acquisition and ultimate attainment. Results pointed to a need to qualify earlier conclusions since a discontinuity in age effects was only found in one of the two datasets reanalysed. Evaluating the technical validity of earlier statistical approaches brought a twofold benefit. It highlighted the problem of arbitrary binning of continuous variables, and emphasised the usefulness of reanalysing existing studies by moving beyond linear statistics where curvilinear approaches are more suitable.
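To make the idea of piecewise regression concrete, the following is a minimal sketch in Python (standard library only). It is not the analysis reported by Vanhove; it simply grid-searches a single breakpoint for a two-segment linear model and compares the fit with a single straight line. The data are synthetic, and the function names, the assumed breakpoint of 18 and the noise level are all illustrative assumptions.

```python
import random

def solve(a, b):
    """Solve a small square system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def least_squares(rows, y):
    """Ordinary least squares via the normal equations; returns (coefficients, rss)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    rss = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2 for r, yi in zip(rows, y))
    return coef, rss

def fit_breakpoint(x, y, candidates):
    """Fit y = b0 + b1*x + b2*max(x - bp, 0) for each candidate bp; keep the lowest RSS."""
    best = None
    for bp in candidates:
        rows = [[1.0, xi, max(xi - bp, 0.0)] for xi in x]
        coef, rss = least_squares(rows, y)
        if best is None or rss < best[2]:
            best = (bp, coef, rss)
    return best

# Synthetic data (illustrative only): attainment declines steeply with age of
# acquisition up to an assumed breakpoint of 18, then more slowly.
rng = random.Random(1)
ages = [1 + 39 * i / 199 for i in range(200)]
scores = [90 - 1.5 * a + 1.2 * max(a - 18, 0) + rng.gauss(0, 0.3) for a in ages]

bp, coef, rss_piecewise = fit_breakpoint(ages, scores, candidates=range(5, 36))
_, rss_linear = least_squares([[1.0, a] for a in ages], scores)
print(f"estimated breakpoint: {bp}")
print(f"RSS piecewise: {rss_piecewise:.1f}, RSS single line: {rss_linear:.1f}")
```

Because age is kept continuous, no arbitrary age bins are needed; whether a breakpoint model is actually warranted in real data would still need to be assessed with an appropriate model comparison.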
Analytic robustness
Beyond assuring the verifiability of results, the sharing of data and code enables a more stringent test of the robustness of published findings to different specifications of analysis. Researchers who prepare a data set for analysis must make a series of decisions regarding which data to combine, transform, or exclude. In a given study, for example, a researcher may need to decide whether and how to combine aspects of language experience and use into a single bilingualism quotient, which indices of executive function tasks to use as predictors, and how to treat outliers in response times. Choices such as these are frequently referred to as researcher degrees of freedom (Simmons, Nelson & Simonsohn, 2011). While many such choices appear methodologically or substantively arbitrary, they can be consequential to the inferences drawn. A recent study asking 29 teams of analysts to independently answer a research question given the same data set (Silberzahn, Uhlmann, Martin, Anselmi, Aust, Awtrey, Bahník, Bai, Bannard, Bonnier, Carlsson, Cheung, Christensen, Clay, Craig, Dalla Rosa, Dam, Evans, Flores Cervantes, Fong, Gamez-Djokic, Glenz, Gordon-McKeon, Heaton, Hederos, Heene, Hofelich Mohr, Högden, Hui, Johannesson, Kalodimos, Kaszubowski, Kennedy, Lei, Lindsay, Liverani, Madan, Molden, Molleman, Morey, Mulder, Nijstad, Pope, Pope, Prenoveau, Rink, Robusto, Roderique, Sandberg, Schlüter, Schönbrodt, Sherman, Sommer, Sotak, Spain, Spörlein, Stafford, Stefanutti, Tauber, Ullrich, Vianello, Wagenmakers, Witkowiak, Yoon & Nosek, 2018) concluded that ‘significant variation in the results of analyses of complex data may be difficult to avoid, even by experts with honest intentions’ (p. 338).
Looking to meta-research in related disciplines can inform us about the robustness of analyses in bilingualism. Plonsky et al. (2015) followed their survey of data availability in Language Learning and Studies in Second Language Acquisition with an assessment of the robustness of the subset of studies with usable data; when they applied a testing method that made different assumptions (viz., bootstrapping), they found that a quarter of previously significant focal tests were no longer significant. A different approach to assessing robustness was taken by Steegen, Tuerlinckx, Gelman and Vanpaemel (2016), who constructed a series of datasets by iterating through all reasonable choices in data processing. By repeating their analysis over these differently constructed datasets (more than 100 reanalyses), the authors demonstrated the power of a multiverse analysis to ‘reduce the problem of selective reporting by making the fragility or robustness of the results transparent, and … [identify] the most consequential choices’ (p. 707).
A similar approach was recently adopted by Poarch, Vanhove and Berthele (2019), who carried out a multiverse analysis of the bilingual executive function advantage in bidialectals. By documenting a range of possible analyses when varying data exclusion criteria, and the coding of the flanker and Simon effects, the authors illustrated the potential effects of subjective choices on result interpretations. This study is a particularly useful example of good practice in the context of substantial variation across studies on the effects of bilingualism on executive function.
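The logic of a multiverse analysis can be sketched in a few lines of Python. The example below is not the procedure used in either study cited above; it is a toy illustration under stated assumptions: the participants are simulated, the response time cutoffs (0 or 200 ms lower bound; 1000 ms, 1500 ms or no upper bound) and the two ways of coding the flanker effect are hypothetical choices, and the outcome is a simple difference between group means rather than a fitted statistical model.

```python
import itertools
import random
import statistics

def simulate_participant(rng, group):
    """Hypothetical flanker data: response times (ms) per condition."""
    cost = 60 if group == "monolingual" else 50  # assumed, illustrative conflict cost
    congruent = [rng.gauss(500, 80) for _ in range(40)]
    incongruent = [rng.gauss(500 + cost, 80) for _ in range(40)]
    incongruent[0] = 2500.0  # one very slow trial, so that exclusion rules matter
    return {"group": group, "congruent": congruent, "incongruent": incongruent}

def flanker_effect(participant, low, high, coding):
    """Conflict effect for one participant under one set of processing choices."""
    def trimmed_mean(rts):
        return statistics.mean(rt for rt in rts if low <= rt <= high)
    con = trimmed_mean(participant["congruent"])
    inc = trimmed_mean(participant["incongruent"])
    return inc - con if coding == "difference" else (inc - con) / con

def multiverse(data, lows, highs, codings):
    """Repeat the same group comparison over every combination of choices."""
    results = []
    for low, high, coding in itertools.product(lows, highs, codings):
        by_group = {"monolingual": [], "bilingual": []}
        for p in data:
            by_group[p["group"]].append(flanker_effect(p, low, high, coding))
        diff = statistics.mean(by_group["monolingual"]) - statistics.mean(by_group["bilingual"])
        results.append({"low": low, "high": high, "coding": coding, "group_difference": diff})
    return results

rng = random.Random(2019)
data = [simulate_participant(rng, g) for g in ("monolingual", "bilingual") for _ in range(30)]
results = multiverse(
    data,
    lows=(0, 200),
    highs=(1000, 1500, float("inf")),
    codings=("difference", "proportion"),
)
for row in results:
    print(row)
```

Reporting the full table of specifications, rather than a single preferred one, is what makes the fragility or robustness of a result visible to readers.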
Research synthesis and planning
A final benefit of providing data and code alongside published outputs concerns the development of research syntheses, and the planning of future research. Aggregating findings across a line of research is typically carried out through meta-analyses of summary effects from primary studies, yet the basic information required to compute effects is often missing from primary reports (Larson-Hall & Plonsky, 2015). A culture of archiving data will not only increase the number of studies included in future meta-analyses, but also enable more sophisticated research syntheses using either trial or participant level data (see the special issue of Psychological Methods, Curran, 2009; Glass, 2000). The power of this approach to detect small effects, and hence adjudicate between inconsistent findings, can be seen in a study by Nicenboim, Vasishth and Rösler (2019) addressing the recent large scale, multisite ‘failure to replicate’ anticipatory effects in language comprehension (Nieuwland, Politzer-Ahles, Heyselaar, Segaert, Darley, Kazanina, Von Grebmer Zu Wolfsthurn, Bartolozzi, Kogan, Ito, Mézière, Barr, Rousselet, Ferguson, Busch-Moreno, Fu, Tuomainen, Kulakova, Husband, Donaldson, Kohu, Rueschemeyer & Huettig, 2018). In a meta-analysis with trial-level data, the authors found evidence for a clear but small effect of prediction that emerged only when the data were analysed across multiple studies.
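Why a small effect may emerge only across studies can be illustrated with a deliberately simplified calculation. The sketch below uses fixed-effect, inverse-variance pooling of summary estimates, which is a much simpler technique than the trial-level analysis described above and is swapped in here purely for illustration. The five effect estimates and their standard errors are invented numbers, not data from any cited study.

```python
import math

def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted mean of study estimates and its standard error."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# Invented summary results from five hypothetical studies of the same effect
estimates = [0.10, 0.14, 0.08, 0.12, 0.11]
standard_errors = [0.08, 0.08, 0.08, 0.08, 0.08]

# Each study alone: estimate divided by its standard error
single_study_z = [e / se for e, se in zip(estimates, standard_errors)]

pooled, pooled_se = pool_fixed_effect(estimates, standard_errors)
pooled_z = pooled / pooled_se

print("single-study z values:", [round(z, 2) for z in single_study_z])
print(f"pooled estimate: {pooled:.3f}, SE: {pooled_se:.3f}, z: {pooled_z:.2f}")
```

In this invented example no single study exceeds the conventional threshold of z = 1.96, whereas the pooled estimate does, because pooling shrinks the standard error while leaving the estimated effect small.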
More realistic estimation of effect sizes will further enable researchers to consider what effect sizes might be considered relevant, and shift to planning of studies powered to detect the ‘smallest effect size of interest’ (Lakens, Scheel & Isager, 2018). Asking researchers to consider what effect sizes can be studied reliably may also mitigate future ‘decline effects’ like that identified by de Bruin and Della Sala (2015) in the bilingual advantage literature. The decline effect refers to a phenomenon whereby strong initial evidence for a novel effect diminishes as a line of research develops. De Bruin and Della Sala attribute the decline effect to a combination of statistical regression to the mean, and difficulties in publishing small or null effects.
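As a rough illustration of what planning around a smallest effect size of interest implies, the sketch below computes the approximate number of participants per group needed for a two-sided, two-group comparison of means, using the standard normal approximation n = 2((z_{1-α/2} + z_{power}) / d)². The function name and the default values (α = .05, power = .80) are assumptions made for the example; a t-based or simulation-based calculation would give slightly larger numbers and would be preferable for an actual study.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample comparison.

    d is the smallest standardised mean difference (Cohen's d) of interest.
    Uses the normal approximation, so results are slightly optimistic.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Under these assumptions, halving the effect size of interest roughly quadruples the required sample, which is one reason why small samples are poorly suited to adjudicating small effects.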
Good practice in reproducibility
The examples discussed above highlight ways in which integrating reproducibility into bilingualism research has helped the field make theoretical advances. Nonetheless, they are not particularly illuminating to the researcher looking to share their data and analysis code now. An overview of issues involved in making research data available for dissemination can be found in the data sharing primer from UKRN (Towse et al., 2020). Further tangible guidance is available in recently published tutorials such as Klein, Hardwicke, Aust, Breuer, Danielsson, Hofelich Mohr, Ijzerman, Nilsonne, Vanpaemel and Frank (2018), as well as the inaugural issue of Advances in Methods and Practices in Psychological Science (Challenges in Making Data Available, 2018). Here, we briefly signpost some additional resources that can help implement the key principles of organisation, documentation, automation and dissemination necessary for reproducibility.
The simplest way to ensure the reproducibility of a research project is to plan for it from the beginning. This is the approach taken by the Project Tier Protocol (https://www.projecttier.org/), an opinionated framework that provides a clear template and workflow for creating and documenting a reproducible research project. The Project Tier protocols are a good entry point for researchers working with commercial analysis software such as SPSS, Stata, or SAS; they contain guidance on how to manually create meta-data, data codebook, and read-me files that supplement the syntax files available from these packages – and ensure that the distinction between processed data and raw or original data is preserved.
For researchers working in open source software environments like the R computing language (R Core Team, 2013), a number of packages that assist reproducible project management are available. One comprehensive package, workflowr (Blischak, Carbonetto & Stephens, 2019), combines literate programming and version control with reproducibility checks, and is aimed at those with minimal experience with version control systems. Beyond R, Code Ocean (Clyburne-Sherin, Fei & Green, 2019) (https://codeocean.com/) provides online modular containers for a large number of widely used software environments along with code and data, and runs in a browser. Code Ocean is useful for helping researchers without experience of using dedicated containerisation software to manage their code dependencies and guard against parts of their analysis ‘breaking’ as software packages are updated; additionally, each capsule is assigned a DOI to ensure that it is persistently findable.
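Even without dedicated tooling, a record of the software environment can be archived alongside the analysis code. The following minimal Python sketch is not part of workflowr or Code Ocean and does not replace containerisation; it only writes down the interpreter, platform and package versions in use. The function name and the package list passed to it are hypothetical and should be replaced with whatever a given analysis actually imports.

```python
import json
import platform
from importlib import metadata

def environment_record(packages):
    """Collect interpreter, platform and installed package versions as a dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }

# Hypothetical package list, for illustration only
record = environment_record(["numpy", "pandas"])
print(json.dumps(record, indent=2))
```

Saving this output as a text file next to the analysis scripts gives later readers a starting point for rebuilding a comparable environment.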
Open materials and protocols
The availability of data elicitation materials and study protocols underpins the development of systematic lines of research. When materials are available, researchers can evaluate the comparability of constructs and their operationalisations across studies. Establishing the commensurability of data elicitation measures also allows researchers to analyse pooled data across studies, in Integrative Data Analyses, an alternative to meta-analyses (Bauer & Hussong, 2009). Finally, open materials and protocols are especially important for the planning of replication studies. Replication studies play a central role in the accumulation of evidence for or against a hypothesis (Leek & Peng, 2015), and, when preregistered and conducted at scale (e.g., Morgan-Short, Marsden, Heil, Issa, Leow, Mikhaylova, Mikołajczak, Moreno, Slabakova & Szudarski, 2018), may present the least biased way of estimating effects: a recent comparison of 15 meta-analyses to multi-site, pre-registered replications on the same topics found that meta-analyses systematically inflated effect sizes even after corrective measures had been taken (Kvarven, Strømland & Johannesson, 2019).
As is the case with sharing of data and code, existing meta-research suggests that materials and protocols in bilingualism research are not yet routinely archived or shared. In a methodological synthesis of the use of self-paced reading in studies investigating adult bilingual participants, Marsden, Thompson and Plonsky (2018) found that only 4% of 71 eligible studies had full materials available, and 77% gave just one brief example of stimuli. A survey of instrument availability across three journals in second language research found that only 17% of instruments were available between 2009 and 2013 (Derrick, 2016). Likewise, Hardwicke, Wallach, Kidwell, Bendixen, Crüwell and Ioannidis (2020), sampling a broader range of social science literature between 2014 and 2017, found that materials availability was indicated for only 11% of 151 sampled studies, and protocols availability for none. The lack of detailed protocols is particularly worrying in light of findings that researchers believe that unreported lab practices may influence the outcomes of their research (Brenninkmeijer, Derksen & Rietzschel, 2019).
Unfortunately, the current lack of transparency regarding instrumentation and protocols presents an important threat to the quality of replication efforts. A synthesis of replication studies in second language learning (Marsden, Morgan-Short, Thompson & Abugaber, 2019) found that only 3 of the original 67 studies that were replicated had provided all of their materials. In the absence of full reporting of materials and instructions, non-replications become contentious rather than informative, generating debate around the fidelity of the replication attempt rather than an understanding of the limiting conditions of an effect (e.g., Grundy & Bialystok, 2019).
From this admittedly low base, a growing number of initiatives and individual examples of good practice are addressing the conditions underpinning replicability. Firstly, care has been taken in theorising and measuring language proficiency (Kaushanskaya, Blumenfeld & Marian, 2019), language exposure (Anderson, Mak, Chahi & Bialystok, 2018), and language dominance (Dunn & Fox Tree, 2009); this care is now being extended to examine constructs and tasks in executive function (e.g., Paap & Greenberg, 2013; Poarch & Van Hell, 2019). More generally, materials availability is increasing. Digital objects associated with published reports in bilingualism research can now be found in generalist (e.g., Figshare, the Open Science Framework), and discipline specific repositories (e.g., the IRIS Repository of Instruments for Research into Second Languages). As a community supported repository archiving instruments, materials and stimuli for research into second and foreign languages, IRIS now also hosts special collections of instruments (e.g., 63 self-paced reading tasks). Finally, replicability and reproducibility have become priorities for a growing number of bilingualism researchers; for example, Poort and Rodd's (2018) publicly accessible project archiving data elicitation materials, protocols, data, and analysis scripts exemplifies the systematic and transparent reporting necessary for future close replication. Beyond the efforts of individual researchers, a recent call for registered replications of second language studies with non-academic participant samples (Andringa & Godfroid, 2019) is systematically addressing questions around the contextual generalisability of L2 research. Similar efforts will be needed to more explicitly consider the role of bilinguals’ histories of language learning and use (Mishra, 2018).
Good practice in replicability
In order to replicate a research study, one needs the full set of stimuli (e.g., pictures, participant instructions, software setup, test items, response options, distractors) used to elicit the data. As this level of detail is usually more information than is conventionally accepted in a publication methods section, archiving all non-proprietary material in a public repository, and linking the material to the publication itself is an important first step. Practical guidance on sharing materials can be found in a recent tutorial from the founders of Databrary (Gilmore, Lorenzo Kennedy & Adolph, 2018).
Researchers have a number of choices regarding where to host their materials. While many behavioural tasks can now be shared in task specific repositories (e.g., PsychoPy, jsPsych, and lab.js experiments can be shared on the Pavlovia platform, pavlovia.org), and other researchers may share materials on their own websites or general repositories like the Open Science Framework, there is a further tangible benefit to also archiving protocols, instruments and materials in domain specific repositories such as IRIS. Domain specific materials repositories increase the comparability of sources of data; for example, once uploaded to IRIS, materials are associated with rich, searchable meta-data, with parameters for Research Area, Instrument Type, Data Type, Participant Type, Language Feature, among many others. These collections in turn enable meta-research on constructs and methods, such as that exemplified by Marsden et al.'s (2018) methodological synthesis of the use of self-paced reading in second language research.
While archiving data elicitation materials is an important and relatively straightforward step, it may not be sufficient. Going forward, a key shortcoming to address is the lack of standardised formats to document data elicitation procedures. A method which may have promise, and which is being trialled in conjunction with Stage 1 Registered Reports, is the use of video recording of study protocols (Heycke and Spitzer, 2019; Spitzer and Heycke, 2020). The potential of this approach can be seen in the Databrary repository, which not only specifically encourages the archiving of video documentation of study procedures, participant instructions, apparatuses and testing contexts, but also provides tools to code, quantify and systematically compare differences across studies (Gilmore & Adolph, 2017).
Recommendations going forward
This review has attempted to illustrate something every researcher knows: the lifecycle of any research study is beset by a series of decisions, many of which are essentially arbitrary, whose consequences are usually unknown. Debates regarding tasks, coding, and analysis seldom arise, except when inconsistencies and failures to replicate threaten previously established findings. Compounding these issues, our current publication practices neither prioritise nor straightforwardly accommodate complete disclosure of research procedures.
We have argued that one simple remedy with the potential to minimise unhelpful sources of non-replicability is to ensure that published reports are accompanied by the archiving, and public release where possible, of study materials, protocols, data and analysis scripts. Of course, transparency does not guarantee quality, and further recommendations exist, including the need to make sure that data adhere to FAIR principles (Wilkinson, Dumontier, Aalbersberg, Appleton, Axton, Baak, Blomberg, Boiten, da Silva Santos, Bourne, Bouwman, Brookes, Clark, Crosas, Dillo, Dumon, Edmunds, Evelo, Finkers, Gonzalez-Beltran, Gray, Groth, Goble, Grethe, Heringa, ’t Hoen, Hooft, Kuhn, Kok, Kok, Lusher, Martone, Mons, Packer, Persson, Rocca-Serra, Roos, van Schaik, Sansone, Schultes, Sengstag, Slater, Strawn, Swertz, Thompson, Van Der Lei, Van Mulligen, Velterop, Waagmeester, Wittenburg, Wolstencroft, Zhao & Mons, 2016), that results can be reproduced with the code provided, and that analyses are pre-registered (with peer review, Chambers, 2013; or without). We believe, however, that full methodological transparency represents an initial, attainable minimum standard.
Researchers may hesitate to release their instruments, data and code for a number of reasons (Houtkoop, Chambers, Macleod, Bishop, Nichols & Wagenmakers, 2018), among them the worry that scrutiny will uncover mistakes. As increasingly sophisticated analyses and complex experimental paradigms become more common, such mistakes are unavoidable. A credibility revolution in bilingualism research will require a culture in which mistakes are viewed as inevitable, and practices are designed to collectively mitigate their impact (Rouder, Haaf & Snyder, 2019).