Conceptual motivation
Researchers in bilingualism, as in other social and behavioral sciences, have traditionally brought together findings in individual domains in the form of (narrative) literature reviews. Unfortunately, such an approach introduces a great deal of opacity as well as a number of potential flaws, biases, and limitations at all stages – from the collection of studies to the interpretation of their outcomes to the synthesis of findings across studies on relationships of interest. Such issues are doubtlessly at play in an often-contentious and always-complex field such as bilingualism. Meta-analysis endeavors to overcome many of these limitations by embracing a scientific approach to the process of reviewing existing literature (i.e., one that strives for systematicity, objectivity, and transparency). This paper seeks to provide a formal and conceptual introduction to meta-analysis – a procedure for aggregating findings across multiple studies that address a common question – for the field of bilingualism. For a tutorial on the more practical side of conducting a meta-analysis, see Plonsky and Oswald (2015). We begin by highlighting some of the core attributes and advantages inherent to the meta-analytic approach.
One major benefit afforded by the systematicity and objectivity of meta-analysis is seen in the sample of studies that is synthesized. The meta-analyst endeavors to carry out an exhaustive search for relevant research such that the final sample approximates if not equals the population of studies within the domain of interest. More thorough sampling also allows for greater statistical power, greater generalizability of findings, and a more comprehensive view of accumulated findings within the domain. Traditional reviews, by contrast, are much more idiosyncratic thus allowing for gaps and biases in the corpus of evidence.
The benefits of scientific rigor are also evident in the data collection process. Meta-analysts treat each study as a ‘participant’ who is surveyed using a coding scheme designed to extract all relevant substantive and methodological features as well as study outcomes. Such coding allows for the systematic analysis of numerous and potentially multivariate relationships while also reducing if not eliminating reliance on the fallible memories or note-taking systems of reviewers. It is this feature of meta-analysis that would allow researchers in bilingualism to comprehensively account for unique sample attributes, for example, or for the many unique measures that might be employed for a given construct. The coding process also requires that the meta-analyst produce operational definitions that can be coded for reliably across the sample, thus potentially introducing a level of scientific rigor and transparency to theoretically challenging notions such as heritage learner status, implicit vs. explicit learning, and different levels of proficiency.
A third hallmark of meta-analysis is the use of standardized indices (i.e., effect sizes) to estimate the relationships of interest both overall and as moderated by the substantive and methodological features that are coded. Literature reviews, by contrast, have often relied on tests of statistical inference and the flawed practice of null hypothesis significance testing, which are inherently less precise, less stable, and less informative than effect sizes (e.g., Plonsky, 2015).
Finally, related to the use of effect sizes are the meta-analytic principles of estimation-thinking and synthetic-mindedness (Cumming, 2014). The “synthetic research ethic” (Norris & Ortega, 2006, p. 4), embodied in part by meta-analysis, involves recognizing that no single study can provide a conclusive answer to any question worth asking (Tryon, 2016). Part of doing so involves an understanding of the error that is always present around our results; to ignore such error is both disingenuous and arguably unethical. We urge researchers to consider the implications of these principles not only when reviewing previous literature but throughout the research cycle and in all the roles we fill (e.g., authors, reviewers, editors, researcher trainers) in an effort to more fully advance our scientific understanding of bilingualism.
Brief description of meta-analyses to date in bilingualism
Given the benefits described in the previous section, it is not surprising that researchers in a wide range of fields – from ecology to medicine to education – have turned to meta-analysis as the means par excellence for synthesizing findings across studies (e.g., Cooper & Hedges, 2009; Ioannidis, 2016). Applications of meta-analysis are now common in the applied language sciences as well, such as in second-language acquisition (see Plonsky, 2017) and, in recent years, in the realm of bilingualism.
Table 1 presents an overview of research syntheses and meta-analyses on bilingualism. As shown in the middle column, a range of major topics are represented. The far-right column indicates the overall (meta-analytic) effects from each study. Approximately half of the studies included here have been concerned with aggregating correlations. Peng et al. (2018), for example, extracted and combined observed correlations from 197 studies of the relationship between working memory and reading comprehension, revealing a mean correlation among bilinguals of r = .30. Likewise, based on a sample of 59 unique reports, Jeon and Yamashita (2014) meta-analyzed the relationships between second-language (L2) reading comprehension and a number of related skills, including (a) L2 grammar knowledge (r = .85), (b) L2 vocabulary knowledge (r = .79), and (c) L2 decoding (r = .56).
Notes. *K denotes the number of primary studies in the sample; **d and g both represent standardized mean differences; r = correlation; OR = odds ratio
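Pooling correlations like those reported above is typically done on Fisher z-transformed values, weighting each study by n − 3 (the inverse of the transformed value's sampling variance) before back-transforming the weighted mean. A minimal sketch, using invented correlations and sample sizes rather than values from any study in Table 1:

```python
import math

# Hypothetical per-study (r, n) pairs -- illustrative only,
# not values from any cited meta-analysis.
studies = [(0.25, 120), (0.34, 80), (0.31, 200)]

# Fisher z-transform each r, weight by n - 3, average,
# then back-transform the mean to the r metric.
num = sum((n - 3) * math.atanh(r) for r, n in studies)
den = sum(n - 3 for _, n in studies)
r_bar = math.tanh(num / den)
```

Because the weights are proportional to sample size, the pooled estimate here sits closest to the correlation from the largest study.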
Other meta-analyses in this sample were interested in understanding mean differences between groups, which are generally expressed by a standardized mean difference index such as Cohen's d or Hedges' g. For example, Adesope, Lavin, Thompson, and Ungerleider's (2010) meta-analysis of the cognitive benefits of bilingualism observed on the basis of 63 studies (N = 6,022) that bilinguals outperform monolinguals on cognitive tasks such as problem-solving, on average, by approximately .4 standard deviations (g = .41). We discuss strategies for interpreting effect sizes below.
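For readers less familiar with these indices: d is the difference between two group means divided by their pooled standard deviation, and Hedges' g applies a small-sample correction to d. A short sketch with invented group statistics (not data from Adesope et al.):

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

def hedges_g(d, n1, n2):
    """Hedges' small-sample correction applied to Cohen's d."""
    df = n1 + n2 - 2
    return d * (1 - 3 / (4 * df - 1))

# Hypothetical means/SDs for bilingual vs. monolingual groups of 40 each.
d = cohens_d(105.0, 15.0, 40, 99.0, 15.0, 40)
g = hedges_g(d, 40, 40)
```

With groups of this size the correction is small (g is only slightly below d); it matters most when primary studies have very few participants.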
Finally, in order to present a more inclusive view of the breadth of synthetic techniques, we have also included in Table 1 examples of a ‘scoping review’ (Visonà & Plonsky, 2020), a ‘systematic review’ (Hambly, Wren & McLeod, 2013), and a methodological synthesis (Plonsky, Marsden, Crowther, Gass & Spinner, 2020).
Major stages in meta-analysis
We have thus far presented meta-analysis in purely conceptual and straightforward terms: primary studies are collected and coded to obtain overall effects within a given domain. In reality, as with primary research, numerous choices must be made throughout the meta-analytic process, each of which is likely to influence study outcomes (Boers, Bryfonski, Faez & McKay, in press; Norris & Ortega, 2007; Oswald & Plonsky, 2010). In the section that follows, we briefly outline the major stages and some of the decisions they entail. Our intention is not to provide a tutorial, however. For guidance on how to conduct a meta-analysis, see Cooper (2016) and, in the context of the language sciences, Plonsky and Oswald (2015).
Defining the domain and searching for primary studies
In the first stage of a meta-analysis, researchers outline the domain of research that will be the focus of study and decide on designs and variables of interest. It should be emphasized that meta-analytic results are shaped both by the way constructs have been conceptualized and operationalized in primary studies and by the scope of the domain in question (i.e., broad and inclusive versus narrow and more specific). For instance, Adesope et al.'s (2010) meta-analysis included only those studies that recruited ‘balanced’ bilinguals (i.e., equally well-versed in both languages), while studies with L2 learners (sometimes dubbed ‘sequential bilinguals’) and/or participants with language impairment were not deemed eligible. Branum-Martin et al. (2012) focused exclusively on bilingual children, whereas Hambly et al. (2013) meta-analyzed studies involving both bilingual and multilingual children, both with and without speech sound disorders. Donnelly et al.'s (2015) meta-analysis defined bilingual participants more broadly, including those who attained comparable proficiency levels in both languages and those who used both target languages at least 40% of the time in daily life. Given these unique definitions and operationalizations, it is not surprising that the findings of these reviews differ substantially. We have described the domain here in terms of target populations. However, it is also certainly the case that design features and data collection instruments, among other features, might also be considered in defining the domain of interest.
Having determined the research domain and scope, researchers proceed to the literature search process. As a guiding principle, wider-ranging searches are likely to capture a more comprehensive and thus more precise and more generalizable view of the domain in question. Options abound for conducting such searches, some obvious (library- and web-based databases, references in previous reviews) and others less so (e.g., websites of prominent authors, conference programs, technical reports, direct contact with individual authors) (see Delaney & Tamás, 2018; Plonsky & Brown, 2015).
As candidate reports are examined, an explicit but likely expanding set of eligibility criteria must be applied to determine which studies will be included. It is critical to document this stage of the process and, ideally, to involve multiple reviewers in the decisions of which studies to include. By doing so, the team both reduces false negatives and allows for additional transparency in the form of agreement rates on study selection (Stoll et al., 2019). For example, Adesope et al.'s (2010) search yielded an initial pool of 5,185 articles. After excluding duplicate and ineligible articles based on abstract readings, 157 articles were retained, and inter-coder reliability reached a Cohen's Kappa of .88. The final round of screening involved reading the full texts of those 157 articles, and 39 articles representing a total of 63 studies were then included in the meta-analysis (Cohen's Kappa = .92). Branum-Martin et al. (2012) searched both English and Chinese databases and ended up with a sample of 38 primary studies that met the inclusion criteria, two of which were unpublished dissertations and three of which were articles published in Chinese; however, no information was provided on inter-coder reliability during the search process or on the possible presence of publication bias(es).
As shown in Table 1, meta-analytic samples (denoted as K) vary widely. There is no strict minimum number of studies to include. However, as with primary research, larger samples (K > 20) are preferred because they are more likely to yield stable estimates. In sum, in addition to being comprehensive, the literature search strategy must be concisely yet transparently summarized in the write-up.
Data collection (coding)
The second major stage involves developing a coding scheme that allows for key attributes of primary studies to be documented along with their corresponding effect sizes. The features to be coded depend on the research domain and research questions but can be broadly classified as (a) study descriptors (e.g., author(s), title, and other identifiers; characteristics of the sample; aspects of the research design; measures and instrumentation; features associated with methodological quality and transparency) and (b) study outcomes (i.e., effect sizes such as correlation coefficients, Cohen's d values, and odds ratios). The coding sheet must be based on a solid understanding of the substantive domain including pertinent variables and methodological practices. It is also necessary to pilot the instrument and to modify it based on the emerging characteristics across primary studies.
Furthermore, it is advisable to recruit and train one or more additional coders to increase the accuracy of coding, with the help of a coding manual (e.g., Marsden et al., 2018; Melby-Lervåg & Lervåg, 2014). When doing so, the researchers should calculate and report an estimate of inter-coder agreement (e.g., Cohen's Kappa, or ĸ) overall and for each category in the coding sheet (see Norouzian, in press). In Lehtonen et al. (2018), for example, ĸ ranged between .83 and 1.00; in Peng et al. (2018), inter-coder reliability ranged from .95 to .98. Of note, if the sample of primary studies is large, researchers may opt to double-code only a subset of studies. For example, in Hambly et al. (2013), 14 of 66 studies underwent double coding, and the overall inter-coder reliability was 86%; Melby-Lervåg and Lervåg (2011) double-coded all studies in the sample, but calculated inter-coder reliability for only 30% of the sample; Plonsky et al. (2020) double-coded 15% of their sample of 302 studies, which exceeds the often-recommended minimum of 20 studies (Lipsey & Wilson, 2001).
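To illustrate the agreement statistic itself: Cohen's kappa adjusts raw percent agreement for the level of agreement two coders would reach by chance, given their marginal coding frequencies. A minimal sketch with invented codings of ten studies for a binary feature:

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Cohen's kappa for two coders' categorical judgments on the same items."""
    n = len(coder1)
    # Observed proportion of exact agreements.
    p_obs = sum(a == b for a, b in zip(coder1, coder2)) / n
    # Chance agreement: sum over categories of the product of marginals.
    c1, c2 = Counter(coder1), Counter(coder2)
    p_exp = sum(c1[k] * c2.get(k, 0) for k in c1) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical codings (e.g., presence of a design feature).
a = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
b = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes"]
kappa = cohens_kappa(a, b)
```

Here the coders agree on 8 of 10 items (80%), but kappa is noticeably lower (about .58) because much of that agreement could arise by chance, which is exactly why kappa is preferred over raw percent agreement.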
To promote transparency and accuracy in research reporting, researchers are encouraged to make their codebook available as an appendix and/or online (e.g., on IRIS [iris-database] as in Marsden et al., 2018) and to carefully explain their coding strategies as well as difficulties encountered along the way. For instance, participants’ language proficiency is often reported idiosyncratically across studies. A meta-analyst might, therefore, prepare a set of decision rules to allow for more consistent and transparent coding of this variable. Missing data almost invariably come into play as well. The meta-analyst must decide in such cases whether primary studies with unreported features will be excluded (e.g., Adesope et al., 2010) or whether missing data will be imputed or, more likely, requested from primary authors, a strategy employed by a number of meta-analyses in the field of bilingualism (Lauro & Schwartz, 2017; Lehtonen et al., 2018; Melby-Lervåg & Lervåg, 2014; Mukadam et al., 2017; Peng et al., 2018). We encourage authors who do so to provide the response rate for the sake of greater transparency (see, e.g., Nicklin & Plonsky, 2020). Whether or not the missing data are provided, in such a situation, meta-analysts might find themselves feeling constrained by the shortcomings or even confounds in the designs and reporting practices of primary research.
To conclude, the quality of the instrument and the accuracy of the coded data exert a substantial influence on meta-analytic results. Therefore, it is vital that the coding scheme includes variables and values pertinent to the research questions posed and that concurrent double coding is performed in a consistent and reliable fashion.
Analysis
After all effect sizes have been compiled or calculated, the meta-analysis proper (i.e., the aggregation of primary effects) can take place. In theory, this process is fairly simple: the synthesist calculates the average of the effect sizes found in the sample and its corresponding variance. In practice, however, a number of decisions must be made concerning, for example, whether and how to account for data dependencies that arise when a single study includes multiple groups/conditions, measures, and/or testing points. It is also common to weight study effects by sample size (e.g., Li, 2010) or by inverse variance (Qureshi, 2016) such that those with less sampling error contribute more to the meta-analytic mean. Corrections for statistical artifacts such as measurement error (reliability) and range restriction can also be applied.
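Inverse-variance weighting can be illustrated concretely. In the sketch below (the effect sizes and variances are invented), each study's weight is the reciprocal of its sampling variance, so the most precise study dominates the pooled mean:

```python
# Hypothetical effect sizes (d) paired with their sampling variances.
effects = [(0.30, 0.02), (0.55, 0.05), (0.42, 0.01)]

# Weight each study by the inverse of its variance: studies with less
# sampling error (typically larger n) contribute more to the mean.
weights = [1 / v for _, v in effects]
mean_d = sum(w * d for (d, _), w in zip(effects, weights)) / sum(weights)

# Standard error of the weighted mean, for a confidence interval.
se = (1 / sum(weights)) ** 0.5
```

A 95% confidence interval around the pooled estimate is then simply mean_d ± 1.96 × se.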
Related to effect size weighting is the choice between fixed and random effects models. The fixed effects model assumes that the studies included in the meta-analysis are sampled from populations which share one fixed or ‘true’ effect size. Any deviations from that value are therefore assumed to be due to sampling error alone. By contrast, the random effects model allows for the presence of systematic variability in observed effects due to moderators. A full discussion of these models is outside the scope of this paper. We argue, however, that a random effects model is likely more appropriate for bilingualism researchers due to the complexities of language learning, usage, and so forth (Oswald & Plonsky, 2010). Research has also indicated that real-world data in the social sciences are likely to have variable population parameters, making the random effects model preferred (Field & Gillett, 2010).
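The practical difference between the two models can be sketched with the DerSimonian–Laird estimator, one common way of estimating the between-study variance (tau²) that random-effects weighting adds to each study's own sampling variance. The effects and variances below are invented for illustration:

```python
# Hypothetical (effect size, sampling variance) pairs.
effects = [(0.10, 0.02), (0.70, 0.05), (0.45, 0.01)]
w = [1 / v for _, v in effects]

# Fixed-effect (inverse-variance weighted) mean.
fixed_mean = sum(wi * d for (d, _), wi in zip(effects, w)) / sum(w)

# DerSimonian-Laird estimate of between-study variance tau^2,
# based on the heterogeneity statistic Q.
q = sum(wi * (d - fixed_mean) ** 2 for (d, _), wi in zip(effects, w))
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(effects) - 1)) / c)

# Random-effects weights add tau^2 to each study's variance,
# which pulls the weights closer together.
w_re = [1 / (v + tau2) for _, v in effects]
re_mean = sum(wi * d for (d, _), wi in zip(effects, w_re)) / sum(w_re)
```

Because adding tau² flattens the weights, the random-effects mean sits closer to the simple unweighted average than the fixed-effect mean does, and its confidence interval is wider, reflecting the extra between-study uncertainty.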
Regardless of any weighting procedures that are applied, the ‘grand mean’ meta-analysis produces an estimate of an overall relationship – typically a difference between conditions or a correlation between variables – several examples of which are found in Table 1. However, we are often just as or even more interested in the variability in effects around that mean. In the next step, moderator analysis, the meta-analyst examines substantive and methodological features in relation to (i.e., as predictors of) study outcomes. Consider Qureshi's (2016) meta-analysis of age effects, for example. The overall difference of d = .46 between early and late bilinguals was strongly moderated by whether the participants were living in a second (d = .68) vs. foreign language (d = -.09) environment.
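In its simplest form, a moderator (subgroup) analysis computes a weighted mean effect within each level of a coded feature. The sketch below is loosely patterned on the second- vs. foreign-language contrast just described, but the numbers are invented, not Qureshi's data:

```python
# Hypothetical coded effects: (d, sampling variance, moderator level).
effects = [
    (0.70, 0.04, "second language"),
    (0.65, 0.05, "second language"),
    (-0.05, 0.03, "foreign language"),
    (-0.12, 0.06, "foreign language"),
]

def subgroup_mean(rows):
    """Inverse-variance weighted mean effect for one moderator level."""
    w = [1 / v for _, v, _ in rows]
    return sum(wi * d for (d, _, _), wi in zip(rows, w)) / sum(w)

# Pool effects separately within each level of the coded moderator.
by_level = {}
for level in {m for _, _, m in effects}:
    by_level[level] = subgroup_mean([e for e in effects if e[2] == level])
```

In a full analysis one would also test whether the subgroup means differ by more than sampling error would predict (e.g., via a Q-between test or meta-regression), rather than simply comparing the two pooled values.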
Finally, inherent to many domains is the potential for bias in available effects. Publication bias, also referred to as the ‘file-drawer problem’, often occurs because studies with statistically significant results are more likely to be published (Cooper, 2016; Field & Gillett, 2010). When such a bias is present, the sample of observed effects is likely to overestimate the population effect. Several strategies can be applied to pre-emptively minimize, detect, and correct for such bias. These include more thorough searches to obtain unpublished and ‘gray’ literature as well as diagnostic tools such as funnel plots, as seen in Lehtonen et al. (2018) and Melby-Lervåg and Lervåg (2011, 2014). A wide range of analytical and statistical tools are also available, such as comparing effect sizes from published and unpublished studies (e.g., Avery & Marsden, 2019), calculating the fail-safe N statistic (Grundy & Timmer, 2017), p-curve analysis (Mahowald et al., 2016), and others (see Rothstein, Sutton & Borenstein, 2005).
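As one example of these diagnostics, Rosenthal's fail-safe N estimates how many unpublished null-result studies sitting in file drawers would be needed to render a combined result non-significant. A minimal sketch with invented per-study z statistics:

```python
def failsafe_n(z_values, z_crit=1.645):
    """Rosenthal's fail-safe N: the number of additional studies
    averaging z = 0 needed to push the combined one-tailed result
    past the significance threshold (z_crit = 1.645 for p = .05)."""
    k = len(z_values)
    return sum(z_values) ** 2 / z_crit**2 - k

# Hypothetical z statistics from five primary studies.
zs = [2.1, 1.8, 2.5, 1.2, 2.9]
n_fs = failsafe_n(zs)
```

A large fail-safe N relative to the sample size is usually read as reassurance that the result is robust to the file-drawer problem, though the statistic has well-known limitations and is best used alongside funnel plots and other tools.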
Interpretation
Parallel to a primary study, the final stage in meta-analysis involves interpreting outcomes. Here, too, a number of considerations come into play. Making sense of the effect sizes resulting from overall and moderator analyses can feel somewhat subjective. As a starting point, meta-analysts in bilingualism might consider existing benchmarks. Plonsky and Oswald (2014) generated a distribution of observed d and r values in L2 research based on a sample of 91 meta-analyses and 346 primary studies. Similarly, as part of a methodological synthesis of the use of multiple regression, Plonsky and Ghanbar (2018) aggregated R² values from a sample of 541 regression analyses found in 171 published reports. Both studies then proposed tentative but field-specific benchmarks for interpreting the different effect sizes of interest, as shown in Table 2.
We want to emphasize that such benchmarks are nothing more than a starting point for gauging the magnitude of effects within the field. There are a number of additional factors that should also be taken into consideration when interpreting meta-analytic effects. These include, for example, theoretical and/or practical significance (e.g., implications for health or educational policy), comparable domains, change over time in the domain's theoretical development and/or methodological practices, attenuation due to statistical artifacts (e.g., measurement error, range restriction), publication bias(es), ceiling effects, and over/under-sampling among certain populations (see related discussions in Avery & Marsden, 2019; Brysbaert, 2019; Plonsky & Oswald, 2014).
Conclusion
We have sought in this paper both to raise awareness of the potential of meta-analysis and to lay out some of the many decision points that meta-analysts necessarily encounter. In doing so, we have argued that meta-analysis represents a powerful approach for synthesizing findings across primary studies, one that addresses many of the challenges facing more traditional reviews. It is for these and other reasons that we anticipate applications of meta-analysis will continue to increase in tandem with the continued expansion and accumulation of findings in the field of bilingualism.