1. Introduction
Studying the timecourse of comprehension is a central goal in bilingual processing research, which has been significantly fostered by the use of time-sensitive methods such as self-paced reading, eye-tracking, and event-related potentials. The importance of timing is highlighted by findings showing that comprehension is often slower in a non-native than in a native language, in both lexical and sentence domains. For example, compared to monolinguals, even highly proficient bilinguals show slower lexical access (Duyck, Vanderelst, Desmet & Hartsuiker, 2008; Gollan, Slattery, Goldenberg, Van Assche, Duyck & Rayner, 2011; Lehtonen, Hultén, Rodríguez-Fornells, Cunillera, Tuomainen & Laine, 2012; Lemhöfer, Spalek & Schriefers, 2008; Ransdell & Fischler, 1987). Similarly, sentence processing studies often find that when a word violates a grammatical constraint or a previously established parse, monolinguals display processing disruptions soon after the violation, while disruptions in bilinguals are often delayed (Boxell & Felser, 2017; Felser & Cunnings, 2012; Grüter, Lew-Williams & Fernald, 2012; Hopp, 2017; Steinhauer, White & Drury, 2009; White, Genesee & Steinhauer, 2012).
Despite the rich data generated by current methods, our inferences about L1-L2 temporal asymmetries are often indirect: standard analyses can show that differences between native and non-native processing affect different sentence regions (in self-paced reading), different temporal windows (in event-related potentials), or different reading measures (in eye-tracking). It would be preferable to establish the precise timepoint at which an effect onsets, so that timing differences between speaker groups or between experimental manipulations can be compared directly. This article summarizes several techniques for achieving this goal. Such information is relevant to testing a variety of L2 accounts. For example, some accounts propose that L1-L2 processing differences concern the relative timing of grammatical versus non-grammatical information (Clahsen & Felser, 2018). Meanwhile, capacity-based accounts link timing delays to differential proficiency, lexical access speed, and working memory (Dekydtspotter & Renaud, 2014; Hopp, 2013; McDonald, 2006). Establishing numeric divergence points rather than dichotomous contrasts (effect present/absent) would allow testing of whether timing delays are predicted by such variables.
To encourage the use of divergence point analyses, we provide a practical introduction using an L1-L2 visual world eye-tracking dataset. The data and a step-by-step R analysis tutorial are available at https://osf.io/exbmk/. Finally, we note that divergence point analyses differ from another set of techniques which examine timeseries data by modeling the shape (i.e., functional form) of change across time (e.g., Mirman, 2017; Porretta, Kyröläinen, van Rij & Järvikivi, 2018). When characterizing timeseries data, both types of techniques are useful and provide complementary information.
2. A practical example
Our L1-L2 dataset belongs to a visual world experiment examining the use of syntactic gender information to make noun predictions. The visual world paradigm involves tracking eye movements to objects on a computer screen while participants hear a sentence, with the assumption that there is a close link between eye movements and language processes (Huettig, Rommers & Meyer, 2011). The visual world paradigm is thus particularly useful in L1-L2 timecourse research because it measures how language processing unfolds over time.
We tested a group of L1 German speakers and two groups of intermediate-to-advanced L2 German speakers, whose L1 was either Spanish or English (for demographic details see Appendix S1, Supplementary Materials). Participants saw four objects on a computer display and heard a German instruction to click on one of the objects as quickly as possible, e.g., Click on the blue button (Figure 1A). The determiner and adjective in the instruction agreed in gender and color with only one of the objects (henceforth the “target”), allowing participants to identify it prior to its pronunciation; namely, at the adjective (Hopp & Lemmerth, 2018; Lemmerth & Hopp, 2019). The properties of the other objects were manipulated such that they matched the target only in color (“color competitor”), only in gender (“gender competitor”), or neither (“distractor”).
The critical time window for assessing gender predictions was from the onset of the adjective to 200 ms after the onset of the noun, to account for the time taken to program and launch an eye movement (Hallett, 1986; Salverda, Kleinschmidt & Tanenhaus, 2014). As Figure 1B shows, fixations before the adjective were distributed similarly between the four objects. At the adjective, fixations to the target and the color competitor increased, while looks to the gender competitor and distractor abruptly decayed. Given this pattern, we focused on the divergence between the target and color competitor (henceforth, “competitor”). As both objects match the color of the adjective, but only the target has the appropriate gender, any target-over-competitor advantage should reflect the predictive use of gender.
We used a divergence point analysis to establish how soon after the adjective a target-over-competitor advantage appeared (i.e., a predictive effect), and whether this divergence occurred later in L2 than L1 speakers, consistent with previous findings (Dussias, Valdés Kroff, Guzzardo Tamargo & Gerfen, 2013; Grüter et al., 2012; Hopp, 2013; Lew-Williams & Fernald, 2010). We also wanted to determine whether gender predictions were modulated by participants’ native language. If so, Spanish speakers may benefit from the rich morphosyntactic gender agreement of their L1 and show faster predictions than English speakers, whose L1 lacks syntactic gender agreement.
3. Divergence point analyses: an intuitive approach
One possible approach to determine the divergence point between looks to the target vs. competitor is to statistically compare the difference in fixation proportions at each timepoint and find the earliest significant test statistic. To illustrate this, we use our data where eye positions were sampled at 50 Hz, i.e., every 20 ms. At each sampled timepoint, we fit a generalized linear mixed-effects model with a binomial distribution (i.e., a logistic GLMM; Barr, 2008) to compare the proportion of fixations to the target vs. competitor. The divergence point was defined as the earliest point with a significant positive estimate (Figure 2).
Although this approach is intuitive, it involves as many statistical comparisons as there are timepoints and thus runs a risk of false positives (Type 1 error). For example, at an alpha of 0.05, the probability of a single test delivering a false positive is 5%. But with 45 timepoints in our window of interest, this probability rises to 90% (1 - 0.95^45 ≈ 0.90). The combined probability of a false positive over an entire set of tests is known as the family-wise error rate (FWER; Hochberg & Tamhane, 1987).
A common way to control for multiple comparisons is the Bonferroni correction, which lowers the alpha-level by dividing the desired alpha by the number of tests (Bonferroni, 1936). Thus, the alpha-level for 45 tests becomes 0.05/45 ≈ 0.0011, and the FWER at this adjusted alpha is around 5%. The downside of the Bonferroni correction is that lowering alpha necessarily decreases statistical power, because it becomes more difficult for an effect (true or otherwise) to reach the significance threshold. Thus, the larger the number of tests, the lower the power to detect a true effect.
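The FWER arithmetic and the Bonferroni adjustment can be checked directly. The accompanying tutorial is written in R; the sketch below merely restates the calculation in Python for illustration:

```python
def fwer(alpha: float, n_tests: int) -> float:
    """Family-wise error rate for n independent tests:
    P(at least one false positive) = 1 - (1 - alpha)^n."""
    return 1 - (1 - alpha) ** n_tests

alpha, n = 0.05, 45
print(round(fwer(alpha, n), 2))             # 0.9: ~90% chance of a false positive
bonferroni_alpha = alpha / n                # 0.05 / 45 ~ 0.0011
print(round(fwer(bonferroni_alpha, n), 3))  # 0.049: FWER back near 5%
```

Note that this formula assumes independent tests; with autocorrelated timepoints the exact FWER differs, but the qualitative problem of multiple comparisons remains.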
A second type of correction that preserves power is false discovery rate (FDR) control (Benjamini & Hochberg, 1995). Instead of correcting the alpha level, FDR control restricts the proportion of false discoveries among the significant results. To apply FDR control, we take the p-values from the 45 tests and sort them from smallest to largest. A critical value for each p-value is then calculated via a suitable method (e.g., some methods account for autocorrelated data, others for data with many significant results; Benjamini, Krieger & Yekutieli, 2006; Benjamini & Yekutieli, 2001). The largest p-value below its critical value is then chosen as the new significance cut-off for the original p-values.
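The classic Benjamini–Hochberg step-up procedure can be sketched as follows (a Python illustration with hypothetical p-values; the tutorial itself applies FDR control in R):

```python
def bh_cutoff(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: sort the m p-values and find
    the largest p(i) satisfying p(i) <= (i / m) * q; that p-value becomes
    the significance cut-off applied to the original tests."""
    m = len(p_values)
    cutoff = 0.0
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= (i / m) * q:     # (i / m) * q is the critical value for p(i)
            cutoff = p
    return cutoff

# Hypothetical p-values from per-timepoint tests:
ps = [0.001, 0.004, 0.010, 0.030, 0.040, 0.200, 0.700]
cut = bh_cutoff(ps)                        # 0.010
significant = [p for p in ps if p <= cut]  # the three smallest p-values
```

For comparison, a plain Bonferroni correction at 0.05/7 ≈ 0.007 would keep only the two smallest p-values, illustrating the power advantage of FDR control.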
Both the Bonferroni correction and FDR control can be easily implemented (code §2). Figure 2 shows the corrected and uncorrected divergence point estimates for our data. As expected, corrected estimates are always later than uncorrected estimates, suggesting that the earliest uncorrected estimates may have been false positives. The higher power of FDR control over Bonferroni is visible in the German and English groups; whereas in the Spanish group, where the difference in fixation proportions arises more abruptly, both corrections yield similar results.
While the corrections address the FWER, an additional issue in visual world data is autocorrelation. Autocorrelation occurs because modern eye-trackers can record eye fixations at high frequencies (e.g., once per millisecond), but planning and executing an eye movement takes around 200 milliseconds. Thus, neighboring datapoints often reflect the same stage of cognitive processing and so are strongly correlated. Because parametric tests assume independent observations, applying them at multiple timepoints underestimates the variance in the data and can inflate the Type 1 error rate. Importantly, grouping observations into larger bins can reduce, but not eliminate, autocorrelation (Mirman, 2014:18; Figure 3, code §3).
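The effect of binning can be demonstrated with a simulated autocorrelated series (a Python sketch using synthetic AR(1) data, not our experimental data): averaging samples into bins lowers the lag-1 autocorrelation but leaves it well above zero.

```python
import random

def lag1_autocorr(xs):
    """Lag-1 autocorrelation of a series."""
    n = len(xs)
    m = sum(xs) / n
    num = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1))
    return num / sum((x - m) ** 2 for x in xs)

def bin_means(xs, width):
    """Average consecutive samples into non-overlapping bins
    (assumes len(xs) is divisible by width)."""
    return [sum(xs[i:i + width]) / width for i in range(0, len(xs), width)]

# Simulate a strongly autocorrelated (AR(1)) signal standing in for
# high-frequency fixation data; these are purely synthetic numbers.
random.seed(1)
x, series = 0.0, []
for _ in range(5000):
    x = 0.9 * x + random.gauss(0, 1)
    series.append(x)

raw = lag1_autocorr(series)                    # close to the generating value, 0.9
binned = lag1_autocorr(bin_means(series, 10))
# `binned` is clearly smaller than `raw`, but still well above zero:
# binning reduces autocorrelation without removing it.
```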
A second dependency issue in our data is the contingency of fixations to the target and competitor, since a participant cannot simultaneously look at both objects. For the same reasons as autocorrelation, this can inflate the Type 1 error rate. Finally, the approach above does not estimate the temporal uncertainty around a divergence point because the latter is based on a single statistical test. While we can estimate the 95% confidence interval of a test coefficient, this reflects uncertainty about the magnitude of the target vs. competitor difference, rather than the temporal location of the divergence point. In order to statistically compare the onset of predictions between groups, a measure of temporal variability is necessary. With this goal, we turn to non-parametric resampling approaches.
4. Non-parametric approaches
The corrected comparisons above allow us to estimate a divergence point indexing the onset of predictive looks. But how certain are we about this estimate? Because we only conducted our procedure once, we cannot be sure that a similar divergence point would be found in a different sample. Non-parametric approaches such as bootstrapping and cluster permutation can answer this question by resampling or permuting existing data to generate “new” datasets and examining the distributions of their test statistics. Conveniently, they also control for FWER and autocorrelation (Groppe, Urbach & Kutas, 2011; Maris & Oostenveld, 2007; Reingold & Sheridan, 2014).
4.1 Cluster permutation tests
Cluster permutation identifies temporal “clusters” in which two experimental conditions differ (Barr, Jackson & Phillips, 2014). In our dataset, these clusters would represent time windows in which looks to the target and competitor differed significantly. In a permutation test, condition labels (e.g., target/competitor) are randomly reassigned multiple times in order to break the association between labels and data, generating a distribution of test results consistent with the null hypothesis. The significance of the test statistic from the original dataset is then based on its relative position in this permutation-derived null distribution. FWER is controlled by reducing the number of statistical comparisons to one. Autocorrelation is also controlled, because the temporal structure of the data is preserved during permutation. Thus, the effect of autocorrelation is constant across permutations and the only factor affecting the variance of the permutation distribution is the reassignment of condition labels.
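A minimal sketch of the cluster-mass logic, in Python with simulated difference curves (the threshold, cluster statistic, and data are illustrative choices, not those of any published implementation). For within-participant target/competitor comparisons, randomly flipping the sign of a participant's whole difference curve is equivalent to swapping that participant's condition labels while preserving the curve's temporal structure:

```python
import random
from statistics import mean, stdev

def t_stats(diffs):
    """One-sample t-statistic at each timepoint, computed over participants.
    diffs: one target-minus-competitor difference curve per participant."""
    n = len(diffs)
    return [mean(col) / (stdev(col) / n ** 0.5)
            for col in (list(c) for c in zip(*diffs))]

def max_cluster_mass(ts, threshold=2.0):
    """Largest summed |t| over a run of consecutive supra-threshold timepoints."""
    best = run = 0.0
    for t in ts:
        run = run + abs(t) if abs(t) > threshold else 0.0
        best = max(best, run)
    return best

def cluster_permutation_p(diffs, n_perm=1000, threshold=2.0, seed=0):
    """Build a null distribution of the maximum cluster mass by randomly
    flipping the sign of whole participant curves, then locate the
    observed statistic within that distribution."""
    observed = max_cluster_mass(t_stats(diffs), threshold)
    rng = random.Random(seed)
    exceed = sum(
        max_cluster_mass(t_stats(
            [d if rng.random() < 0.5 else [-v for v in d] for d in diffs]
        ), threshold) >= observed
        for _ in range(n_perm)
    )
    return (exceed + 1) / (n_perm + 1)   # permutation p-value

# Simulated curves for 24 participants: no effect for the first 20
# timepoints, then a target advantage of 0.8 standard deviations.
rng = random.Random(42)
diffs = [[rng.gauss(0.0 if t < 20 else 0.8, 1.0) for t in range(40)]
         for _ in range(24)]
p = cluster_permutation_p(diffs)   # small p: a reliable cluster exists
```

Note that, as the next paragraph explains, a significant result here tells us only that some cluster exists, not when the effect began.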
However, one disadvantage of cluster-based permutation is that significant clusters do not indicate when an effect arose or its temporal variability, but only that there was a window in which the effect was significant (Maris & Oostenveld, 2007; Sassenhagen & Draschkow, 2019). Since our research question concerns the onset of predictive looks, below we demonstrate an approach that can address this question while preserving the advantages of non-parametric methods.
4.2 Bootstrapping
The goal of bootstrapping is to estimate what the distribution of statistical test results would be if we repeated our experiment many times. To do this, an existing dataset is resampled multiple times to generate “new” datasets and a statistical test is applied after each resample. Below we use a non-parametric bootstrap, which makes no assumptions about the population distribution underlying the data, meaning that it can be used for non-normally distributed data (Hesterberg, 2002; Maris & Oostenveld, 2007). The bootstrapping technique has previously been applied to reading eye-tracking and event-related potentials (Reingold & Sheridan, 2014; Schad, Risse, Slattery & Rayner, 2014; Sheridan & Reingold, 2012; Wasserman & Bockenholt, 1989), and its results have been shown to be comparable to those of permutation tests (Rosenfeld & Donchin, 2015). A bootstrapping approach for visual world data is presented in Seedorff, Oleson and McMurray (2018), although it answers a different research question from the one of interest here.
The steps in our approach are as follows. First, for each speaker group, we extract data where either the target or competitor was fixated. To identify a divergence point between fixations, we apply an uncorrected statistical test at each timepoint, aggregating over items (code §4). Here we use a one-sample t-test on fixation proportions because it is conceptually straightforward and convenient in terms of convergence and computational time. T-tests are often used in non-parametric methods (e.g., Efron & Tibshirani, 1986; Groppe et al., 2011; Hesterberg, 2015; Maris & Oostenveld, 2007; Reingold & Sheridan, 2014). For data that do not include a large number of extreme values (e.g., clustered close to 0% or 100%), a t-test reasonably approximates the results of a logistic model, which would be a more appropriate choice given the binary nature of our data. However, fitting multiple logistic models with the appropriate random effects structure comes at the expense of increased complexity and computation time. For a comparison between different tests see Appendix S2 (Supplementary Materials).
To establish a divergence point, we take the first timepoint in a run of at least 10 consecutive timepoints with significant t-values. A run of 10 is used because we are interested in the beginning of sustained looks to the target (in our case, at least 200 ms given the 50 Hz sampling rate). Researchers should choose their own threshold depending on their research question and experimental design.
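The run-based onset criterion can be expressed in a few lines (a Python sketch; the run length of 10 and the example data are illustrative, and the tutorial implements this step in R):

```python
def divergence_point(significant, run_length=10):
    """Index of the first timepoint that begins a run of at least
    `run_length` consecutive significant tests, or None if no such run."""
    count = 0
    for i, sig in enumerate(significant):
        count = count + 1 if sig else 0
        if count == run_length:
            return i - run_length + 1
    return None

# Hypothetical per-timepoint test results (True = significant t-test):
sig = [False] * 5 + [True] * 3 + [False] * 2 + [True] * 12
onset = divergence_point(sig)   # 10: the early run of 3 is too short to count
# At 50 Hz (one sample per 20 ms), index 10 corresponds to 200 ms
# after the start of the analysis window.
```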
Next, we use a non-parametric bootstrap to generate “new” datasets by resampling the original dataset with replacement. The resampling is stratified by participant, timepoint, and object type (target/competitor), meaning that data are resampled within these categories. A new divergence point is estimated after each resample. With sufficient resampling (1000–2000 times; Efron & Tibshirani, 1993), a distribution of divergence points is generated whose mean is taken as the overall divergence point (Figure 4A). Variability around the mean of the bootstrap distribution can be quantified with a confidence interval (CI), calculated via a method suited to the properties of the bootstrap distribution and to computation time (Carpenter & Bithell, 2000; DiCiccio & Efron, 1996).
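A simplified Python sketch of the bootstrap follows. It resamples whole participants rather than using the stratified scheme described above, uses a percentile CI, and runs on simulated data; it is meant only to convey the logic of the procedure:

```python
import random
from statistics import mean, stdev

def estimate_dp(curves, run=10, crit=2.0):
    """First timepoint starting `run` consecutive timepoints whose
    one-sample t-statistic (over participants) exceeds `crit`."""
    n = len(curves)
    count = 0
    for t in range(len(curves[0])):
        vals = [c[t] for c in curves]
        tval = mean(vals) / (stdev(vals) / n ** 0.5)
        count = count + 1 if tval > crit else 0
        if count == run:
            return t - run + 1
    return None

def bootstrap_dp(curves, n_boot=1000, step_ms=20, seed=0):
    """Resample participants with replacement, re-estimate the divergence
    point each time, and summarise the resulting distribution with its
    mean and a 95% percentile CI (in ms)."""
    rng = random.Random(seed)
    dps = []
    for _ in range(n_boot):
        sample = [rng.choice(curves) for _ in curves]
        dp = estimate_dp(sample)
        if dp is not None:   # resamples with no sustained run are skipped
            dps.append(dp * step_ms)
    dps.sort()
    lo = dps[int(0.025 * len(dps))]
    hi = dps[int(0.975 * len(dps)) - 1]
    return mean(dps), (lo, hi)

# Simulated difference curves for 30 participants: a 0.8 SD target
# advantage emerges at timepoint 25 (i.e., 500 ms at 50 Hz).
rng = random.Random(1)
curves = [[rng.gauss(0.0 if t < 25 else 0.8, 1.0) for t in range(60)]
          for _ in range(30)]
dp_mean, ci = bootstrap_dp(curves)   # bootstrap mean and 95% CI in ms
```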
Bootstrapped means and CIs for each group are plotted in Figure 4. To compare between groups, we can bootstrap the difference between their divergence points. The result is a distribution of differences (Figure 5; code §4.4). The mean difference in divergence points between the L1 and L2 groups is 244 ms, 95% CI = [160, 340] ms. The CI does not contain zero and thus supports a reliable difference. Between the Spanish and English groups, the mean difference is 40 ms, 95% CI = [−40, 100] ms, consistent with a slightly earlier divergence point in the English group. However, the CI of the between-group difference contains zero and thus fails to support a difference. If desired, p-values can also be computed (see Appendix S4, Supplementary Materials).
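Comparing groups then amounts to taking differences between their bootstrap distributions. The Python sketch below uses invented bootstrap distributions (the locations and spreads are hypothetical numbers, not our estimates) purely to show the mechanics:

```python
import random
from statistics import mean

def dp_difference(dps_a, dps_b):
    """Pair two bootstrap distributions of divergence points resample-by-
    resample, take differences, and summarise with a mean and a 95%
    percentile CI."""
    diffs = sorted(b - a for a, b in zip(dps_a, dps_b))
    lo = diffs[int(0.025 * len(diffs))]
    hi = diffs[int(0.975 * len(diffs)) - 1]
    return mean(diffs), (lo, hi)

# Hypothetical bootstrap distributions of divergence points (ms) for an
# L1 and an L2 group:
rng = random.Random(2)
l1 = [rng.gauss(400, 30) for _ in range(2000)]
l2 = [rng.gauss(640, 40) for _ in range(2000)]
diff_mean, ci = dp_difference(l1, l2)
# A CI that excludes zero supports a reliable between-group timing delay;
# a CI containing zero fails to support a difference.
```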
The results show that L2 speakers are slower than native speakers to start looking preferentially at the target, consistent with a predictive advantage in the native speaker group. Further, both L2 groups show mean divergence points after the appearance of the noun, which is not consistent with a predictive use of gender. The lack of evidence for an earlier onset in Spanish vs. English speakers does not support the claim that having a gendered native language enhances its predictive use in a foreign language. Instead, it is consistent with a general delay due to L2 status. Note that studies relying on time-window analyses would have reached a similar conclusion by showing significant effects in earlier time windows for L1 than L2 speakers (e.g., by stating that an effect is significant in one group but absent in another). The critical contribution of the bootstrapping method is that it precisely quantifies the delay in the L2 speakers, while allowing a direct between-group comparison of divergence points and estimating their uncertainty.
4.3 Advantages and disadvantages of the bootstrapping approach
Above we demonstrated that resampling approaches can control FWER and autocorrelation in timeseries analyses. The main advantage of our bootstrapping approach is that it quantifies divergence points and their temporal uncertainty, enabling statistical comparisons between participant groups and/or experimental conditions. However, one disadvantage of the approach is that it does not estimate the duration of an effect or the presence of multiple divergences, although it could be extended to do so. Second, our approach – and onset detection approaches in general – may not be appropriate for analyses where the research question concerns whether an effect is present at all (Seedorff et al., 2018). Our approach assumes that an effect is present and that the task is simply to detect its onset.
Furthermore, resampling approaches like bootstrapping can describe a dataset but are not generative models. Generative models provide explicit assumptions to connect data with cognitive processes of interest, allowing researchers to examine the parameters that best explain the data and to compare the goodness of fit of different models (Vandekerckhove, Matzke & Wagenmakers, 2015). Two generative approaches that allow divergence point estimation are generalized additive mixed-effects models (GAMMs; Miwa & Baayen, 2020; van Rij, 2015; van Rij, Vaci, Wurm & Feldman, 2020) and Bootstrapped Differences of Timeseries (BDOTS; Seedorff et al., 2018). GAMMs are regression models that estimate non-linear patterns from timecourse data (Appendix S3, Supplementary Materials). BDOTS fits 4-parameter logistic and double Gaussian functions to individual fixation curves, which are then bootstrapped to estimate the standard error of mean fixations at each timepoint in the series. The onset of a divergence in fixations between conditions is then established via t-tests and a Bonferroni correction modified to account for autocorrelation.
The downside of models such as GAMMs and BDOTS is that they do not provide a measure of variability around their divergence point estimates, which is needed for statistical comparison (Table 1). Furthermore, while GAMMs can estimate a within-condition divergence point, BDOTS can only estimate a divergence point between conditions.
Note. Bootstrapping refers to our proposed approach; BDOTS stands for Bootstrapped Differences of Timeseries (Seedorff et al., 2018) and GAMMs for Generalized Additive Mixed-Effects Models (Miwa & Baayen, 2020; van Rij, 2015; van Rij et al., 2020).
5. Further applications to bilingualism
Onset estimates can enrich L2 processing theories in several ways. Consider, for example, the claim made by capacity-based accounts that processing is slower in L2 than L1 due to limits on lexical access speed and working-memory capacity (Dekydtspotter & Renaud, 2014; Hopp, 2013; McDonald, 2006). These two constructs can already be measured quantitatively using word recognition and working-memory span tasks, but it is unclear how well they predict processing speed during sentence comprehension. Having precise estimates of prediction speed would allow us to answer this question and provide a more precise evaluation of capacity-based accounts.
Another useful application concerns L2 accounts that posit that non-native and native speakers weigh different kinds of information differently in processing (Clahsen & Felser, 2018; Cunnings, 2017). Some of this research has found that L2 speakers are often slower (or less sensitive) than native speakers to syntactic information, but more sensitive to discourse-level information like extra-sentential context and semantic plausibility (Felser & Cunnings, 2012; Pan, Schimke & Felser, 2015; Roberts & Felser, 2011). Having a method to formally establish when different information sources affect L2 processing would provide key data to test these claims.
Finally, our bootstrapping method can be adapted to quantify variability between speakers. For example, our failure to find Spanish vs. English group differences may have resulted from the demographic properties of our sample (e.g., potential differences in L2 age of acquisition or proficiency). While data from individual participants are noisier than averaged data, the analysis presented here could be performed on a by-participant basis given a sufficient number of trials (Reingold & Sheridan, 2014), allowing us to examine the correlation between individual divergence points and factors like proficiency and L2 exposure. Together with the estimation of by-group timing effects, we believe that quantifying individual variability will prove crucial to improving models of bilingual processing.
Supplementary Material
For supplementary material accompanying this paper, visit https://doi.org/10.1017/S1366728920000607
List of supplementary materials:
S1. Demographic profiles of the language groups.
S2. Comparison of different statistical tests in the bootstrap approach.
S3. Comparison of bootstrap- and GAMM-derived divergence points.
S4. Null hypothesis tests of the bootstrapped estimates.
Acknowledgements
We thank Harald Baayen, Pantelis Bagos, Dale Barr, Jason Geller, Ben Goodrich, Carrie Neal Jackson, Lauren Kennedy, Dorothea Pregla, João Veríssimo, and Titus von der Malsburg for valuable comments and feedback. Kate Stone and Sol Lago were supported by a German Research Council project awarded to Sol Lago (grant number LA 3774/1-1). We also thank Lisa Becker for her assistance in building the experiment and collecting data.