1. Introduction
Learning speech in a second language (L2) after puberty is a difficult task characterized by a great deal of individual variation. Some learners achieve high-level L2 oral proficiency while others have tremendous difficulty doing so. These differences may be due not only to the amount of time spent practicing the target language, but also to learners’ ability to make the most of every opportunity for input via a range of perceptual and cognitive abilities relevant to L2 acquisition (i.e., aptitude; Wen & Skehan, 2021). Examining these abilities can provide a deeper theoretical understanding of the mechanisms that underlie and drive how learners process input, convert it into intake, and acquire a new language. There has been much debate about whether such mechanisms are specific to language learning or generalizable across various kinds of learning behaviors (domain-specific vs. domain-general; e.g., Hamrick et al., 2018), and whether they differ in first language (L1) and L2 acquisition (the degree of awareness; e.g., Diaz et al., 2016). An examination of this topic also has considerable pedagogical relevance. An understanding of individual aptitude profiles could help teachers identify students who would likely benefit more from certain types of instructional approaches. For example, those with explicit language learning aptitude would likely benefit more from a language-focused approach (e.g., metalinguistic instruction), while those with stronger implicit learning aptitude may benefit from implicit and meaning-oriented instruction (i.e., aptitude-treatment interaction; DeKeyser, 2012).
In this paper, I will first briefly review a range of aptitude frameworks relevant to L2 speech learning and then introduce an emerging paradigm that holds that having a good ear (i.e., auditory processing precision) serves as an anchor of L1 acquisition and L2 speech learning in adulthood (i.e., the Auditory Precision Hypothesis-L2). Auditory processing is a complex of domain-general perceptual abilities related to encoding the acoustic characteristics of sounds. Since auditory processing is the first ability that learners rely on to extract linguistic information from spoken input, any individual differences in this ability are thought to affect various dimensions (segmentals, suprasegmentals, vocabulary, morphosyntax) and phases (speed of learning, ultimate attainment) of language learning. Finally, I will discuss how we can assess L2 students’ auditory processing ability (e.g., our team's offline test deposited at L2 Speech Tools for Researchers & Teachers [http://sla-speech-tools.com/]) and make a range of pedagogical suggestions about how such assessments could be used to provide more effective instruction. Following the aptitude-treatment interaction paradigm, I will explain how L2 learners with diverse aptitude profiles (explicit vs. implicit; acuity vs. integration; strong vs. poor) can be encouraged to understand, speak, and master their L2 through profile-matched training programs (explicit vs. incidental; naturalistic vs. classroom; phonetic vs. auditory).
2. What is L2 speech learning?
According to Saito and Plonsky's (2019) framework, L2 speech proficiency comprises: (a) the ability to perceive and produce novel (or partially acquired) consonantal and vocalic sounds in an L2 without deleting them or substituting L1 counterparts (i.e., segmental proficiency); (b) the ability to use adequate and varied stress (characterized by longer, louder, and higher-pitched syllables) at the word (e.g., correct assignment of word stress) and sentence (appropriate use of intonation for declarative and interrogative intentions) levels (i.e., melodic and prosodic proficiency); and (c) the ability to deliver speech at an optimal tempo without making too many pauses or repetitions/self-corrections (i.e., rhythmic and temporal proficiency). The last two dimensions have often been collectively described as “suprasegmental proficiency” (Trofimovich & Baker, 2006). The development of precise, robust, and refined L2 segmental and suprasegmental representations is fundamental for reaching advanced levels of listening (Field, 2008) and speaking proficiency (Levis, 2006). With solid L2 segmental and suprasegmental representations, L2 learners can more easily process phonologically similar and complex words (Saito, 2013), perceptually non-salient morphosyntactic markers (Goldschneider & DeKeyser, 2001), and a range of discourse functions (Brazil, 1997), all of which underpin successful oral communication (Isaacs et al., 2018).
Scholars have extensively examined how learners’ L1 phonetic systems influence their L2 speech acquisition. Major frameworks addressing this topic include the Speech Learning Model (Flege & Bohn, 2021), the Perceptual Assimilation Model (Best & Tyler, 2007), the Structural Conformity Hypothesis (Eckman, 2004), and the Optimality-Theoretic Model (Escudero & Boersma, 2004). These theoretical accounts share the view that the phonetic distance between the L1 and L2 systems is partially responsible for determining the degree of speech learning difficulty. For example, very few Japanese speakers can perceive and produce the English [r] and [l] contrast at a nativelike level because the relevant auditory and articulatory cues are not actively used in the L1 system (third formant frequencies and labial, alveolar, and pharyngeal constrictions; Iverson et al., 2003).
Another line of research has explored which individual difference factors predict advanced L2 speech acquisition. For instance, a wide body of research suggests that factors related to the quantity (how much learners are exposed to and practice a target language), quality (with whom learners use a target language [L1 vs. L2 users]), and timing of language experience (how early participants started learning a target language and arrived in an L2-speaking country; e.g., Derwing & Munro, 2013) are related to L2 speech learning outcomes.
However, research has shown that experience factors alone cannot fully explain the variability in ultimate L2 speech attainment. In one of my projects, for example, I examined the accuracy of English [r] production among approximately 200 L1 Japanese-L2 English late bilinguals in Canada (Saito, 2015). All participants had extensive immersion experience (length of residence > 6 years), had arrived in Canada after puberty, and used their L2 (English) every day as a primary language of communication. Despite their similar backgrounds and overall speaking proficiency, analysis of the participants’ performance on word reading and picture description tasks showed that their degree of English [r] pronunciation attainment varied widely: some demonstrated nativelike pronunciation while others had detectable L1 accents.
One hypothesized source of the individual variation observed in Saito (2015) and other studies (e.g., Abrahamsson & Hyltenstam, 2008) is aptitude, that is, a talent for processing L2 input more efficiently and/or effectively, resulting in larger learning gains in the long run (Doughty, 2019). But what characterizes aptitude for successful L2 speech learning? To answer this question, I will first provide a selective review of the role of aptitude in adult L2 speech learning.
3. What is L2 speech aptitude?
Fifty years of research have provided evidence that aptitude plays an important role in L2 vocabulary and morphosyntax learning (for comprehensive overviews, see Li, 2016; Wen & Skehan, 2021). This has led to the development of thorough conceptual and methodological aptitude frameworks, such as the Modern Language Aptitude Test (MLAT; Carroll & Sapon, 1959), LLAMA (Meara, 2005), and Hi-LAB (Linck et al., 2013). Although the existing aptitude tests do include audio materials (e.g., sound recognition in LLAMA-D) and refer to their relevance to speech learning at a broad level (e.g., phonemic coding in MLAT for “oral proficiency”; Baker Smemoe & Haslam, 2013), very few studies have examined in depth to what degree, how, and why these aptitude tests tap into the development of segmental, melodic, and temporal proficiency. In their focused review, Trofimovich et al. (2015) pointed out that “there has been little systematic research on the relationship between various components of aptitude and L2 pronunciation learning” (p. 354).
To fill this gap, I surveyed a range of studies on this topic (i.e., aptitude and speech learning), focusing particularly on those published since Trofimovich et al.'s (2015) call for further research (Table 1). Skehan (2016) provided a set of useful frameworks that researchers can use to survey and categorize different types of L2 aptitude. The frameworks include: (a) linguistic focus (which dimensions of language is the aptitude related to?), (b) domain generality (is the aptitude specific to L2 learning or applicable to all learning behaviors?), and (c) explicitness (is the aptitude associated with explicit or implicit learning?). Accordingly, the following criteria were considered in the creation of a framework for L2 speech learning aptitude:
• Is the aptitude relevant to segmental learning (enhancing consonant and vocalic accuracy) or suprasegmental learning (melody and rhythm)?
• Is the aptitude domain-specific or domain-general?
• Is the aptitude associated with explicit or implicit learning?
Note. Phonemic coding refers to sound-symbol correspondence (featured in Carroll and Sapon's [1959] MLAT framework); tonal and rhythm imagery refers to sensitivity to differences in melody and rhythm (featured in Gordon's [1995] Music Aptitude Profile).
As summarized in Table 1, a growing, though still limited, number of studies have explored the relationship between aptitude and L2 speech learning outcomes. The results of this body of work show that different types of aptitude (explicit vs. implicit; domain-general vs. domain-specific) uniquely relate to different areas of L2 speech learning (segmental vs. suprasegmental learning).
4. Auditory processing as an emerging aptitude framework
More recently, some scholars (including our team) have begun to conceptualize, test, and elaborate on an aptitude framework based on a very simple hypothesis: that having a “good ear” (i.e., domain-general auditory processing ability) is the root of language acquisition (Mueller et al., 2012). Since auditory processing is the first ability that infants rely on to parse incoming linguistic input, the detection and interpretation of acoustic information underlies every stage of phonetic, phonological, lexical, and morphosyntactic learning (and delays therein). Thus, it is possible that auditory processing can explain the rate and ultimate attainment of L2 acquisition as well.
Auditory processing refers to a set of lower-order abilities related to precisely perceiving individual dimensions of acoustic information, such as pitch (the perception of the lowest, fundamental frequency of a sound wave), formants (acoustic energy concentrations resulting from resonance), duration (length of sounds), and intensity (loudness of sounds). Consistent with an influential view in cognitive psychology, auditory processing can be considered domain-general, forming the basis of multiple domain-specific phenomena, such as music, emotion, environmental sounds, and language (Kraus & Banai, 2007). To measure such domain-general abilities, researchers prepare a set of synthesized stimuli. Since these stimuli have very simple acoustic characteristics (e.g., completely flat fundamental frequency and formant contours), normal-hearing listeners will not perceive them as speech. While exposed to these nonverbal stimuli, participants are assessed on their ability to precisely perceive one particular acoustic dimension (e.g., pitch, duration).
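To illustrate the kind of synthesized nonverbal stimulus described above, here is a minimal Python sketch; it is my own illustration rather than the stimulus-generation code used in the studies cited here, and the frequency, duration, and ramp values are arbitrary assumptions. It builds a harmonic complex tone with a completely flat F0 contour and writes out a pair of such tones that differ only in pitch.

```python
import numpy as np
from scipy.io import wavfile

def flat_f0_tone(f0=330.0, dur=0.5, sr=44100, n_harmonics=6, ramp_ms=15):
    """Synthesize a harmonic complex tone with a completely flat F0 contour.

    Because the pitch, duration, and amplitude envelope are constant and
    there is no formant movement, listeners hear a 'beep', not speech.
    All parameter values are illustrative assumptions.
    """
    t = np.arange(int(dur * sr)) / sr
    # Sum the first n harmonics of f0 with decreasing amplitude
    wave = sum((1.0 / h) * np.sin(2 * np.pi * f0 * h * t)
               for h in range(1, n_harmonics + 1))
    wave /= np.max(np.abs(wave))
    # Linear onset/offset ramps to avoid audible clicks
    ramp = int(sr * ramp_ms / 1000)
    env = np.ones_like(wave)
    env[:ramp] = np.linspace(0, 1, ramp)
    env[-ramp:] = np.linspace(1, 0, ramp)
    return (wave * env).astype(np.float32)

# Two stimuli that differ only in F0 (e.g., 330 Hz vs. 336 Hz);
# everything else (duration, intensity, spectral shape) is held constant.
wavfile.write("tone_330.wav", 44100, flat_f0_tone(330.0))
wavfile.write("tone_336.wav", 44100, flat_f0_tone(336.0))
```

A pair of tones like these, identical in every respect except the one manipulated dimension, is the minimal contrast on which the discrimination tasks described below are built.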
In the context of language learning, this domain-general ability is thought to play a key role in the development of phonology, vocabulary, and morphosyntax. For example, infants rely on auditory processing to detect the probabilities of individual phonemes in the L1 system within the first six to eight months of their life (Werker, 2018). During this critical period, every phoneme can be statistically defined in accordance with the different weighting of multiple acoustic cues, such as pitch (F0), first formant (F1), second formant (F2), third formant (F3), duration, and intensity (Kuhl, 2004). Auditory processing is also instrumental to the identification of word and phrase boundaries (Cutler & Butterfield, 1992), syntactic structures (Penner et al., 2001), and morphosyntactic markers (Joanisse & Seidenberg, 1998; Koester et al., 2004).
In terms of developmental trajectory, children reach adult-like auditory processing within the first eight to ten years of life (e.g., Thompson et al., 1999 for pitch discrimination; Elfenbein et al., 1993 for duration discrimination). From their early 20s onwards, however, auditory processing gradually declines over the rest of the lifespan (Skoe et al., 2015; but see Thompson et al., 2015 for the relatively later peak and slower decline of audio-motor integration abilities).
Based on these observations, many L1 acquisition researchers have put forth the hypothesis that auditory impairments are a source of many language problems (Goswami, 2015); that is, if someone experiences deficits in auditory processing, these immediately affect their speech perception, which could, in turn, prevent them from detecting, developing, and consolidating speech categories, and could lead to a range of global language problems. For example, auditory processing measures have been suggested as a diagnostic tool for dyslexia (Hornickel & Kraus, 2013) and other language-related disorders (Russo et al., 2008).
There is ample cross-sectional and longitudinal evidence showing that auditory individual differences among normal-hearing children are significantly tied to a range of L1 outcomes (e.g., speech-in-noise perception, vocabulary use, literacy, and phonological awareness) (Anvari et al., 2002; Bavin et al., 2010; Boets et al., 2008; Tierney et al., 2021; for evidence on how auditory processing influences L1 vocabulary development over the first three years of life, see Kalashnikova et al., 2019). In addition, correlation studies have shown a medium-to-large relationship between reading difficulty and auditory deficits for various dimensions of nonverbal sounds (see McArthur & Bishop, 2005 for frequency; Casini et al., 2018 for duration; Goswami et al., 2011 for amplitude rise time).
Because the Auditory Precision Hypothesis concerns causality, it is naturally subject to a great deal of controversy. Specifically, some scholars have argued that not all dyslexic children and adults have auditory deficits (see Rosen, 2003 for an overview). From a methodological point of view, it is important to remember that behavioral tasks for measuring auditory perception (e.g., A × B discrimination; for details, see below) inevitably tap into a set of higher-order executive skills (e.g., attentional control, memory), in addition to lower-order skills. For instance, the highly repetitive and abstract nature of laboratory tasks may make it difficult for participants to maintain auditory information in working memory and thus may limit how much information is available for acoustic analysis (Zhang et al., 2016). Accordingly, individuals with language impairments may perform poorly on auditory processing tasks because of problems with both auditory processing and executive functioning, which suggests that any link between auditory processing and linguistic deficits could be confounded with higher-order cognitive abilities (Gooch et al., 2014; Henry et al., 2012; Snowling et al., 2018).
5. Auditory Precision Hypothesis-L2
More recently, researchers have begun to explore how well the Auditory Precision Hypothesis generalizes to adult L2 speech learning (i.e., the Auditory Precision Hypothesis-L2; Mueller et al., 2012). This concurs with the assumption underlying major L2 speech theories that the mechanisms driving successful L1 speech acquisition remain active throughout the lifespan and are germane to any new speech learning experience (e.g., Flege & Bohn, 2021). In this paper and elsewhere (e.g., Saito et al., 2020b), I would like to further argue that auditory processing could be particularly consequential in post-pubertal L2 learning (relative to L1 acquisition). This is arguably owing to the quantitative and qualitative differences between L1 and L2 learning experiences.
Because L1 learners are normally exposed to an extensive amount of spoken language, they may be able to overcome auditory-based difficulties via remedial strategies. For example, those with pitch deficits (amusics) can still process phrase boundaries normally by using durational rather than pitch information (Jasmin et al., 2020). In contrast, the amount of communicatively authentic and interactive input that L2 learners receive is generally highly limited in classroom settings (Muñoz, 2014), and subject to a great deal of individual variation in naturalistic settings (Derwing & Munro, 2013). Thus, L2 learners may have more difficulty developing a similar range of remedial strategies.
Furthermore, unlike L1 acquisition, which is free of influence from prior language learning experience, L2 speakers need to encode spectro-temporal patterns through already-developed and automatized L1 perception strategies (see McAllister et al., 2002 for the feature account of adult L2 speech learning). That is, to acquire new speech categories, L2 speakers not only need to adjust their already-attuned cue weighting patterns (e.g., Chinese speakers need to use both pitch and duration to perceive L2 English prosody; Jasmin et al., 2021), but also need to learn and develop new perception strategies that they do not actively use in their L1 (e.g., Japanese speakers need to discriminate variation in F3 to perceive English [r] and [l]; Iverson et al., 2003).
6. Components of auditory processing
Extending several popular aptitude frameworks in second language acquisition (SLA) (Skehan, 2016) and L2 speech (see Table 1), I propose four different auditory processing abilities particularly relevant to adult L2 speech learning. Under this 2 × 2 model (see Table 2), the key distinctions concern: (a) whether the abilities relate to L2 speech learning with or without awareness (i.e., explicit vs. implicit) and (b) whether the abilities concern the processing of formants or of prosodic information, such as pitch, duration, and intensity. Scholars have operationalized auditory processing of duration and intensity via amplitude rise time (i.e., the time from the onset of a sound to its maximum amplitude; Goswami, 2015).
7. Explicit acuity
Explicit acuity concerns how subtle a difference in a particular acoustic dimension (e.g., formant, pitch, duration, and intensity) learners can encode. This ability is behaviourally measured via A × B discrimination tasks, in which participants hear three nonverbal sounds, one of which is different from the other two, and must indicate which sound differs. The sounds featured in this task are typically synthesized stimuli whose acoustic dimensions are identical except for one. As shown in Table 2, learners’ sensitivity to the first, second, and third formants (F1, F2, and F3) is thought to relate to segmental learning, and their sensitivity to prosodic information (fundamental frequency [F0], duration, and amplitude rise time) is thought to relate to suprasegmental learning.
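To make the trial structure concrete, here is a minimal sketch of one odd-one-out discrimination trial of the kind described above. It is purely illustrative: `play` and `get_response` are hypothetical placeholders for whatever presentation and response-collection routines an experiment platform provides.

```python
import random

def run_oddity_trial(standard, deviant, play, get_response):
    """One odd-one-out discrimination trial: three sounds, one differs.

    `play` and `get_response` are placeholders for an experiment platform's
    own presentation and response-collection functions (assumed, not real APIs).
    Returns True if the listener correctly locates the deviant sound.
    """
    deviant_pos = random.randint(0, 2)           # where the odd sound occurs
    sequence = [standard, standard, standard]
    sequence[deviant_pos] = deviant
    for sound in sequence:                       # e.g., 330 Hz, 330 Hz, 336 Hz
        play(sound)
    answer = get_response()                      # listener reports 1, 2, or 3
    return answer - 1 == deviant_pos
```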
Lengeris and Hazan (2010) used this type of task to measure the formant acuity of L1 Greek learners of English. A total of 51 stimuli, which differed in terms of a single formant (analogous to vowel F2 = 1,250–1,500 Hz), were developed and presented to participants. Those who were capable of perceiving smaller differences in formants demonstrated greater learning gains when intensively exposed to multi-talker English vowels. Similarly, Qin et al.'s (2021) study with 32 Mandarin learners of Cantonese found that participants with more precise pitch acuity (F0 = 100.07–178.17 Hz) benefited more from intensive exposure to multi-talker Cantonese tones.
8. Implicit acuity
Implicit acuity concerns learners’ ability to track a particular acoustic dimension at a subconscious level. Our research team has so far explored whether and to what degree auditory processing can predict the ultimate attainment of high-level L2 speech proficiency. To reach such an advanced stage of speech development, we assume that learners need years of naturalistic and classroom learning experience. In addition, we assume that they need explicit and implicit auditory processing abilities that allow them to maximize any learning opportunities, regardless of awareness. In our recent studies (Kachlicka et al., 2019; Saito et al., 2019a, 2020a; Sun et al., 2021), we have proposed using electroencephalography (EEG) to measure how the brain tracks and reacts to the acoustic characteristics of sounds at a subcortical level (i.e., implicit acuity).
Among the many EEG paradigms in L2 speech research (see Diaz et al., 2016 for a comprehensive overview), we have adopted the frequency following response (FFR) to study the subcortical auditory system (Coffey et al., 2016). During FFR tasks, participants engage in a meaning-oriented activity (e.g., reading for pleasure, watching silent movies) while listening to a range of synthesized nonverbal sounds. As attention is not required in this task, FFR data can be assumed to reflect an unconscious sensitivity to certain aspects of acoustic signals (formants, pitch) without the contaminating influence of cognitive and affective states. A growing body of FFR research has shown that those with more precise encoding of formants tend to attain more advanced L2 segmental proficiency (e.g., Saito et al., 2019a, 2020a) and that those with more precise encoding of pitch gain more from pitch-based artificial language training (e.g., Chandrasekaran et al., 2012).
9. Empirical evidence
Our research team has conducted a series of cross-sectional and longitudinal projects to examine the complex relationships among auditory processing, experience, and L2 speech learning. We recruited more than 400 L2 speakers of English from Poland, Spain, China, Japan, and Vietnam who had studied in naturalistic and/or classroom conditions. Those participants with any immersion experience (range < 1 to 20 years) had arrived in an L2-speaking country after the age of 17 (i.e., late bilinguals), on the assumption that they used L2 English with detectable L1-related accents. We measured participants’ L2 comprehension and production proficiency via measures of segmentals, suprasegmentals, vocabulary, and morphosyntax. Next, we assessed their auditory processing profiles via behavioral and EEG measures. Finally, we surveyed their biographical backgrounds, gathering data on experience-related variables (length of foreign language education and residence, daily L1/L2 use) and age-related variables (chronological age, age of learning, and age of arrival).
The findings were published separately in several papers between 2020 and 2022. Adopting cross-sectional and/or longitudinal designs, each paper linked various types of auditory processing (explicit, implicit) to different dimensions (segmentals, suprasegmentals, vocabulary, morphosyntax), modes (perception, production), and stages (early, mid, final) of L2 speech learning. By analyzing these studies as a group, it is possible to synthesize their findings and identify suggestive patterns.
First and foremost, the results of multiple regression and mixed-effects modeling analyses showed that performance scores were associated with biographical and auditory processing factors to a similar degree. As visually summarized in Figure 1, half of the explained variance was attributable to how much participants had practiced the target language in classroom settings and how much they had been using it on a daily basis in immersion settings; the other half was accounted for by their auditory processing ability.
In terms of the type of auditory processing, explicit auditory processing appeared to be important at every stage of adult L2 learning (e.g., Saito et al., 2020b for longitudinal analyses of the first year of immersion), while implicit auditory processing had stronger predictive power for experienced, long-term L2 residents (length of residence = 1–10 years; Kachlicka et al., 2019; Saito et al., 2019a, 2019b, 2020a, 2020b; cf. Sun et al., 2021 for short-term residents with less than one year of residence). Interestingly, the effects of auditory processing were relatively weak among L2 learners in classroom settings (Saito et al., 2020b, 2021, 2022), probably because such learners receive and process only a limited amount of aural input (but see the “Different types of auditory processing” section below).
Furthermore, in the context of 70 Japanese speakers of English with varied experience and proficiency levels, Saito et al. (in press-a) examined the extent to which auditory processing and cognitive abilities interacted to determine the rate of success in L2 speech proficiency. The results of the correlation analyses showed that all variables were related to L2 speech outcomes to a similar degree. More interestingly, the results of the factor analyses showed that auditory processing and explicit cognitive abilities (phonological short-term memory, executive functions, and declarative memory) clustered into two different categories (see Table 3). Of course, these findings are tentative, as they need to be replicated with L2 learners from different L1 backgrounds (e.g., Polish, Spanish, and Vietnamese). However, the study at least hints at the possibility that auditory processing is distinct from explicit cognitive abilities and is instead related to implicit and procedural memory. These findings support the view that tests of auditory processing may trigger implicit statistical learning of the distribution of stimuli across trials (combining the prior stimulus distribution with the acoustic representation of each incoming stimulus; Raviv et al., 2012; for a more detailed discussion of the role of implicit statistical learning in auditory processing, see Saito et al., in press-a).
Taken together, there are three main observations from the empirical research. First, auditory processing appears to be a relatively independent construct. Second, individual differences in auditory processing may serve as a moderate-to-strong determinant of post-pubertal L2 speech acquisition, especially if learners engage with a great deal of authentic, conversational auditory input on a daily basis. The first two observations lead me to propose the third: that even adult L2 learners may draw on language learning mechanisms similar to those used for L1 acquisition, and that these mechanisms have a lifelong impact on the rate and ultimate attainment of language learning (for a comprehensive summary of auditory processing in L1 and L2 acquisition research, see Saito et al., 2021).
10. Future directions
10.1 Offline test development and dissemination
To facilitate follow-up studies on the role of auditory processing in L2 speech learning, our team has developed an open-source, freely available auditory processing test battery that researchers, students, and practitioners can use. The test comprises four subcomponents (formant discrimination, pitch discrimination, duration discrimination, and amplitude rise time discrimination) following an A × B discrimination task format (see Figure 2). The tasks adopt Levitt's (1971) adaptive procedure, wherein task difficulty decreases (i.e., the difference becomes wider) or increases (i.e., the difference becomes smaller) based on participants’ trial-by-trial performance. Ultimately, the test allows us to measure the extent to which participants can perceive subtle differences in one of four types of domain-general acoustic information: second formant (1,500–1,700 Hz), fundamental frequency (300–360 Hz), stimulus duration (250–500 ms), and the timing of amplitude change (15–300 ms). Test materials and a user manual are deposited at Tools for Second Language Speech Research and Teaching (Mora-Plaza et al., 2022; http://sla-speech-tools.com/).
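The adaptive logic can be sketched as follows. This is a generic transformed up-down (two-down/one-up) staircase in the spirit of Levitt (1971), written in Python for illustration; the starting difference, step size, and stopping rule are my own assumptions rather than the battery's actual parameters, and `run_trial` stands in for a trial routine such as the one sketched earlier.

```python
def adaptive_staircase(run_trial, start_diff=60.0, min_diff=0.5,
                       step_factor=0.5 ** 0.5, n_reversals=8):
    """Generic transformed up-down (two-down/one-up) staircase (Levitt, 1971).

    `run_trial(diff)` should present one discrimination trial at the given
    stimulus difference (e.g., an F0 difference in Hz) and return True for
    a correct response. All numeric defaults are illustrative assumptions.
    """
    diff, correct_streak, going_down = start_diff, 0, None
    reversals = []
    while len(reversals) < n_reversals:
        if run_trial(diff):
            correct_streak += 1
            if correct_streak == 2:              # two correct in a row: make it harder
                correct_streak = 0
                if going_down is False:
                    reversals.append(diff)       # direction changed: record a reversal
                going_down = True
                diff = max(diff * step_factor, min_diff)
        else:
            correct_streak = 0                   # one error: make it easier
            if going_down is True:
                reversals.append(diff)
            going_down = False
            diff = diff / step_factor
    # Threshold estimate: mean difference at the later reversal points
    return sum(reversals[2:]) / len(reversals[2:])
```

A two-down/one-up rule converges on the stimulus difference that a listener discriminates correctly about 70.7% of the time, with the mean of the later reversal points serving as the threshold estimate.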
Evidence for the reliability of these instruments was provided in a test-retest study with 100 L1 and L2 speakers (Saito & Tierney, in press-e). The study found that the intraclass correlations for the different tasks could be considered “fair” to “good” (ICC(2,2) = .4–.6). This suggests that these behavioural measures can reliably tap into various dimensions of participants’ supposedly stable perceptual acuity abilities (Moore, 2012). To further examine the source of individual variation in participants’ auditory processing scores, future research could examine the auditory processing profiles of participants with varied biographical backgrounds (e.g., L1 vs. L2 vs. L3 speakers; classroom vs. immersion learners; tonal vs. non-tonal language speakers; musicians vs. non-musicians). For instance, our tentative evidence suggests that auditory processing is relatively stable regardless of experience-related variables (e.g., length and intensity of immersion and foreign language education) but may be subject to the influence of age-related variables (e.g., Saito et al., 2020a, 2022 for chronological age; Saito et al., 2020a for age of arrival). Future studies on this topic will shed light on what characterizes the individual variation observed in explicit auditory processing ability.
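For readers who wish to run comparable test-retest reliability checks on their own data, the sketch below computes the Shrout and Fleiss ICC(2,k) (two-way random effects, average of k measurements) from an n-participants-by-k-sessions score matrix. It is a minimal illustration under my own assumptions; the toy numbers are invented and are not data from the study above.

```python
import numpy as np

def icc_2k(scores):
    """Shrout & Fleiss ICC(2,k): two-way random effects, average of k measures.

    `scores` is an (n_participants x k_sessions) array, e.g., test and retest
    thresholds from the same discrimination task.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_total = np.sum((x - grand) ** 2)
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)   # between participants
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)   # between sessions
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)

# Toy example: five listeners' pitch-discrimination thresholds (Hz), test vs. retest
print(round(icc_2k([[2.1, 2.4], [6.0, 5.1], [3.3, 3.0], [9.8, 8.7], [4.5, 5.2]]), 2))
```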
10.2 Enhancing auditory processing
If auditory processing matters for L2 acquisition, one relevant question is, “Can it be enhanced via focused training?” In the L1 hearing literature, some studies have shown that a few hours of training can boost various dimensions of auditory processing among children with language disorders (see Merzenich et al., 1996 for temporal acuity; Micheyl et al., 2006 for pitch acuity; Whiteford & Oxenham, 2018 for audio-motor integration). In turn, these children can reach optimal auditory thresholds and subsequently make the most of every input opportunity in their L1.
Following this line of work, our team's current study examined whether domain-general auditory processing (i.e., precise representation of sounds) can be improved via focused online training and whether this affects speech learning (Saito et al., in press-c). Ninety-eight adult Japanese speakers were divided into two training groups targeting the acquisition of English [æ] and [ʌ]: an auditory training group and a phonetic training group. The auditory training group completed activities designed to improve their ability to use second formant frequency (1,200–1,600 Hz) to discriminate between nonverbal sounds. The phonetic training group was taught to discriminate between English [æ] and [ʌ] using multi-talker speech stimuli. The results showed that the phonetic training group improved only their English [æ] and [ʌ] identification, while the auditory training group enhanced both auditory and phonetic skills. The results suggest that auditory acuity to key, domain-general acoustic cues (F2 = 1,200–1,600 Hz) anchors, triggers, and promotes speech learning at a domain-specific level (English [æ] vs. [ʌ]). The findings also suggest that auditory training could help remediate difficulties with L2 speech learning in some individuals with auditory deficits.
10.3 Different types of auditory processing (beyond acuity)
Thus far, auditory processing has been conceptualized as the ability to encode subtle acoustic characteristics of sounds (i.e., perceptual acuity). On a broader level, auditory processing can also comprise a range of neighboring abilities, such as attending to particular acoustic dimensions while ignoring others (i.e., auditory selective attention) and using acoustic information for motor action (i.e., audio-motor integration). There is emerging evidence that these different types of auditory processing are more or less relevant to different dimensions of L2 speech learning.
On the one hand, perceptual acuity and audio-motor integration appear to be good indices of successful L2 speech learning in naturalistic settings. Since such immersion experience can provide learners with ample L2 input and output opportunities, those with more precise acuity and integration can better encode the acoustic dimensions of new sounds and then integrate this information into their L2 system more efficiently and effectively. As a result, these learners can achieve more advanced L2 speech proficiency (e.g., Saito et al., 2022; Zheng et al., 2022).
On the other hand, the rate of learning success in classroom settings appears to be linked to audio-motor integration but not to perceptual acuity. In many English-as-a-Foreign-Language (EFL) classrooms, L2 learners typically learn the target language through decontextualized, production-based teaching methods (e.g., mechanical repetition and memorization of model pronunciation forms). Such learning environments do not provide an abundant amount of contextually rich, communicatively authentic input (Shintani et al., 2013). Owing to this asymmetry (output > input), learners’ audio-motor integration (but not their perceptual acuity) has been found to impact the outcomes of classroom L2 speech learning (e.g., Saito et al., 2021 for Vietnamese EFL classrooms; Shao et al., 2022 for Chinese EFL classrooms).
10.4 Aptitude-treatment interaction
In L2 morphosyntax learning, there is a well-researched hypothesis stating that learners with greater explicit aptitude will benefit more from explicit training, and those with greater implicit aptitude will benefit more from implicit training (for comprehensive reviews, see DeKeyser, 2012; Fu & Li, 2021). Following this line of thought, it would be intriguing to examine the extent to which auditory processing tests can be used as a diagnostic tool for providing profile-matched instructional approaches.
As reviewed earlier, it has been shown that learners with high-level explicit auditory processing benefit from explicit, language-focused speech training such as high variability phonetic training (e.g., Lengeris & Hazan, 2010; Qin et al., 2021). Few studies have examined the relationship between auditory processing (or any measure of aptitude, for that matter) and the effectiveness of incidental, implicit, and meaning-oriented L2 speech training, arguably because scholars have almost exclusively used intentional, explicit, and language-focused training to date. Nevertheless, a small number of scholars have proposed using communicative focus on form (Lee & Lyster, 2016), task-based pronunciation training (Mora & Levkina, 2017), and phonological recasts (Saito, 2021) in this regard.
In accordance with the notion of incidental and multimodal auditory categorization learning in the field of cognitive psychology (Lim & Holt, 2011), our team has developed and tested the pedagogical potential of a target-shooting video game that aims to support segmental acquisition among Japanese learners of English (Saito et al., in press-b). In this game, participants are told that the faster they shoot the targets, the more points they earn. Unknown to the participants, each target is accompanied by unique English consonant and vowel sounds. As such, participants are incidentally guided to use speech cues (L2 vowels and consonants) and acquire a series of novel foreign sounds as a by-product of playing the game. The findings of Saito et al. showed that participants’ overall gains were similar to those of comparable explicit training (e.g., high variability phonetic training: overt identification of target contrasts followed by trial-by-trial feedback), but that the degree of improvement varied widely among individual participants. Follow-up studies are called for to investigate whether the effectiveness of this type of training is related to explicit and implicit auditory processing ability.
There is also a possibility that learners’ degree of auditory precision in general (relatively strong or relatively poor) could help determine the extent to which they might benefit from phonetic training (using speech stimuli) versus auditory training (using non-speech stimuli). Phonetic training alone could be sufficient for L2 learners with strong auditory processing skills, as they are more capable of encoding the acoustic dimensions of new sounds and are likely to show larger gains when they receive various types of intensive L2 speech training (see Lengeris & Hazan, 2010 for high variability phonetic training; Shao et al., 2022 for shadowing training; Sun et al., 2021 for five months of study abroad).
Conversely, such an approach (phonetic training only) could be confusing and/or have adverse effects when used with L2 learners with poorer auditory processing. Poor auditory processing prevents learners from detecting the novel acoustic characteristics of L2 speech while minimizing interference from their L1, from extracting reliable acoustic cues (while ignoring irrelevant ones), and from attaining robust L2 speech perception (e.g., pitch contour for the acquisition of lexical tones, Perrachione et al., 2011; formants and duration for the acquisition of vowels, Ruan & Saito, forthcoming).
As a remedial strategy, I propose that those with relatively low auditory processing ability may benefit from auditory training prior to phonetic training. During auditory training, learners are exposed to acoustically simple and monotonous non-speech sounds that are manipulated along a single acoustic parameter. This can guide learners to focus on enhancing their sensitivity to the most useful dimensions of L2 speech (e.g., F2 = 1,200–1,600 Hz for English [æ] and [ʌ]; Saito et al., in press-c).
10.5 Auditory processing and different aspects of L2 learning
In a broader sense, L2 speech proficiency concerns one's ability to access multiple dimensions of linguistic knowledge while comprehending and producing language at a global level. Intuitively, it is unsurprising that auditory processing can explain some of the variance in the phonological aspects of L2 speech learning, because auditory input processing is most directly linked to segmental and suprasegmental acquisition. The question has now become: To what degree does auditory processing matter not only for the acquisition of lower-order linguistic information (phonology), but also for the acquisition of higher-order linguistic information (vocabulary and grammar)? Auditory precision plays an important role in word segmentation (Norris & McQueen, 2008) and the identification of word and phrase boundaries (Cutler & Butterfield, 1992). Further, auditory precision facilitates the detection of suffixes, inflections, and articles (Joanisse & Seidenberg, 1998) and word order (Penner et al., 2001). Since auditory processing is involved in every stage of L2 speech learning, future research can further explore how this ability differentially promotes the development of phonology, vocabulary, and grammar in a complementary fashion (for some emerging evidence, see Kachlicka et al., 2019; Saito et al., in press-d).
11. Conclusion
In this paper, I have introduced the auditory precision paradigm from L1 acquisition as a way to understand the complex mechanisms underlying adult L2 speech learning (i.e., the Auditory Precision Hypothesis-L2). First and foremost, everyone can learn new sounds and achieve comprehensible, intelligible, communicatively adequate, and functional L2 oral proficiency as long as they practice the target language on a daily basis with a good level of motivation and willingness to communicate (Derwing & Munro, 2013). Here, the Auditory Precision Hypothesis-L2 is in line with major L2 speech learning theories in that both consider the quantity, quality, and intensity of experience as crucial determinants of L2 speech learning (e.g., Flege & Bohn, 2021 for the Speech Learning Model).
However, much individual variation has still been found in the levels of attainment among highly experienced, regular, motivated, and functional L2 learners: only some are able to reach a stage of proficiency where they are almost indistinguishable from native speakers of the target language. These differences in learning outcomes exist not only because of the amount of time spent practicing the target language, but also because some learners are more cognitively and perceptually adept at making the most of every opportunity for input, which can lead to larger and more robust gains in the long run (Doughty, 2019).
An “auditory precision view” of L2 speech learning predicts that individuals with a good ear (i.e., precise auditory processing) are able to make the most of every input opportunity. That is, more precise auditory processing helps learners better capture the acoustic dimensions of L2 speech input (McAllister et al., 2002), adjust to new cue weighting patterns (Jasmin et al., 2021), develop new speech categories (or revise existing ones; Flege & Bohn, 2021), and continue to refine these categories to a near-nativelike level in the long run (Abrahamsson, 2012). The Auditory Precision Hypothesis-L2 assumes that domain-general and pre-categorical sound processing abilities govern language learning throughout the lifespan and play a key role in late L2 speech learning (Mueller et al., 2012).
Given that auditory processing is fundamental to parsing L2 aural input, any lower-order problems will likely slow down other L2 speech learning processes, even if learners have relatively strong working memory and attentional control, receive ample input, and/or are motivated to practice the target language (Perrachione et al., 2011; Ruan & Saito, forthcoming). Going forward, both researchers and practitioners are encouraged to carry out more auditory processing research that can provide insight into the different types of speech training participants may benefit from (e.g., explicit speech training for learners with strong explicit auditory processing). In addition, more research is called for that explores how tests of auditory processing can be used to diagnose learners with relatively low-level auditory precision. This latter group of L2 learners may greatly benefit from auditory processing training, especially prior to L2 speech training and immersion experience. This will, in turn, help ensure that all L2 learners can reduce the challenge of learning a new language despite any disadvantages they may have at the level of auditory processing.
Acknowledgments
I am grateful to the following team members for their huge contributions to all the foundational projects that we worked on together: Adam Tierney, Hui Sun, Magdalena Kachlicka, Yui Suzukida, Ingrid Mora-Plaza, and Katya Petrova. I also gratefully acknowledge the funders who have supported our research activities: the Leverhulme Trust (RPG-2019-039), the Spencer Foundation (202100074), and the Economic and Social Research Council (ES/S013024/1).
Kazuya Saito is Associate Professor in Applied Linguistics at University College London, UK. His research interests include how second language learners develop various dimensions of their speech in naturalistic settings; and how instruction can help optimize such learning processes in classroom contexts. He is Co-Founder of Tools for Second Language Speech Research and Teaching, wherein a range of online pronunciation research and teaching materials are freely available (http://sla-speech-tools.com/).