Hostname: page-component-cd9895bd7-lnqnp Total loading time: 0 Render date: 2024-12-27T07:55:20.709Z Has data issue: false hasContentIssue false

Information encoding and transmission profiles of first-language (L1) and second-language (L2) speech

Published online by Cambridge University Press:  18 August 2021

Ann R. Bradlow*
Affiliation:
Department of Linguistics, Northwestern University, Evanston, Illinois, USA
*
Address for correspondence: Ann Bradlow, Northwestern University Department of Linguistics 2016 Sheridan Road Evanston, IL, 60208 USA E-mail: abradlow@northwestern.edu
Rights & Permissions [Opens in a new window]

Abstract

Inspired by information theoretic analyses of L1 speech and language, this study proposes that L1 and L2 speech exhibit distinct information encoding and transmission profiles in the temporal domain. Both the number and average duration of acoustic syllables (i.e., intensity peaks in the temporal envelope) were automatically measured from L1 and L2 recordings of standard texts in English, French, and Spanish. Across languages, L2 acoustic syllables were greater in number (more acoustic syllables/text) and longer in duration (fewer acoustic syllables/second). While substantial syllable reduction (fewer acoustic than orthographic syllables) was evident in both L1 and L2 speech, L2 speech generally exhibited less syllable reduction, resulting in low information density (more syllables with less information/syllable). Low L2 information density compounded low L2 speech rate yielding very low L2 information transmission rate (i.e., less information/second). Overall, this cross-language comparison establishes low information transmission rate as a language-general, distinguishing feature of L2 speech.

Type
Research Article
Creative Commons
Creative Common License - CCCreative Common License - BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
Copyright © The Author(s), 2021. Published by Cambridge University Press

Introduction

When individuals know two or more languages, variation in the onset and extent of exposure to each language typically results in a functional imbalance across the first-language (L1), learned from early and extended exposure, and the second language (L2), learned from later and more limited exposureFootnote 1. This experience-dependent L1-L2 imbalance is evident in L1 versus L2 speech patterns. Just as listeners can identify individual talkers (e.g., Schweinberger & Zaske, Reference Schweinberger, Zaske, Fruhholz and Belin2018) and languages (e.g., Vaughn & Bradlow, Reference Vaughn and Bradlow2017) from short speech samples, listeners can also determine whether an utterance was produced in L1 or L2 speech (i.e., sounds “unaccented” or “foreign-accented”) from short snippets of speech in a known language (Flege, Reference Flege1984; Munro, Derwing & Burgess, Reference Munro, Derwing, Burgess, Solé, Recasens and Romero2003, Reference Munro, Derwing and Burgess2010) and even in an unknown language (Bond, Stockmal & Markus, Reference Bond, Stockmal and Markus2003, Reference Bond, Stockmal and Markus2008; Major, Reference Major2007). These observations suggest patterns of speech production associated with an L1 or L2 speech “mode” regardless of the language being spoken or of language-specific L1-L2 interactions.

The present study seeks to identify language-general acoustic properties that distinguish L1 and L2 speech and that may underlie the observed rapid identification of speech samples as either L1 or L2 speech. Furthermore, inspired by information theoretic analyses of L1 speech and language production, this study proposes that L1 and L2 speech exhibit distinct information encoding and transmission profiles in the temporal domain. Accordingly, the present study compares speech timing patterns in L1 and L2 speech in three languages: English, French, and Spanish. This is done in terms of two key information theoretic parameters: information density (number of speech units for a given meaning) and information transmission rate (information conveyed per second). Identification of a language-general, information-driven temporal parameter along which L1 and L2 speech differ would broaden our perspective on the communicative impact of speech communication between speakers from different language backgrounds to include the impact of L1 versus L2 speech mode on the dynamics of information flow via the speech channel. This study thus represents an initial step towards characterizing the language-general L1 versus L2 speech mode in quantifiable information theoretic terms that can be automatically applied to a multi-lingual multi-talker speech corpus. This empirical base can then support further speculation and theorizing regarding cognitive mechanisms that underlie L1 versus L2 speech production and their impact on communicative efficiency (i.e., beyond speech intelligibility) under a variety of conversational conditions.

Prior work on information encoding and transmission in language and speech supports two inter-connected claims: (1) variation in complexity at all levels of linguistic structure is related to variation in predictability given the context (i.e., conditional entropy, or surprisal/redundancy), and (2) the distribution of information across utterances tends toward uniform density of encoding (i.e., constant degree of surprisal/redundancy) in order to optimize information transmission via the capacity-limited speech channel (Aylett & Turk, Reference Aylett and Turk2004; Crocker, Demberg & Teich, Reference Crocker, Demberg and Teich2016; Jaeger, Reference Jaeger2010; Jaeger & Tily, Reference Jaeger and Tily2011; Levy, Reference Levy2008). This information theoretic perspective on language and speech has provided insightful accounts of intra-language variation. For example, in order to adhere to a principle of “uniform information density” (Jaeger, Reference Jaeger2010; Jaeger & Tily, Reference Jaeger and Tily2011; Levy, Reference Levy2008) or “smooth signal redundancy” (Aylett & Turk, Reference Aylett and Turk2004, Reference Aylett and Turk2006; Turk, Reference Turk2010), syntactic and phonetic reduction phenomena – a major source of intra-language variation – abound when surprisal is low (i.e., when redundancy, or predictability, is high – see also Bell, Brenier, Gregory, Girand & Jurafsky, Reference Bell, Brenier, Gregory, Girand and Jurafsky2009; Cohen Priva, Reference Cohen Priva2015; Gahl, Yao & Johnson, Reference Gahl, Yao and Johnson2012; Johnson, Reference Johnson, Yoneyama and Maekawa2004; Jurafsky, Bell, Gregory & Raymond, Reference Jurafsky, Bell, Gregory, Raymond, Bybee and Hopper2001 and many others).

A related line of research operationalized information density for cross-language comparisons in terms of the number of linguistic units over which a given meaning, or text, is distributed (Coupé, Oh, Dediu & Pellegrino, Reference Coupé, Oh, Dediu and Pellegrino2019; Fenk-Oczlon & Fenk, Reference Fenk-Oczlon and Fenk2010; Pellegrino, Coupé & Marsico, Reference Pellegrino, Coupé and Marsico2011). This research proposed the syllable as an appropriate unit for cross-language comparison based on its universal applicability – all languages have a unit that is comprised of a sequence of segments even if the phonological status of this unit varies – and its quantifiability in terms of intensity peaks in the speech signal (acoustic syllables) or by metalinguistic counts from first-language speakers (phonological syllables). Languages that express a given meaning (i.e., direct translations of a given text) in relatively few syllables have higher syllable information density (more information conveyed per syllable) than languages that express the same meaning in relatively many syllables. Importantly, these cross-language comparisons have revealed a trade-off between speech rate and information density such that languages with relatively low information density (few syllables for a given meaning) exhibit relatively high speech rates (many syllables per second), and vice versa. Thus, while languages vary substantially in syllabic speech rate, the rate of information transmission (information conveyed per second) is more constrained across languages (Pellegrino et al., Reference Pellegrino, Coupé and Marsico2011; Coupé et al., Reference Coupé, Oh, Dediu and Pellegrino2019). Together, the intra-language tendency toward uniform information density (or, smooth signal redundancy) through suprisal-modulated complexity (or, redundancy-induced reduction), and the inter-language constraint on information rate through a trade-off between syllable rate and syllable information density provide converging evidence for optimization of information encoding and transmission as a universal tendency in speech and language production.

Given that a defining difference between L1 and L2 speech is variation in experience with the language, we might expect significant differences across L1 and L2 speech in surprisal/redundancy-driven modulation of complexity. L1 talkers have deeply entrenched implicit knowledge of the statistical structure of the language including the conditional probabilities of linguistic units at all levels (i.e., from phonotactic-level to syntactic-level probabilities). L1 talkers therefore exhibit reduction patterns that are finely tuned to the contours of surprisal (or conversely, redundancy) during L1 speech production (e.g., see Brandt, Reference Brandt2019 for an extensive study of suprisal and phonetic reduction). Specifically, L1 speakers systematically increase information density through phonetic reduction (same meaning distributed over less phonetic material) precisely where contextually determined suprisal is low (or redundancy is high – see also Aylett & Turk, Reference Aylett and Turk2004, Reference Aylett and Turk2006; Turk, Reference Turk2010). In contrast, the language models of L2 talkers likely developed on the basis of explicit instruction about canonical grammatical structures, including dictionary-based pronunciations of words, rather than on early and continuous experience with naturally occurring conversational speech. Furthermore, the language model that underlies L2 speech production will be influenced by interactions between the L1 and L2 sound structures, including the size and structure of their phoneme inventories, their characteristic patterns of phoneme combination (i.e., the phonotactics), and their prosodic structures. Thus, as a consequence of the different L1 and L2 learning contexts and the L1-L2 interactions that influence L2 speech, the language models of L1 and L2 talkers may diverge substantially, leading to different estimations of suprisal/redundancy for particular words and phonemes in a given context, and thus to different patterns of context-dependent reduction. Regardless of whether surprisal/redundancy-driven phonetic reduction is primarily talker-oriented (i.e., arises from unit selection and production planning processes), listener-oriented (i.e., serves to ensure accurate communication) or driven by general evolutionary dynamicsFootnote 2, this view predicts different L1 and L2 patterns of phonetic reduction. While phonetic reduction is evident to some extent in both L2 and L1 speech (e.g., Schertz & Ernestus, Reference Schertz and Ernestus2014) and particularly when related to repetition within a discourse (e.g., Gustafson & Goldrick, Reference Gustafson and Goldrick2018), where direct comparisons have been performed, L2 speech is generally more conservative in this respect than L1 speech (e.g., Baker, Baese-Berk, Bonnasse-Gahot, Kim, Van Engen & Bradlow, Reference Baker, Baese-Berk, Bonnasse-Gahot, Kim, Van Engen and Bradlow2011; Spilková, Reference Spilková2014; Rallo Fabra, Reference Rallo Fabra2015; Oh & Lee, Reference Oh and Lee2016; Li, Li, Luo & Mok, Reference Li, Li, Luo and Mok2018). For example, Baker et al. (Reference Baker, Baese-Berk, Bonnasse-Gahot, Kim, Van Engen and Bradlow2011) found that L2 talkers exhibited less phonetic reduction than L1 talkers for segments in function words relative to in content words. Similarly, Spilková (Reference Spilková2014), Oh and Lee (Reference Oh and Lee2016), and Li et al. (Reference Li, Li, Luo and Mok2018) all found that both L1 and L2 talkers of English exhibited predictability-related phonetic reduction – however, the degree of reduction was smaller for the L2 English talkers, particularly those with lower L2 proficiency, than for the L1 talkers; and Rallo Fabra (Reference Rallo Fabra2015) found less extensive unstressed vowel reduction in English for early and late Spanish-English bilinguals as compared to English monolinguals. This pattern of less phonetic reduction in L2 than in L1 speech results in overall lower information density for L2 speech than for L1 speech since, for any given text (i.e., meaning), the L2 production involves more phonetic content than the L1 production.

Another salient, language-general difference between L1 and L2 speech is overall speech rate. Slower speaking rates (fewer syllables or words per second) for L2 versus L1 speech have been shown for L2 speech in various languages, including English (Guion, Flege, Liu & Yeni-Komshian, Reference Guion, Flege, Liu and Yeni-Komshian2000; Baese-Berk & Morrill, Reference Baese-Berk and Morrill2015), Spanish (Garcia Lecumberi, Cooke & Wester, Reference Garcia Lecumberi, Cooke and & Wester2017), Japanese (Idemaru, Gubbins & Wei, Reference Idemaru, Gubbins and Wei2019), and Dutch (De Jong, Groenhout, Schoonen & Hulstijn, Reference De Jong, Groenhout, Schoonen and Hulstijn2015). Slower L2 speech rates have also been shown for between-language comparisons within individuals (e.g., Towell & Dewaele, Reference Towell, Dewaele and Dewaele2005; Derwing, Munro, Thomson & Rossiter, Reference Derwing, Munro, Thomson and Rossiter2009; De Jong et al., Reference De Jong, Groenhout, Schoonen and Hulstijn2015; Bradlow, Blasingame & Kim, Reference Bradlow, Blasingame and Kim2017). Taken together, this research establishes speech rate as a phonetic parameter that is strongly influenced by L1 versus L2 mode independently of the language(s) involved.

Both speech rate and phonetic reduction influence the flow of linguistic information as the speech signal unfolds in time. For a given utterance a slower speech rate necessarily implies a slower information transmission rate (fewer bits of information conveyed per unit of time). Similarly, since less phonetic reduction implies production of more phonetic material and since all phonetic material has some acoustic duration, the tendency of L2 speech to involve fully articulated instead of reduced forms also effectively decreases the rate of information transmission for L2 relative to L1 speech. Thus, the characteristically slow rate of L2 speech presumably combines with the tendency of L2 speech towards fully articulated rather than phonetically reduced forms (i.e., low information density) to produce an L2 speech mode with a very low information transmission rate. Importantly, this low L2 information transmission rate results from a combination of slow speaking rate and low information density, a combination that is contrary to the efficiency-driven trade-off between speech reduction and information density. The present study explores this information theoretic perspective on L2 speech through a close examination of syllable rate and syllable information density – two global speech properties that together influence information transmission rate – in L1 and L2 speech across three languages.

English, French, and Spanish are well-suited to this multi-language analysis because all three are widely spoken as both L1 and L2 and differ notably in their sound structures. In the World Atlas of Languages Structures Online (WALS, Dryer & Haspelmath, Reference Dryer and Haspelmath2013), English and French are listed with extreme values for each of three critical phonological features: segment inventory size (both have large inventories), consonant-to-vowel ratio (both have low ratios), and complex syllable structures (both have complex syllable structures). In contrast, Spanish is listed with the central value on the WALS 3- or 5-point scales for each of these phonological features: average segment inventory, average consonant-to-vowel ratio, and moderately complex syllable structure (Maddieson, Reference Maddieson, Dryer and Haspelmath2013a, Reference Maddieson, Dryer and Haspelmath2013b, Reference Maddieson, Dryer and Haspelmath2013c, Reference Maddieson, Dryer and Haspelmath2013d; Goedemans & van der Hulst, Reference Goedemans, van der Hulst, Dryer and Haspelmath2013). In addition, English, French, and Spanish each exhibit several contextually conditioned phonological processes that affect timing patterns in connected speech. These include, amongst many others, unstressed vowel reduction in English, liaison in French, and consonant lenition in Spanish. While identification of the phonetic and perceptual correlates of isochrony is debated in the phonetics and psycholinguistics literature, it is note-worthy that the traditional rhythm-class hypothesis (Pike, Reference Pike1945; Abercrombie, Reference Abercrombie1967) classifies English as stress-timed while French and Spanish are syllable-timed. Finally, downloadable recordings of a standard reading passage (the North Wind and Sun passage) in both L1 and L2 speech were available for all three languages.

To recap, this study aims to identify a language-general acoustic signature of the contrast between L1 and L2 modes of speech production in terms of information encoding and transmission in the temporal-domain. Through a novel application of information theoretic concepts to both L1 and L2 speech in three languages, this study represents an initial step towards a broader goal of gaining new insights into underlying mechanisms and communicative consequences of L1 versus L2 speech production. With the empirical contours of information transmission via the L2 speech channel in sharper focus, we will be poised for further speculation and hypothesizing for this broader research agenda.

Methods and materials

The dataset for the present study consisted of 351 observations of 21 variables. Each observation is based on a digital speech recording of a given text by a given talker, and the variables are derived from automatic phonetic measurements applied to each recording. The digital speech recordings were taken from two separate speech corpora, the Northwestern University ALLSSTAR Corpus (Bradlow, Reference Bradlown.d.; described in detail in Bradlow et al., Reference Bradlow, Blasingame and Kim2017 and freely available to the public at https://speechbox.linguistics.northwestern.edu/#!/?goto = allsstar) (Bradlow, Reference Bradlown.d.) and the University of Toronto Romance Phonetics Database (Colantoni & Steele, Reference Colantoni and Steele2004, freely available to the public at http://rpd.chass.utoronto.ca/). The complete dataset and analysis scripts (R and Praat) are freely available from the Open Science Foundation (OSF) web-based repository via the following link: https://osf.io/vawdb/.

An overview of the 351 digital speech recordings included in this dataset is shown in Table 1. A total of 256 English recordings came from the ALLSSTAR Corpus. These English recordings consisted of readings of two texts by each of 128 talkers, the North Wind and the Sun (NWS) passage and a set of longer, complex sentences taken from the Universal Declaration of Human Rights (DHR). These 128 talkers included all of the L1 (n = 26) and L2 (n = 98) English talkers for whom English NWS and a complete set of English DHR recordings were available. All talkers were recruited from the Northwestern University community. The L2 talkers were mostly international graduate students. A small number of L2 talkers were family members of international students. English proficiency levels based on test scores (The VersantTM English Test, n.d.) were available for most (73/98 or 75%) of the L2 English talkers and showed a concentration at the intermediate level (43 talkers with Versant scores of 47–68 / 80) and advanced/near-native level (29 talkers with Versant scores of 69–80 / 80). Only one L2 English talker for whom proficiency information was available was at the beginner level (Versant score < 47 / 80). The L2 talkers came from 21 different L1 backgrounds distributed as follows: Cantonese (n = 11), Mainland Mandarin, (n = 13), Turkish (n = 13), Korean (n = 11), Spanish (n = 11), Hindi (n = 5), Brazilian Portuguese (n = 5), Russian (n = 5), Hebrew (n = 4), Vietnamese (n = 4), Farsi(Persian) (n = 3), Japanese (n = 3), German (n = 2), Singaporean Mandarin (n = 1), Taiwanese Mandarin (n = 1), French (n = 1), Gishu (n = 1), Greek (n = 1), Gujarati (n = 1), Indonesian (n = 1), and Runyankore (n = 1). The 26 L1 English talkers (14 females) had an average age of 20 years, and the 98 L2 English talkers (36 females) had an average age of 25 years at the time of recording.

Table 1. Overview of recordings. See text for detailed explanations.

* Only two talkers reported an age of L2 acquisiton of 0 years, one with Cantonese as the L1 (ALLSSTAR Corpus, L2 English) and one with Tagalog as the L1 (RPD Corpus, L2 Spanish).

Recordings from 103 talkers (61 recordings in French + 42 recordings in Spanish) came from the Romance Phonetics Database (RPD, http://rpd.chass.utoronto.ca/, Colantoni & Steele, Reference Colantoni and Steele2004), an on-line research and teaching tool that includes recordings of individual words and passages in several Romance Languages (French, Italian, Portuguese, Romanian, and Spanish). Included in this database is a set of recordings of the NWS passage in L1 French (n = 14), L2 French (n = 47), L1 Spanish (n = 19) and L2 Spanish (n = 23). The RPD also includes a small number of NWS recordings in Italian, Portuguese, and Romanian; however, these were not included in the present study due to too few recordings in L2 speech (4 in L2 Italian, 0 in L2 Portuguese, and 3 in L2 Romanian) for meaningful comparison with either their L1 counterparts (2 in L1 Italian, 1 in L1 Portuguese, and 5 in L1 Romanian) or with other L2s included in the RPD (47 talkers in L2 French and 23 talkers in L2 Spanish) or with L2 English from the ALLSTAR Corpus (98 talkers).

The 14 L1 French (8 females) and 19 L1 Spanish (13 females) talkers had average ages of 31 and 29 years, respectively. The 47 L2 French (37 females) and 23 L2 Spanish (10 females) talkers had average ages of 27 and 31 years, respectively. The L2 French talkers came from 7 different L1 backgrounds distributed as follows: Albanian (n = 1), Arabic (n = 1), Mainland Mandarin (n = 3), Czech (n = 5), English (n = 31), Russian (n = 3), and Spanish (n = 3). Of the L2 Spanish talkers, 22 had English as their L1 and one L2 Spanish talker spoke Tagalog as their L1. L2 proficiency levels are available for most of the talkers from the RPD (63/70 or 90%) showing a concentration at the intermediate and advanced/near-native levels (French: beginner = 4, intermediate = 17, advanced/near-native = 24; Spanish: intermediate = 7, advanced/near-native = 15).

Table 2 provides an overview of the 21 variables that describe each of the 351 recordings. Of these 21 variables, eleven are grouping or identifying variables, seven are phonetic parameters, and three are information transmission variables. The grouping and identifying variables provide information about the talkers and recordings from the ALLSSTAR Corpus and Romance Phonetics Database that were included in the dataset.

Table 2. Overview of variables. See text for detailed explanations.

The seven phonetic variables were automatically extracted from each recording using a published Praat script (de Jong & Wempe, Reference De Jong and Wempe2009)Footnote 3 that detects intensity peaks surrounded by intensity dips of at least 2 dB and that rise above a threshold that is determined as 25 dB below the 99th quantile of the intensity maximum for the entire sound file. The 2 dB and 25 dB settings are adjustable defaults designed to minimize the influence of non-speech sound bursts in the intensity peak picking process. Furthermore, peaks that occur during unvoiced portions of the signal are excluded. The output of the Praat script provides the basis for calculating three measures of temporal structure: nsyll (number of acoustic syllables), SR (speech rate) and AR (articulation rate). SR is the rate of syllable production over the entire recording including pauses and other major disfluencies (nsyll/total duration), while AR is the rate of syllable production with silent pauses and other major disfluencies removed (nsyll/total duration minus pauses and disfluencies). By definition, for any given recording, AR is greater than SR. Because pausing and other disfluencies may be more prevalent in L2 speech than in L1 speech (e.g., see de Jong, Reference De Jong2016; Trouvain, Fauth & Möbius, Reference Trouvain, Fauth and Möbius2016; Matzinger, Ritt & Fitch, Reference Matzinger, Ritt and Fitch2020 and references therein), the magnitude of L2 versus L1 differences in SR will, if anything, be larger than L2 versus L1 differences in AR. In order to adopt a conservative stance, this study focuses on AR rather than SR. Any AR differences between L2 and L1 speech would likely be larger for SR. Average syllable duration is calculated based on AR (i.e., average syllable duration is equal to 1/AR).

The three information transmission variables were information density (ID), information rate (IR), and syllable reduction (LOSS). While articulation rate expresses the number of speech units (acoustic syllables) per unit of time, information density expresses the amount of information encoded in each speech unit. The relation between units of information (i.e., meaning) and speech units is highly complex, abstract, and involves a degree of non-compositionality that precludes a straightforward mapping. However, following an approach developed by researchers at the Laboratoire Dynamique Du Langage (CNRS / Université Lyon 2) (Pellegrino et al., Reference Pellegrino, Coupé and Marsico2011; Coupé et al., Reference Coupé, Oh, Dediu and Pellegrino2019), we can compare the relative information densities of various productions of a given text (i.e., for a fixed meaning). As discussed above, this prior work compared speech rate, information density, and information rate across languages with a focus on the syllable as the relevant speech unit. In the present study, we adopt this approach to compare the syllable information density for productions of a given text by L1 and L2 talkers within each of three languages, English, French, and Spanish – that is, within each language, we compare the number of acoustic syllables over which the information in the given text (i.e., the message of the DHR sentences or the NWS fable) is distributed for L1 versus L2 speech. For example, Talker A may exhibit a higher rate of syllable reduction for a given text than Talker B; in which case, Talker A's production of the text has a higher information density than that of Talker B because Talker A's production encodes the information of the text in fewer acoustic syllables than that of Talker B. The syllables of Talker A's utterance would, on average, encode a greater proportion of the total information of the text than the syllables of Talker B's production of the same text.

For the purposes of the present within-language comparisons, syllable information density, ID, for each talker, T, for each text, K, was expressed as the inverse of the number of acoustic syllables produced, as shown in (1) below,

(1)$${\rm I}{\rm D}_{{\rm KT}} = {\rm avgInfoPerSy}{\rm l}_{{\rm KT}} = 1/ {\rm nSyl}{\rm l}_{{\rm KT}}$$

where nSyllKT is the number of acoustic syllables in the current talker's, T's, production of text K. The interpretation of this value is the average proportion of information encoded per acoustic syllable. Relatively low information density (ID) values indicate that the current talker distributes the information contained in the text over a relatively large number of acoustic syllables each of which encodes a relatively low proportion of the total information content of the text. Conversely, relatively high information density (ID) values indicate production of relatively few acoustic syllables each of which encodes a relatively high proportion of the total information of the text.

The information rate, IR, for each recording was calculated based on the average acoustic syllable duration (i.e., the inverse of the number of acoustic syllables per second) and information density, ID (i.e., the average proportion of the total information of the text encoded per acoustic syllable), as in (2) below,

(2)$${\rm I}{\rm R}_{{\rm KT}} = {\rm avgInfoPerSe}{\rm c}_{{\rm KT}} = {\rm I}{\rm D}_{{\rm KT}}/ {\rm avgSylDu}{\rm r}_{{\rm KT}} = ( {1/{\rm nSyl}{\rm l}_{{\rm KT}}} ) / {\rm avgSylDu}{\rm r}_{{\rm KT}}$$

where IRKT, IDKT, and avgSylDurKT are the acoustic syllable information rate, information density, and average acoustic syllable duration for the current talker's, T's, production of text K, respectively. The information conveyed per second, IR, is thus the information encoded per acoustic syllable divided by the seconds per acoustic syllable, to yield the information conveyed per second. The units for IR are: (info/syl) / (sec/syl) = (info/syl) * (syl/sec) = info/sec.

A relatively high IR, can result from either a high information density (relatively few, information dense syllables for a given meaning), a fast articulation rate (short durations of acoustic syllables), or a combination of high ID and fast AR. Conversely, a relatively low IR can result from either a low information density (relatively many, information sparse syllables for a given meaning), a slow articulation rate (few syllables per second), or a combination of low ID and slow AR. As discussed in great depth in Pellegrino et al., Reference Pellegrino, Coupé and Marsico2011 and Coupé et al., Reference Coupé, Oh, Dediu and Pellegrino2019, across languages there appears to be a trade-off between information density and speech rate that imposes a limit on cross-language variation in information rate: languages with high information density tend to also exhibit slow speaking rates and vice versa.

In the present study, we compare information rate (IR) of L1 and L2 speech within each text (English DHR, English NWS, French NWS, and Spanish NWS). For the L2 recordings, the IR values might be expected to reflect either a compounding of ID and AR differences (if both ID and AR are lower for L2 than L1 speech) or a counteraction of ID and AR differences (if ID is higher but AR is lower for L2 than L1 speech). IR comparisons across L1 and L2 speech are therefore interpreted as indices of syllabic information transmission efficiency taking into account both duration and number of syllables produced for a given text.

Finally, syllable loss (LOSS) is a measure of syllable level phonetic reduction as expressed through the difference between the number of acoustic and orthographic syllables. Automatic measurement of speech rate based on acoustic syllable detection is an objective, purely signal driven technique that proceeds without reference to the linguistic content of the utterance, individual talker characteristics, or other parameters related to language-specific phonotactics. Due to connected speech processes that result in extensive phonetic variation (including, but not limited to, redundancy-driven reduction processes), acoustic syllables are not transparently related to the underlying abstract metrical and information bearing units of language production and perception. Specifically, the number of acoustic syllables produced for a given utterance will frequently differ from the sum of canonical, phonological syllables in the words that constitute the text of the utterance based on syllable counts in, for example, a pronunciation dictionary or as counted by L1 speakers in a meta-linguistic syllable-counting task (e.g., by tapping out the number of syllables in a word). In order to relate acoustic syllables to orthographic syllables and to understand any observed variation across L1 and L2 speech in the information transmission variables that are based on acoustic syllables (i.e., information density, ID, and information rate, IR), we define syllable-level phonetic reduction, σLOSS, as the number of acoustic syllables for a given utterance relative to the number of orthographic (i.e., phonological) syllables that constitute the linguistic structure of the utterance as in (3) below:

(3)$${\rm \sigma }_{{\rm LOSS}} = ( {{\rm \sigma }_{\rm a}\ndash {\rm \sigma }_{\rm o}} ) / {\rm \sigma }_{\rm o}$$

where σa is the number of acoustic syllables and σo is the number of orthographic syllables.

For the English recordings in the present study, the number of orthographic words and syllables was obtained using an English syllable counting function implemented in R (Kendall, Reference Kendall2013). This function applies a general rule for syllable counting based roughly on the number of orthographic vowels with adjustments for common orthographic deviations from the one-vowel-one-syllable rule. For example, a word's syllable count is decreased if it contains ‘tia’ (e.g., ‘inertia’ has 3 syllables but 4 orthographic vowels) and a word's syllable count is increased if it contains ‘ism’ (e.g., ‘schism’ has 2 syllables but 1 orthographic vowel). The function also specifies a list of exceptional cases to which the user can add any words that are not already included in this list of exceptions. For the DHR sentences, two words were added to the list of “special” 2-syllable words, “peaceful,” and “movement,” and two words were added to the list of “special” 3-syllable words, “entitled,” and “realized.” A close inspection of the counts returned by the function determined that the syllable counts for all other words in the English texts were acceptable with a conservative count for the few words (less than 1%) with ambiguous syllabification due to sonorants in coda position. These include ‘cruel’ (2 syllables), ‘prior’ (2 syllables), ‘realized’ (3 syllables), and ‘hours’ (2 syllables). It is therefore possible that for some speakers the rates of English syllable reduction were very slightly over-estimated. For the French and Spanish recordings, the number of orthographic words and syllables were hand counted and checked with L1 speakers. All orthographic word and syllable counts are shown in Table 3.

Table 3. Number of sentences, words, and orthographic (i.e., phonological) syllables in the NWS passages and DHR sentences.

2. Results

Table 4 shows summary statistics (mean and standard error of the mean) for six variables: (1) articulation rate (AR), (2) acoustic syllable duration, (3) number of acoustic syllables, (4) information density (ID), (5) information rate (IR), and (6) syllable reduction (LOSS). Data are shown for L1 speech and L2 speech for each of the four texts, English DHR, English NWS, French NWS, and Spanish NWS. Variables (2) and (3) are included for reference but are not included in visualizations (figures 1 and 2) or statistical modeling since they are directly derived from Variables (1) and (4), respectively. Variable (2), acoustic syllable duration is the inverse of variable (1), articulation rate. Variable (3), number of acoustic syllables is the inverse of variable (4), information density.

Table 4. Articulation rate (AR), acoustic syllable duration, number of acoustic syllables, information density (ID), information rate (IR), and acoustic syllable reduction (LOSS) by talker group (L1 versus L2) and by recording text (English DHR, English NWS, French NWS, and Spanish NWS). Data shown are means with standard error of the mean in parentheses. See text for additional explanation for each variable.

Figure 1 shows density plots for each of the four critical variables (AR, ID, IR, and LOSS) for each of the four texts (ENG_DHR, ENG_NWS, FRA_NWS and SPA_NWS). Each density plot compares the distributions for the L1 and L2 talkers. Since the texts varied substantially in terms of length (see Table 3), the data shown in these density plots were all z-transformed within their own distributions (i.e., within Text) allowing for a uniform horizontal scale for all frames in the composite figure.

Fig. 1. Density plots of articulation rate (AR), information density (ID), information rate (IR) and syllable reduction (LOSS) for the L1 and L2 groups within each recording text (English DHR, English NWS, French NWS, and Spanish NWS). All data are shown on z-transformed scales within their own distributions.

Consistent with prior comparisons of articulation rate (AR) in L1 and L2 speech, the data in Table 4 and top row of Figure 1 show slower average articulation rates (fewer acoustic syllables per second) for L2 speech than for L1 speech. On average, the L2 articulation rates for the four texts are 85% (English DHR), 85% (English NWS), 84% (French NWS), and 95% (Spanish NWS) of their L1 counterparts. Accordingly, the average syllable durations for L2 speech are longer than for L1 speech; the average L2 syllable durations are 119% (English DHR), 118% (English NWS), 120% (French NWS), and 104% (Spanish NWS) of their L1 counterparts.

In addition to articulatory slowing (fewer syllables per second), L2 speech is also characterized by a consistent increase in the number of acoustic syllables produced relative to L1 speech (second row of Figure 1). For each of the four recording sets, the average number of L2 acoustic syllables is over 100% of the L1 counterpart: 105%, 110%, 107%, and 105% for English DHR, English NWS, French NWS, and Spanish NWS, respectively – that is, the number of acoustic syllables over which the information of the text (i.e., its intended meaning) is distributed is greater for L2 than for L1 speech, and consequently, the average proportion of information encoded per syllable (the average information density, ID) is lower for L2 than for L1 speech. The ID for L2 speech in each case is less than 100% of the L1 counterpart: 95% (English DHR), 90% (English NWS), 94% (French NWS), and 95% (Spanish NWS).

This combination of acoustic syllables that are relatively long in duration and relatively sparse in information content resulted in a substantially lower information rate for L2 speech relative to L1 speech (third row of Figure 1). The L2 average information rates (proportion of the total text meaning conveyed per second) are 81%, 77%, 78%, and 90% of their L1 counterparts for English DHR, English NWS, French NWS, and Spanish NWS, respectively. Thus, for all four texts, the slower articulation rate of L2 speech is compounded by the lower L2 syllable information density, to yield a substantially lower information rate for L2 speech than L1 speech – that is, the information (or, meaning) of a given text is encoded over a relatively large number of relatively long syllables with the consequence that the proportion of the text's information that is conveyed per second is quite low for L2 speech as compared to L1 speech.

The final rows of Table 4 and Figure 1 provide a comparison of the rate of syllable reduction across L1 and L2 speech. These comparisons show that for all four texts, in both L1 and L2 speech the number of acoustic syllables in the speech signals generally falls below the number of orthographic syllables in the corresponding written texts – that is, almost all average syllable reduction values shown in Table 4 are negative (fewer acoustic than orthographic syllables). The only exception is for L2 French where the average reduction is slightly positive (indicating some syllable insertion). Critically, these average data show that for all texts there is more syllable reduction for L1 than for L2 speech. In other words, the low information density of L2 speech relative to L1 speech is due to extensive syllable-level reduction by L1 talkers rather than to ‘extra’ syllable insertion by L2 talkers.

Statistical modelling of the data was conducted in R (RStudio Version 1.3.959) using generalized linear mixed effects regression models (glmmTMB package). Separate analyses were conducted for each of the four critical dependent variables, articulation rate (AR), information density (ID), information rate (IR), and syllable reduction (LOSS). In each analysis, the effects of interest were the fixed effect of Group (L1 versus L2) and its interaction with Text (English DHR, English NWS, French NWS, and Spanish NWS). Forward contrast coding was applied to both categorical fixed factors, Group (2 levels) and Text, (4 levels). A random intercept for talker was included in the models. For all four dependent variables (AR, ID, IR, and LOSS), gaussian, beta, and gamma distributions were compared via the Akaike information criterion (AIC) resulting in selection (minimum AIC) of gaussian distributions for AR and LOSS, and gamma distributions for ID and IR. To assess the improvement in fit of the models with the interaction term (Talker*Group), log likelihood tests (anova) were run between the model with the interaction term and the model with only the additive term (Talker + Group, i.e., without the interaction term). For all four dependent variables, the model fit was significantly improved with addition of the interaction term. The model comparisons and predictors in the models with the interaction term are summarized in Tables 5 and 6, respectively.

Table 5. Summary of comparisons between models with and without the Group-by-Text interactive term. In all cases the interactive model was a significantly better fit (lower AIC) than the additive model.

Table 6. Summaries of the best fit models with the Group by Text interactive terms. The referent category in all models is ENG_DHR and L1 for the Text and L1 factors, respectively.

Pairwise comparisons for the critical effect of interest (i.e., L1 versus L2 within each text) are shown in Table 7. For all four dependent variables (AR, ID, IR, and LOSS), these comparisons show significant differences between L1 and L2 talkers within all four texts (ENG_DHR, ENG_NWS, FRA_NWS, and SPA_NWS) with some variation in the magnitude of these differences thereby yielding the significant improvement in model fit with inclusion of the Group-by-Text interaction term as shown by the model comparisons in Table 5.

Table 7. Pairwise comparisons of the estimated means between L1 and L2 within each text.

In order to gain some insight into a possible relation between L2 proficiency and the observed patterns of variation in AR, ID, IR and syllable reduction (LOSS), a subset of the data was examined by proficiency group. However, this post-hoc and unplanned division of the binary grouping variable (L1 versus L2) into more fine-grained proficiency-based sub-groups should be considered preliminary and suggestive rather than conclusive because there was no consistent and reliable proficiency assessment across the ALLSSTAR and RPD datasets, and proficiency data were not available for all talkers. The labels “Beginner,” “Intermediate,” and “Advanced/Near-L1” were intended as approximate groupings and the available data do not support identification of clear boundaries between these proficiency labels. Moreover, as noted above, no proficiency data was available for a substantial portion of the L2 talkers (approximately 17%), and the “Beginner” subgroup was very small (only 5 individuals across the full dataset). Notwithstanding these limitations, a subset of the data was examined in order to see if there was any indication that the observed L1-L2 differences in AR, ID, IR, and LOSS decrease with increasing L2 proficiency.

The sub-dataset for this proficiency-based examination excluded L2 talkers for whom L2 proficiency information was not available (n = 28). In addition, L2 talkers at the beginner level were excluded due to the small number of L2 talkers at that level of L2 proficiency (n = 5). Finally, to avoid analyses based on very small groups, the proficiency focused analyses were conducted on data that were aggregated across texts (i.e., across language groups). The sub-dataset therefore excluded data for the ENG-DHR text in order to avoid double-counting the individual talkers in the English group. The final proficiency dataset thus included data from 80.4% of the L2 talkers and 100% of the L1 talkers in the total dataset with the distribution across proficiency levels as follows: intermediate (n = 67), advanced/near-L1 (n = 68), and L1 (n = 59).

Density plots for each of the four critical variables by proficiency group are shown in Figure 2. Separate analyses were conducted for each of the four dependent variables, articulation rate (AR), information density (ID), information rate (IR), and syllable reduction (LOSS). In each analysis, the effect of interest was the fixed effect of Proficiency Group (Intermediate, advanced/near-L1, and L1). This 3-level factor was coded with the forward contrast scheme. For all four dependent variables, there was a significant effect of Proficiency Group (AR: F(191) = 74.6, p < .0001; ID: F(191) = 5.53, p < .005; IR: F(191) = 49.27, p < .0001; LOSS: F(191) = 27.55, p < .0001). Pairwise comparisons confirmed consistently significant differences for all four dependent variables between the L1 group and each of the L2 groups, intermediate (all p < .005) and advanced/near-L1 (all p < .04). The difference between the two L2 groups, intermediate and advanced/near-L1 was significant at the p < .0005 level for AR, IR, and LOSS; but the difference between these two L2 groups was not significant for ID. Thus, while this proficiency analysis should be viewed with caution, there is some suggestion that AR, ID, IR, and LOSS are all dynamic features of L2 speech and that the observed difference between L1 and L2 speech along these parameters may diminish with increasing experience with the target language.

Fig. 2. Density plots of articulation rate (AR), information density (ID), information rate (IR) and syllable reduction (LOSS) by proficiency group (L2 Intermediate, L2 Near-L1/Advanced, and L1) aggregated across texts and languages. All data are shown on z-transformed scales within their own distributions.

3. Discussion

This study set out to identify a language-general acoustic marker of L1 versus L2 modes of speech production in terms of information theoretic parameters that have proved powerful for explaining intra-language variability in L1 speech. As such, this study represents a first step towards a broader goal of understanding information transmission via an L2 speech channel. The data showed that L1 and L2 productions of the same text within each of three typologically distinct languages (English, French, and Spanish) diverged along two phonetic parameters that define the temporal structure of the speech signal at the utterance level: articulation rate (the number of acoustic syllables per second) and information density (the number of acoustic syllables over which the information of the text is distributed). Specifically, in comparison to L1 speech, L2 speech was produced with a slower rate of articulation (fewer acoustic syllables per second) and a lower information density (more acoustic syllables for the given text). In combination, these two salient features of utterance-level speech timing each compounded the other such that the proportion of the total information of the text that was conveyed per second, the information transmission rate, was substantially lower for L2 than for L1 speech. Notably, both L1 and L2 speech generally involved production of substantially fewer acoustic syllables than orthographic syllables (as counted from the text scripts) indicating that the lower average information density for L2 speech is related to avoidance of phonetic reduction rather than a predisposition towards syllable insertion. As discussed in the introduction, English, French, and Spanish were well suited to this cross-language comparison due to their divergence along several relevant dimensions of sound structure including segment inventory size, consonant-to-vowel ratio, complexity of syllable structure, and several phonological processes that affect connected speech timing. The convergent pattern of lower information transmission rate for L2 than L1 speech across these languages suggests that this may be a language-general distinguishing feature of L2 versus L1 speech.

This investigation was inspired by prior work supporting the claim that a significant portion of surface variation within and across languages can be attributed to a principle of efficiency of information encoding and transmission (see Gibson, Futrell, Piantadosi, Dautriche, Mahowald, Bergen & Levy, Reference Gibson, Futrell, Piantadosi, Dautriche, Mahowald, Bergen and Levy2019 for an overview of this approach). Under this view, a major source of intra-language variation across levels of speech and language structure is the modulation of density of information encoding through structural and/or phonetic reduction in accordance with variation in contextually-determined suprisal, or predictability. Moreover, while syllabic rate (number of syllables per second) and syllabic information density (number of syllables for a given meaning) vary substantially across languages – due to structural differences including phonotactics and phoneme inventory – cross-language variation in communicative efficiency is constrained via a trade-off between articulation rate and information density (Pellegrino et al., Reference Pellegrino, Coupé and Marsico2011; Coupé et al., Reference Coupé, Oh, Dediu and Pellegrino2019). Within this conceptual framework, which emphasizes optimization of communicative efficiency, and building on this empirical approach, which operationalizes information rate in terms of automatically measurable phonetic units, the present study establishes a solid empirical base from which we can now speculate on the underlying causes and communicative consequences of the distinct L1 versus L2 information encoding and transmission profiles.

From the perspective of the message encoder (i.e., the talker), both the slower rate and the increased number of acoustic syllables in L2 speech may originate from general features of L2 speech production regardless of the particular L2 being spoken and regardless of the talker's L1. The slower articulation rate is likely related to slower processes at multiple levels, including lexical retrieval, production planning, and speech articulation (e.g., see Broos, Duyck & Hartsuiker, Reference Broos, Duyck and Hartsuiker2018). Identification of the underlying mechanisms for this general slowing of L2 production is an active topic of current research with proposals including L1-L2 interactions (i.e., bi-directional competition and/or interference) and L1 versus L2 frequency-of-usage differences (for reviews see Kroll & Gollan, Reference Kroll, Gollan, Goldrick, Ferreira and Miozzo2014; Runnqvist, Strijkers & Costa, Reference Runnqvist, Strijkers, Costa, Goldrick, Ferreira and Miozzo2014). Regardless of the cognitive, linguistic, and/or articulatory differences between L2 and L1 speech production, the present data are consistent with prior demonstrations that the encoding of a given message in L2 speech typically results in a speech signal with a relatively slow average syllable-level temporal modulation rate (see also Bradlow et al., Reference Bradlow, Blasingame and Kim2017; Baese-Berk & Morrill, Reference Baese-Berk and Morrill2015; Guion et al., Reference Guion, Flege, Liu and Yeni-Komshian2000). For the recordings in this study, the overall average L1 and L2 articulation rates were 4.6–4.9 cycles per second and 3.9–4.4 cycles per second, respectively. For both L1 and L2 speech, these modulation rates tend towards the low end of the theta neural oscillation band of approximately 4–8 Hz, which has received considerable attention as one of a series of correspondences between speech unit durations and neural oscillator frequency ranges. Gamma (>40 Hz) and beta (15–30 Hz) oscillations correspond to phonetic features with 20–50 ms durations, theta (4–8 Hz) oscillations correspond to syllables and words with ~ 250 ms durations, and delta oscillations (<3 Hz) correspond to phrases with 500–2000 ms durations. These correspondences have inspired theories linking speech perception and neurophysiology (e.g., Poeppel, Reference Poeppel2003; Ahissar & Ahissar, Reference Ahissar, Ahissar, Konig, Heil, Bundinger and Scheich2005; Ghitza & Greenberg, Reference Ghitza and Greenberg2009; Ghitza, Reference Ghitza2011, Reference Ghitza2012; Giraud & Poeppel, Reference Giraud and Poeppel2012; Peelle & Davis, Reference Peelle and Davis2012; but see Cummins, Reference Cummins2012 for a more skeptical view of this approach). The correspondence between theta band neural oscillations and syllable rate, in particular, has inspired the proposal of “theta-syllables” (Ghitza, Reference Ghitza2013) whose definition (“a theta-cycle long speech segment located between two successive vocalic nuclei”) closely matches that of the acoustic syllable of the present study. Importantly, while the general slowing of L2 speech relative to L1 speech is quite consistent across individuals and languages, it remains within the theta range of 3.5–7.5 Hz. Thus, from an auditory processing perspective, the magnitude of the reduced articulation rate for L2 speech compared to L1 speech, though quite reliable, is probably not large enough to alter basic auditory processing that may be driven by theta-band neural oscillations.

While the processes of message encoding in L2 speech result in subtle yet reliable changes to the temporal modulation rate of the L2 speech signal, the concurrent reduced prevalence of phonetic reduction, and consequent lowering of syllable information density, alters the mapping between acoustic content and linguistic representations. Under the view that phonetic reduction is driven by a principle of uniform information density, or smooth signal redundancy, we can envision several mutually compatible sources of a lack of phonetic reduction in L2 speech. First, it is possible that during L2 speech production, optimization of information transfer through redundancy-related reduction is down-weighted in favor of resource allocation to other processes of language and speech production (lexical selection, production planning, and articulation) which are less practiced and therefore more resource demanding in L2 compared to L1 speech. This possibility is consistent with the notion of phonetic reduction (and its counterpart, phonetic strengthening) as an active, and therefore resource demanding, process of information transmission optimization that talkers implement during message encoding.

A second possibility is that connected speech in L1 and L2 involve similar reduction-related optimization processes, but the language models from which probabilities of occurrence are derived differ due to different contexts in which the language was learned and is used. Later onset and shorter duration are defining features of L2 versus L1 acquisition, and in contrast to the typical implicit L1 learning context, the L2 learning context often involves explicit instruction with an emphasis on written forms (see Lupyan & Dale, Reference Lupyan and Dale2010 for discussion of L1 versus L2 learning contexts). Furthermore, L2 learning inevitably involves interaction with the L1 – which, depending on the typological relationship between the structures of the particular L1 and L2, may be facilitatory, neutral, or inhibitory. All of these features of L2 learning would likely result in distinct L1 and L2 language models from which probabilities of occurrence are derived, resulting in different patterns of redundancy-driven phonetic reduction. Evidence that different L1 and L2 reduction patterns reflect different underlying language models rather than differences in the reduction processes per se come from studies that have demonstrated similar degrees of later-mention reduction in L2 and L1 speech (e.g., Baker et al., Reference Baker, Baese-Berk, Bonnasse-Gahot, Kim, Van Engen and Bradlow2011; Gustafson & Goldrick, Reference Gustafson and Goldrick2018; Spilková, Reference Spilková2014; Oh & Lee, Reference Oh and Lee2016; and Li et al., Reference Li, Li, Luo and Mok2018). All of these studies showed that the second or later mention of a word in a discourse (Baker et al., Reference Baker, Baese-Berk, Bonnasse-Gahot, Kim, Van Engen and Bradlow2011; Gustafson & Goldrick, Reference Gustafson and Goldrick2018; Spilková, Reference Spilková2014; Oh & Lee, Reference Oh and Lee2016; and Li et al., Reference Li, Li, Luo and Mok2018) or in a sequence of experimental trials (Gustafson & Goldrick, Reference Gustafson and Goldrick2018) was shortened relative to its first mention by a similar amount in L2 as in L1 speech suggesting similar tracking of local, discourse-level probability for L1 and L2 speech. These studies thus support the view that distinct L1 versus L2 reduction patterns are related to their distinct contexts of learning and experience rather than to down- or up-weighting optimization of information transfer depending on resource availability.

The complexity of the language models over which surprisal/redundancy is estimated has been highlighted in several detailed studies of redundancy-driven reduction in L1 speech (e.g., Aylett & Turk, Reference Aylett and Turk2004, Reference Aylett and Turk2006; Bell et al., Reference Bell, Brenier, Gregory, Girand and Jurafsky2009; Cohen Priva, Reference Cohen Priva2015; Gahl et al., Reference Gahl, Yao and Johnson2012; Johnson, Reference Johnson, Yoneyama and Maekawa2004; Jurafsky et al., Reference Jurafsky, Bell, Gregory, Raymond, Bybee and Hopper2001; Turk, Reference Turk2010). Using large corpora of spontaneous speech with linguistic annotations and extensive automatic phonetic measurements, these studies have demonstrated a tight relationship between probability of occurrence given the context and various measures of phonetic reduction in both the temporal and spectral domains. The Smooth Signal-Redundancy Hypothesis (Aylett & Turk, Reference Aylett and Turk2004, Reference Aylett and Turk2006; Turk, Reference Turk2010) has proposed that prosodic prominence and boundary structure mediate the implementation of redundancy-driven variation in articulatory strength, probably in combination with direct temporal and spectral adjustment. Importantly, effective implementation of redundancy-related reduction, whether through prosodic restructuring or through direct temporal and spectral adjustment of syllables and segments, depends on long-term knowledge and frequent usage of the language in varying contexts. It thus follows naturally that the distinct learning and usage patterns of L2 and L1 speech will strongly impact their respective patterns of redundancy-related reduction.

Furthermore, L2 learning often relies heavily on the written medium which could lead to prioritization of production targets based on orthographic forms with fully specified syllables over frequently used reduced forms. For example, the word “traveler” in the English NWS passage is frequently pronounced with two syllables in L1 English. However, in L2 English, this word is frequently pronounced with three syllables. The unstressed middle syllable of the orthographic form is often fully articulated in L2 speech but reduced in L1 speech. Similarly, in the French NWS passage, the word “enveloppé” may be pronounced with four full syllables in L2 French but with a reduced second syllable in L1 French, and in the Spanish NWS passage, the word “viento” may be pronounced with three syllables in L2 Spanish but with two syllables in L1 Spanish. These L1 reductions may reflect redundancy, or predictability, based on lexical frequency, morphological composition, phonotactic sequencing, and/ discourse context. In contrast, for L2 talkers, their learning context and usage-based experiences may prioritize the orthographically-driven pronunciation resulting in production of “extra” acoustic syllables, thereby lowering the L2 speech information density and transmission rate.

From the perspective of the listener at the receiver's end of the speech transmission chain, what might be the consequences for message decoding of the slower information transmission rate of L2 compared to L1 speech? Studies of L2 speech recognition by L1 listeners have demonstrated highly generalized improvement in L2 speech recognition accuracy following repeated exposure to L2 speech samples (e.g., Bradlow & Bent, Reference Bradlow and Bent2008; Baese-Berk, Bradlow & Wright, Reference Baese-Berk, Bradlow and Wright2013; Sidaras, Alexander & Nygaard, Reference Sidaras, Alexander and Nygaard2009; Tzeng, Alexander, Sidaras & Nygaard, Reference Tzeng, Alexander, Sidaras and Nygaard2016; Xie & Myers, Reference Xie and Myers2017; Xie, Weatherholtz, Bainton, Rowe, Burchill, Liu & Jaeger, Reference Xie, Weatherholtz, Bainton, Rowe, Burchill, Liu and Jaeger2018; Alexander & Nygaard, Reference Alexander and Nygaard2019), suggesting that listeners can learn new message decoding routines in response to novel message encoding features. However, even with effective perceptual adaptation to L2 speech in terms of improved intelligibility (i.e., improved word recognition accuracy), the slow information transmission rate of L2 speech signals would presumably limit its communicative efficiency in terms of processing time.

The divergence of intelligibility (i.e., recognition accuracy) and communicative efficiency (i.e., processing time) is also illustrated by the contrast between L2 speech and “clear” speech, a speech style that talkers often adopt when they are aware of a speech communication barrier for the listener due to, for example, a hearing impairment or environmental noise (for reviews of clear speech research, see Krause & Braida, Reference Krause and Braida2004; Smiljanic & Bradlow, Reference Smiljanic and Bradlow2009; Smiljanic, Reference Smiljanic, Pardo, Nygaard, Remez and Pisoni2021). In comparison to “plain” (or “conversational”) L1 speech, both clear speech and L2 speech are characterized by slow articulation rates (fewer syllables/second) and avoidance of phonetic reduction (i.e., lower information density). However, while clear speech is associated with enhanced intelligibility, L2 speech is associated with reduced intelligibility. Crucially, the phonetic encoding adjustments of clear speech are driven by the principle of communicative efficiency that is at the heart of the Smooth Signal Redundancy and the Uniform Information Density hypotheses. Under conditions of reduced predictability/redundancy due to environmental or listener-related degradation of the communication channel, talkers adopt a mode of message encoding that involves phonetic enhancement to compensate for lost or degraded information in the transmission channel. Thus, while both L2 speech and clear speech involve a combination of slow articulation rate and low information density to yield a low information transmission rate, they differ dramatically with respect to their impact at the receiver's end of the information transmission system. The slow articulation rate and low information density of clear speech production are in direct response to the presence of environmental or receiver-related noise, and therefore effectively enhance intelligibility and communicative efficiency in the compromised speech communication context – that is, clear speech adjustments at the transmission end are designed to counteract impedance to information flow in the transmission channel or at the receiver's end of the communication system. In contrast, the slow articulation rate and low information density of L2 speech production are internal to the message encoder (the talker) and therefore introduce an element of potential disruption to the smooth and efficient flow of information across the speech communication system as it extends from the transmitter (talker) through the communication channel (speech signal) to the receiver (listener).

The focus of the present study was on the compounding effects of slow articulation rate (few syllables/second) and low information density (many syllables for a given message) to yield a slow information rate (low proportion of total information conveyed per second) for L2 speech compared to L1 speech. A key feature of the data presented in this study is that the L1 and L2 speech recordings within each of the three languages (English, French, and Spanish) were readings of a given text. Thus, the within-language comparisons of L1 versus L2 speech were controlled for intended meaning, allowing us to view the rate and number of acoustic syllables produced in relation to the encoding of a fixed meaning in L1 versus in L2 speech. However, this dataset was rather limited in terms of statistical modelling of phonetic reduction as a function of surprisal, or probability, of linguistic forms in various contexts. Detailed analyses with large corpora of L2 speech would allow for direct comparison of the relationship between probability of occurrence given the context (i.e., redundancy, or conversely, surprisal) and phonetic reduction in L2 and L1 speech. This would, in turn, allow for deeper insight into the causes and consequences of the distinct temporal modulation pattern of L2 speech across various languages and for hypothesis-driven testing of alternative explanations for the observed differences between L1 and L2 information encoding and transmission

4. Conclusion

This study provides cross-language evidence that the characteristically slow rate of L2 speech (low number of acoustic syllables per second) combines with a tendency towards fully articulated rather than phonetically reduced forms (i.e., low information density) to produce an L2 speech style with a very low information transmission rate. This compounding of slow articulation rate with low information density in L2 speech contrasts with the efficiency-driven trade-off between speech reduction and information density that characterizes L1 speech. Future research with large-scale, multi-lingual, multi-talker corpora of L1 and L2 speech under natural, conversational conditions and in a wide range of languages are needed to gain further insight into the dynamics of communicative efficiency via L2 speech channels.

Acknowledgments

I gratefully acknowledge technical support from Chun-Liang Chan, statistical consultation with Dorina Strori, and numerous critical discussions with Matt Goldrick throughout the development of this project. This project has benefitted greatly from feedback on earlier versions that were presented at conferences (Interspeech 2019, and annual meetings of the Psychonomics Society in 2019 and 2020) as well as from presentations to the Phonatics Discussion Group at Northwestern University. Work supported by Grant R01-DC005794 from NIH-NIDCD.

Footnotes

1 Note that, while the first-acquired language is usually the dominant language (i.e., exposure to the L1 is usually both early and extended), many bilingual adults exhibit a pattern of “switched dominance” language learning – that is, for many bilingual adults (often known as “heritage speakers”), the second-acquired language, the L2, is the dominant language due to extended exposure for the L2 in contrast to early but limited exposure to the L1. In the present paper, L2 should be understood as both the second-acquired and the non-dominant (i.e., the later and less exposed) language.

2 For review and extensive discussion of these accounts, as well as a nuanced analysis of the multiple factors that modulate phonetic reduction, see Clopper and Turnbull, Reference Clopper, Turnbull, Cangemi, Clayards, Niebuhr, Schuppler and Zellers2018.

3 In order to reduce the influence of artefacts in the detection of acoustic syllables, all files were checked for spurious high-amplitude bursts (a noise burst from the end of one RPD file was removed during this process), and the root-mean-square intensity (RMS) of all files was equalized before running the Praat script. This step ensured equivalent application of the syllable detecting procedure, without adjustment to the default settings, to all speech files in both L1 and L2 speech and in all three languages.

References

Abercrombie, D (1967) Elements of general phonetics. Aldine: ChicagoGoogle Scholar
Ahissar, E and Ahissar, M (2005) Processing of the temporal envelope of speech. In Konig, R, Heil, P, Bundinger, E and Scheich, H (eds), The Auditory Cortex. A Synthesis of Human and Animal Research. Mahwah, NJ: Lawrence Erlbaum, pp. 295313.Google Scholar
Alexander, J and Nygaard, L (2019) Specificity and generalization in perceptual adaptation to accented speech. The Journal of the Acoustical Society of America 145, 33823398.CrossRefGoogle ScholarPubMed
Aylett, MP and Turk, A (2004) The smooth signal redundancy hypothesis: a functional explanation for relationships between redundancy, prosodic prominence and duration in spontaneous speech. Language & Speech 47, 3156.CrossRefGoogle ScholarPubMed
Aylett, MP and Turk, A (2006) Language redundancy predicts syllabic duration and the spectral characteristics of vocalic syllable nuclei. Journal of the Acoustical Society of America 119, 30483059.CrossRefGoogle ScholarPubMed
Baese-Berk, MM, Bradlow, AR and Wright, BA (2013) Accent-independent adaptation to foreign accented speech. The Journal of the Acoustical Society of America 133, EL174EL180.CrossRefGoogle ScholarPubMed
Baese-Berk, M and Morrill, T (2015) Speaking rate consistency in native and non-native speakers of English, The Journal of the Acoustical Society of America,138, EL223EL228.CrossRefGoogle ScholarPubMed
Baker, R, Baese-Berk, M, Bonnasse-Gahot, L, Kim, M, Van Engen, K and Bradlow, AR (2011) Word durations in non-native English. Journal of Phonetics 39, 117.CrossRefGoogle ScholarPubMed
Bell, A, Brenier, JM, Gregory, M, Girand, C and Jurafsky, D (2009) Predictability effects on durations of content and function words in conversational English. Journal of Memory & Language 60, 92111.CrossRefGoogle Scholar
Bond, Z, Stockmal, V and Markus, D (2003) Sentence durations and accentedness judgments. The Journal of the Acoustical Society of America 113, 2330.CrossRefGoogle Scholar
Bond, Z, Stockmal, V and Markus, D (2008) A Note on Native and Non-Native Accentedness Judgments. Ms., The Ohio State University. (Last retrieved December 2020 from https://www.researchgate.net/publication/241128402_A_Note_on_Native_and_Non-Native_Accentedness_Judgments).CrossRefGoogle Scholar
Bradlow, AR and Bent, T (2008) Perceptual adaptation to non-native speech. Cognition 106, 707729.CrossRefGoogle ScholarPubMed
Bradlow, AR (n.d.) ALLSSTAR: Archive of L1 and L2 Scripted and Spontaneous Transcripts And Recordings. Last retrieved December, 2020 from https://speechbox.linguistics.northwestern.edu/#!/?goto=allsstarGoogle Scholar
Bradlow, AR (n.d.) SpeechBox. Last retrieved December, 2020 from https://speechbox.linguistics.northwestern.edu.Google Scholar
Bradlow, AR, Blasingame, M and Kim, M (2017) Language-independent talker-specificity in first-language and second-language speech production by bilingual talkers: L1 speaking rate predicts L2 speaking rate. The Journal of the Acoustical Society of America 141, 886899.CrossRefGoogle ScholarPubMed
Brandt, E (2019) Information density and phonetic structure: Explaining segmental variability. PhD Dissertation. University of Saarland, Germany.Google Scholar
Broos, WPJ, Duyck, W and Hartsuiker, RJ (2018) Are higher level processes delayed in second language word production? Evidence from picture naming and phoneme monitoring. Language, Cognition and Neuroscience 33, 12191234.CrossRefGoogle Scholar
Clopper, CG and Turnbull, R (2018) Exploring variation in phonetic reduction: Linguistic, social, and cognitive factors. In Cangemi, F, Clayards, M, Niebuhr, O, Schuppler, B, Zellers, M (eds), Rethinking Reduction: Interdisciplinary Perspectives on Conditions, Mechanisms, and Domains for Phonetic Variation. Berlin, Germany: Walter de Gruyter GmbH & Co. KG, pp. 2572.CrossRefGoogle Scholar
Cohen Priva, U (2015) Informativity affects consonant duration and deletion rates. Laboratory Phonology 6, 243–78.CrossRefGoogle Scholar
Colantoni, L and Steele, J (2004) The University of Toronto Romance Phonetics Database. Last retrieved December, 2020 from http://r1.chass.utoronto.ca/rpd/.Google Scholar
Coupé, C, Oh, YM, Dediu, D and Pellegrino, F (2019) Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche. Science Advances 5, 110.CrossRefGoogle ScholarPubMed
Crocker, MW, Demberg, V and Teich, E (2016) Information Density and Linguistic Encoding (IDEAL). Künstl Intell 30, 7781. https://doi.org/10.1007/s13218-015-0391-y.CrossRefGoogle Scholar
Cummins, F (2012) Oscillators and syllables: a cautionary note. Frontiers in Psychology 3, 364. doi: 10.3389/fpsyg.2012.00364.CrossRefGoogle ScholarPubMed
De Jong, NH (2016) Predicting pauses in L1 and L2 speech: The effects of utterance boundaries and word frequency. International. Review of Applied Linguistics in Language Teaching 54, 113132.Google Scholar
De Jong, NH and Wempe, T (2009) Praat script to detect syllable nuclei and measure speech rate automatically. Behavioral Research Methods 41, 385390.CrossRefGoogle ScholarPubMed
De Jong, NH, Groenhout, R, Schoonen, R and Hulstijn, JH (2015) Second language fluency: Speaking style or proficiency? Correcting measures of second language fluency for first language behavior. Applied Psycholinguistics 36, 223243.CrossRefGoogle Scholar
Derwing, TM, Munro, MJ, Thomson, RI and Rossiter, MJ (2009) The relationship between L1 fluency and L2 fluency development. Studies in Second Language Acquisition 31, 533557.CrossRefGoogle Scholar
Dryer, MS and Haspelmath, M (eds.) (2013) The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info, Accessed on 2020-08-05.)Google Scholar
Fenk-Oczlon, G and Fenk, A (2010) Measuring basic tempo across languages and some implications for speech rhythm. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, Interspeech 2010, 1537–1540CrossRefGoogle Scholar
Flege, JE (1984) The detection of French accent by American listeners. The Journal of the Acoustical Society of America 76, 692707,CrossRefGoogle ScholarPubMed
Gahl, S, Yao, Y and Johnson, K (2012) Why reduce? Phonological neighborhood density and phonetic reduction in spontaneous speech. Journal of Memory and Language 66, 789806.CrossRefGoogle Scholar
Garcia Lecumberi, L, Cooke, M, & Wester, , M. (2017) A bi-directional task-based corpus of learners’ conversational speech. International Journal of Learner Corpus Research 3, 175195.CrossRefGoogle Scholar
Ghitza, O (2011) Linking speech perception and neurophysiology: speech decoding guided by cascaded oscillators locked to the input rhythm. Frontiers in Psychology 2, 130.CrossRefGoogle Scholar
Ghitza, O (2012) On the role of theta-driven syllabic parsing in decoding speech: intelligibility of speech with a manipulated modulation spectrum. Frontiers in Psychology 3, 238 .CrossRefGoogle ScholarPubMed
Ghitza, O (2013) The theta-syllable: a unit of speech information defined by cortical function. Frontiers in Psychology 3, 364.Google Scholar
Ghitza, O and Greenberg, S (2009) On the possible role of brain rhythms in speech perception: intelligibility of time-compressed speech with periodic and aperiodic insertions of silence. Phonetica 66, 113126.CrossRefGoogle ScholarPubMed
Gibson, E, Futrell, R, Piantadosi, SP, Dautriche, I, Mahowald, K, Bergen, L and Levy, R (2019) How efficiency shapes human language. Trends in cognitive sciences 23, 389407.CrossRefGoogle ScholarPubMed
Giraud, AL and Poeppel, D (2012) Cortical oscillations and speech processing: emerging computational principles and operations. Nature Neuroscience 15, 511517.CrossRefGoogle ScholarPubMed
Goedemans, R and van der Hulst, H (2013) Rhythm Types. In Dryer, MS and Haspelmath, M (eds), The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info/chapter/17, Accessed on 2020-08-04.)Google Scholar
Guion, SG, Flege, JE, Liu, SH and Yeni-Komshian, GH (2000) Age of learning effects on the duration of sentences produced in a second language. Applied Psycholinguistics 21, 205228.CrossRefGoogle Scholar
Gustafson, E and Goldrick, M (2018) The role of linguistic experience in the processing of probabilistic information in production. Language, Cognition and Neuroscience 33, 211226.CrossRefGoogle ScholarPubMed
Idemaru, K, Gubbins, LS and Wei, P (2019) Acoustic sources of accent in second language Japanese speech. Language and Speech 62, 333357.CrossRefGoogle ScholarPubMed
Jaeger, TF (2010) Redundancy and reduction: speakers manage syntactic information density. Cognitive Psychology 61, 2362.CrossRefGoogle ScholarPubMed
Jaeger, TF and Tily, H (2011) On language utility: processing complexity and communicative efficiency. Wiley Interdisciplinary Reviews Cognitive Science 2, 323335.CrossRefGoogle ScholarPubMed
Johnson, K (2004) Massive reduction in conversational American English, in Spontaneous Speech: Data and Analysis. In Yoneyama, K and Maekawa, K (eds.), Proceedings of the 1st Session of the 10th International Symposium. Tokyo, Japan: The National International Institute for Japanese Language, pp. 2954.Google Scholar
Jurafsky, D, Bell, A, Gregory, M and Raymond, WD (2001) Probabilistic relations between words: Evidence from reduction in lexical production. In Bybee, J and Hopper, P (eds), Frequency and the Emergence of Linguistic Structure. Amsterdam: Benjamins, pp. 229254.CrossRefGoogle Scholar
Kendall, T (2013) Speech rate, pause, and language variation: Studies in corpus sociophonetics. Basingstoke, UK: Palgrave Macmillan. (Last retrieved December, 2020 from http://lingtools.uoregon.edu/scripts/english_syllable_counter-102.R.)CrossRefGoogle Scholar
Krause, JC and Braida, LD (2004) Acoustic properties of naturally produced clear speech at normal speaking rates. The Journal of the Acoustical Society of America 115, 362378.CrossRefGoogle ScholarPubMed
Kroll, J and Gollan, T (2014) Speech Planning in Two Languages: What Bilinguals Tell Us about Language Production. In Goldrick, M, Ferreira, V and Miozzo, M (eds.), The Oxford Handbook of Language Production, pp. 165181. Oxford University Press.Google Scholar
Levy, R (2008) Expectation-based syntactic comprehension. Cognition 106, 11261177.CrossRefGoogle ScholarPubMed
Li, VG, Li, X, Luo, J and Mok, P (2018) L1 and L2 phonetic reduction in quiet and noisy environments. In Proceedings of the 9th International Conference on Speech Prosody. Poznań, Poland, pp. 848852.Google Scholar
Lupyan, G and Dale, R (2010) Language Structure Is Partly Determined by Social Structure. PLoS ONE 5, e8559.CrossRefGoogle ScholarPubMed
Maddieson, I (2013a) Consonant Inventories. In Dryer, MS and Haspelmath, M (eds), The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info/chapter/1, Accessed on 2020-08-04.)Google Scholar
Maddieson, I (2013b) Consonant-Vowel Ratio. In Dryer, MS and Haspelmath, M (eds), The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info/chapter/1, accessed on 2020-08-04.)Google Scholar
Maddieson, I (2013c) Syllable Structure. In Dryer, MS and Haspelmath, M (eds), The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info/chapter/1, Accessed on 2020-08-04.)Google Scholar
Maddieson, I (2013d) Vowel Quality Inventories. In Dryer, MS and Haspelmath, M (eds), The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info/chapter/1, Accessed on 2020-08-04.)Google Scholar
Major, RC (2007) Identifying a foreign accent in an unfamiliar language. Studies in Second Language Acquisition 29, 539556.CrossRefGoogle Scholar
Matzinger, T, Ritt, N and Fitch, WT (2020) Non-native speaker pause patterns closely correspond to those of native speakers at different speech rates. PLoS ONE 15, e0230710.CrossRefGoogle ScholarPubMed
Munro, MJ, Derwing, TM and Burgess, CS (2003) The detection of foreign accent in backwards speech. In Solé, MJ, Recasens, D and Romero, J (eds), Proceedings of the 15th International Congress of Phonetic Sciences, pp. 535538. Barcelona, Spain: Futurgraphic.Google Scholar
Munro, MJ, Derwing, TM and Burgess, CS (2010) Detection of nonnative speaker status from content-masked speech. Speech Communication 52, 626637.CrossRefGoogle Scholar
Oh, S and Lee, Y (2016) Repeated mention reduction in L2 English spontaneous speech. In Proceedings of the 7th International Conference on Speech Prosody, pp. 961964.CrossRefGoogle Scholar
Peelle, JE and Davis, MH (2012) Neural oscillations carry speech rhythm through to comprehension. Frontiers in Language Science 3, 320.Google ScholarPubMed
Pellegrino, F, Coupé, C and Marsico, E (2011) Across-language perspective on speech information rate Language 87, 539558.CrossRefGoogle Scholar
Pike, KL (1945) The intonation of American English. University Press: MichiganGoogle Scholar
Poeppel, D (2003) The analysis of speech in different temporal integration windows: cerebral lateralization as &lsquo;asymmetric sampling in time.&rsquo; Speech Communication 41, 245255.CrossRefGoogle Scholar
Rallo Fabra, L (2015) Can Nonnative Speakers Reduce English Vowels in a Native-Like Fashion? Evidence from L1-Spanish L2-English Bilinguals. Phonetica 72:162–81.CrossRefGoogle Scholar
Runnqvist, E, Strijkers, K and Costa, A (2014) Bilingual Word Access. In Goldrick, M, Ferreira, V and Miozzo, M (eds), The Oxford Handbook of Language Production. Oxford University Press, pp. 182198.Google Scholar
Schertz, J and Ernestus, M (2014) Variability in the pronunciation of non-native English the: Effects of frequency and disfluencies. Corpus Linguistics and Linguistic Theory 10, 329345.CrossRefGoogle Scholar
Schweinberger, SR and Zaske, R (2018) Perceiving speaker identity from the voice. In Fruhholz, S and Belin, P (eds), The Oxford Handbook of Voice Perception. Oxford: Oxford University Press.Google Scholar
Sidaras, SK, Alexander, JE and Nygaard, LC (2009) Perceptual learning of systematic variation in Spanish-accented speech. The Journal of the Acoustical Society of America 125, 33063316.CrossRefGoogle ScholarPubMed
Smiljanic, R and Bradlow, AR (2009) Speaking and hearing clearly: Talker and listener factors in speaking style changes. Linguistics and Language Compass 3, 236264.CrossRefGoogle ScholarPubMed
Smiljanic, R (2021) Clear speech perception. In Pardo, J, Nygaard, L, Remez, R and Pisoni, DB (eds), The Handbook of Speech Perception, Second Edition. John Wiley & Sons, Inc.Google Scholar
Spilková, H (2014) Phonetic reduction in spontaneous speech: an investigation of native and non-native production. PhD Dissertation. Norwegian University of Science and Technology.Google Scholar
The VersantTM English Test (n.d.) Automatic evaluation of the spoken English skills of non-native English speakers. Menlo Park, CA: Pearson Education, Inc.Google Scholar
Towell, R and Dewaele, JM (2005) The role of psycholinguistic factors in the development of fluency amongst advanced learners of French. In Dewaele, JM (ed.), Focus on French as a Foreign Language. Clevedon, UK: Multilingual Matters, pp. 210239.CrossRefGoogle Scholar
Trouvain, J, Fauth, C and Möbius, B (2016) Breath and non-breath pauses in fluent and disfluent phases of German and French L1 and L2 read speech, Proceedings of the 7th International Conference on Speech Prosody, pp. 3135.Google Scholar
Turk, A (2010) Does prosodic constituency signal relative predictability? A Smooth Signal Redundancy hypothesis. Journal of Laboratory Phonology 1, 227262.Google Scholar
Tzeng, CY, Alexander, JE, Sidaras, SK and Nygaard, LC (2016) The role of training structure in perceptual learning of accented speech. Journal of Experimental Psychology: Human Perception and Performance 42, 1793.Google ScholarPubMed
Vaughn, CR and Bradlow, AR (2017) Processing relationships between language-being-spoken and other speech dimensions in monolingual and bilingual listeners. Language and Speech 60, 530561.CrossRefGoogle ScholarPubMed
Xie, X and Myers, EB (2017) Learning a talker or learning an accent: Acoustic similarity constrains generalization of foreign accent adaptation to new talkers. Journal of Memory and Language 97, 3046.CrossRefGoogle ScholarPubMed
Xie, X, Weatherholtz, K, Bainton, L, Rowe, E, Burchill, Z, Liu, L and Jaeger, TF (2018) Rapid adaptation to foreign-accented speech and its transfer to an unfamiliar talker. The Journal of the Acoustical Society of America 143, 20132031.CrossRefGoogle Scholar
Figure 0

Table 1. Overview of recordings. See text for detailed explanations.

Figure 1

Table 2. Overview of variables. See text for detailed explanations.

Figure 2

Table 3. Number of sentences, words, and orthographic (i.e., phonological) syllables in the NWS passages and DHR sentences.

Figure 3

Table 4. Articulation rate (AR), acoustic syllable duration, number of acoustic syllables, information density (ID), information rate (IR), and acoustic syllable reduction (LOSS) by talker group (L1 versus L2) and by recording text (English DHR, English NWS, French NWS, and Spanish NWS). Data shown are means with standard error of the mean in parentheses. See text for additional explanation for each variable.

Figure 4

Fig. 1. Density plots of articulation rate (AR), information density (ID), information rate (IR) and syllable reduction (LOSS) for the L1 and L2 groups within each recording text (English DHR, English NWS, French NWS, and Spanish NWS). All data are shown on z-transformed scales within their own distributions.

Figure 5

Table 5. Summary of comparisons between models with and without the Group-by-Text interactive term. In all cases the interactive model was a significantly better fit (lower AIC) than the additive model.

Figure 6

Table 6. Summaries of the best fit models with the Group by Text interactive terms. The referent category in all models is ENG_DHR and L1 for the Text and L1 factors, respectively.

Figure 7

Table 7. Pairwise comparisons of the estimated means between L1 and L2 within each text.

Figure 8

Fig. 2. Density plots of articulation rate (AR), information density (ID), information rate (IR) and syllable reduction (LOSS) by proficiency group (L2 Intermediate, L2 Near-L1/Advanced, and L1) aggregated across texts and languages. All data are shown on z-transformed scales within their own distributions.