Highlights
• Two hypotheses are at play in speech intelligibility against different languages.
• Recognition of L2 targets (but not L1 targets) can disentangle the two hypotheses.
• All bilinguals had more difficulty ignoring an L1 masker than an L2 or foreign masker.
• Informational masking is particularly strong with a native masker.
1. Introduction
Imagine you are attending a crowded cocktail party, trying to hear what a friend is saying over the noise in the room. To make matters even more challenging, you happen to be at a scientific conference in Montreal with international attendees speaking a variety of different languages. In this situation, would you have more trouble understanding your friend if they spoke your first or your second language? Would it matter what language the other guests were speaking? There is considerable literature on the cocktail party problem (Cherry, 1953), and most of this research is concerned with the mechanisms underlying speech recognition, spatial hearing, auditory masking and source segregation, among other factors (McDermott, 2009). To conduct such studies, experimenters primarily recruit listeners who are first-language speakers of the materials used in the test and tend to test monolingual listeners, disregarding other language(s) they may have been exposed to (Linck, Osthus, Koeth & Bunting, 2014; Melby-Lervåg & Lervåg, 2014; Yow & Li, 2015). While this approach can be useful in reducing variability between participants to delve into psychoacoustics and the mechanisms of speech processing, these studies do not address the experiences of the estimated half of the global population who speak two or more languages (Grosjean, 2012). Research that has included bilingual participants has often compared them to monolinguals (e.g., Cooke et al., 2008; Broersma & Scharenborg, 2010; Lecumberri et al., 2010; Calandruccio & Zhou, 2014; Bidelman & Dexter, 2015) and rarely explored differences amongst bilinguals in their speech perception performance (Luk, 2015; de Bruin, 2019; DeLuca, Rothman, Bialystok & Pliatsikas, 2019; Kim et al., 2019). Yet, these individual differences (e.g., whether they depend more on speaking/listening proficiency, speaking/listening use or age of acquisition) could provide insight into the top-down processes that help deal with cocktail party situations (Bregman, 1990). More precisely, they provide a valuable opportunity to rethink the linguistic aspect of informational masking.
1.1. Energetic and informational masking
In situations where multiple auditory signals are present, two types of phenomena can interfere with a listener's ability to detect and process the target signal: energetic masking and informational masking. Energetic masking is relatively well defined (Culling & Stone, 2017): it occurs when the target and masking stimuli share similar acoustic content, for example, when two people talk at the same time (temporal similarity) or produce similar spectral content (frequency similarity). Informational masking, on the other hand, has no positive, accepted definition. It is often defined negatively, as whatever difficulty remains in understanding a masked speaker after energetic masking has been accounted for (Kidd, Mason, Richards, Gallun & Durlach, 2008; Bronkhorst, 2015).
To illustrate, it is possible to generate artificial stimuli that approach the spectro-temporal content of masking voices, and thus exert a similar level of energetic masking, while being largely devoid of linguistic units (Hawley, Litovsky & Culling, 2004; Deroche & Culling, 2013; Leclère et al., 2017). When instructed to attend to a target speaker, listeners tend to find such artificial maskers easier to ignore than the voices on which they were modelled. The presence of distracting linguistic units tends to capture listeners' attention and adds a level of cognitive difficulty due to the automatic processing that occurs upon hearing speech. This is (at least partly) what we refer to as informational masking in a cocktail party environment. Though it is possible to uncover the presence of informational masking with attentional tasks devoid of energetic masking (e.g., random-frequency multitone bursts; Oxenham, Fligor, Mason & Kidd, 2003), in ecological speech-on-speech situations there is always a blurry line between the energetic and informational components (Brungart, 2001; Brungart, Simpson, Ericson & Scott, 2001; Kidd, Mason & Gallun, 2005). As a result, when a cue allows listeners to segregate the target from the masker perceptually (e.g., a difference in voice pitch or spatial position) and thereby obtain better performance (i.e., masking release), the energetic and informational components of that release are often difficult to disentangle (e.g., Deroche et al., 2017a). The language of competing speakers is yet another cue involving both energetic and informational masking, but we have little understanding of their respective contributions.
1.2. The masking of competing languages
Let us consider for a moment the sort of masking that occurs specifically because a target and an interfering speaker speak the same language. On the energetic side, sounds that belong to the phonetics of a particular language would likely mask each other more than sounds that belong to another language. For example, English vowels would be more likely to occupy certain spectral regions (despite talker variability), and English syllables more likely to possess the sort of envelope modulations common to other English syllables (this is what makes the rhythm of a given language so unique). Classically, this phenomenon is demonstrated in studies of monolingual participants, who are better able to ignore a masker speaking a foreign language than one speaking their first language (L1) (Van Engen & Bradlow, 2007; Calandruccio, Dhar & Bradlow, 2010). However, note that the energetic account relies only on the fact that the masker language is different (not foreign), such that the lower target-to-masker acoustic similarity provides a substantial masking release. An informational masking account is also at play in this scenario, and it can be framed in two ways. The first is also based on target-to-masker similarity, but at a linguistic (or more cognitive) level. To illustrate, a listener attending to the sentence "the dog eats a bone" would be more distracted by a competing word taken from the same language or lexical field (e.g., the English word "chocolate") than a word in a different language or lexical field (e.g., the French word "chocolat"), despite both words approaching a similar spectro-temporal content. The second perspective is that a distracting voice conversing in the listener's L1 might be particularly difficult to ignore because the brain cannot help but process linguistic units that are familiar (and often native) to the listener. This idea is generally referred to as the listener's language familiarity hypothesis. Notably, with monolingual participants, or with a task restricted to L1 targets, the two hypotheses make similar predictions: a foreign masker would result in lower target-to-masker similarity and would capture the listener's native language system less efficiently. With targets speaking in a second language (L2), however, the two hypotheses make different predictions: an L2 masker would result in higher target-to-masker similarity than an L1 masker (and this is true at both energetic and informational levels) while capturing the listener's native language system less efficiently than an L1 masker. Of course, being able to even conduct a speech recognition task with L2 targets requires participants to have a minimum level of L2 proficiency, and this is where studies with bilinguals allow testing these competing hypotheses.
1.3. Bilingual studies
Several studies have focussed on monolingual–bilingual comparisons. For example, English monolinguals and Mandarin-English bilinguals (L1 Mandarin, who had demonstrated lower proficiency in English than the monolinguals) were tested on their recognition of English targets against English or Mandarin maskers (Van Engen, 2010). Mandarin maskers were less difficult for both groups (partly due to less energetic masking), but particularly so for the English monolingual listeners. However, it is unclear whether the bilinguals' knowledge of both masker languages (English and Mandarin) or their lower proficiency in the target language (English) drove this difference. In a similar study, English monolinguals and Greek-English simultaneous bilinguals (whose proficiency in English was close to that of the English monolinguals) were tested on their recognition of English targets against English or Greek maskers (Calandruccio & Zhou, 2014). Again, Greek maskers were less difficult overall, consistent with the energetic masking account, but no interaction between group and masker language was found. A few more studies (Kilman, Zekveld, Hällgren & Rönnberg, 2014; Brouwer, Van Engen, Calandruccio & Bradlow, 2012 – specifically experiment 2; and Mepham, Bi & Mattys, 2022 – specifically experiment 3) have more directly tested the similarity versus familiarity hypotheses: in a nutshell (but see the General Discussion for a fuller description), their results emphasised the primary importance of target-to-masker similarity while spotting signs that the listener's language background played a role.
One limitation of studies to date is that none used a full factorial design, in which bilingual participants are tested with L1 and L2 targets against L1, L2 and foreign-language (Lf) maskers. Examples of such 'missing' conditions include bilingual listeners with English dominance performing the task with Mandarin targets in Van Engen's study, Swedish targets in Kilman et al.'s study, Dutch targets in Brouwer et al.'s study and Mandarin targets in Mepham et al.'s study. Without such conditions, it is not possible to fully disentangle the two accounts. Furthermore, there are always (as in all the studies mentioned above) non-negligible differences in energetic masking between materials spoken in two different languages. This is a notoriously difficult problem to solve, but one way to approach it is to use mirror populations to counterbalance these differences. In this study, we addressed this shortcoming by using a full factorial design of L1 or L2 targets versus L1, L2 and Lf maskers (a completely foreign language, Tamil), in addition to speech-shaped noise maskers. The noise maskers were shaped in the long-term spectrum of each language, respectively, which provided an additional check on differences primarily driven by energetic masking (a sketch of one common way to generate such noise is given below). We tested these conditions in mirror populations: namely, French-L1 or English-L1 bilingual listeners with varying degrees of proficiency in, respectively, English or French as their L2. This mirror design was key in allowing us to define L1, L2 or Lf relative to the listener's profile, and not relative to the languages' identity.
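For readers who want a concrete picture of what "noise shaped in the long-term spectrum of each language" means, the following is a minimal Python sketch of one common recipe (phase randomisation). The paper does not specify its exact procedure, so this should be read as an illustration under our assumptions, not the authors' implementation; all names are ours.

```python
import numpy as np

def speech_shaped_noise(speech, n_samples, rng=None):
    """Noise with the long-term magnitude spectrum of `speech`.

    A common phase-randomisation recipe, given only as an illustration;
    `speech` is a 1-D array of concatenated sentences from one language.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Long-term magnitude spectrum, evaluated on an FFT grid
    # matching the desired output length.
    mag = np.abs(np.fft.rfft(speech, n=n_samples))
    # Keep the magnitudes, randomise the phases, invert.
    phase = rng.uniform(0.0, 2.0 * np.pi, size=mag.shape)
    noise = np.fft.irfft(mag * np.exp(1j * phase), n=n_samples)
    # Match the RMS level of the source speech.
    return noise * np.sqrt(np.mean(speech ** 2) / np.mean(noise ** 2))
```

An alternative would be to filter white noise through an FIR approximation of the long-term average speech spectrum; either way, the noise carries the spectral signature of the language without any linguistic content.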
1.4. Hypothesis
This investigation (primarily focussed on how the masker language induces interference in speech recognition) allowed us to disentangle two competing hypotheses: the target-to-masker similarity hypothesis and the language familiarity hypothesis. The target-to-masker similarity hypothesis predicts that participants would perform worse when the target and masker are in the same language than when they are in distinct languages. Explicitly, situations of L1 target vs L1 masker or L2 target vs L2 masker should be harder than situations of L1 target vs L2 masker or L2 target vs L1 masker, respectively. Importantly, performance with an Lf masker should be similar to performance with an L2 masker, as neither has substantial energetic or informational overlap with L1. The language familiarity hypothesis predicts that participants would perform worse whenever the masker speaks a language familiar to the listener, regardless of the target language. Explicitly, situations of L1 target vs L1 masker or L2 target vs L1 masker should be harder than situations of L1 target vs L2 masker or L2 target vs L2 masker, respectively, themselves harder than situations of L1 target vs Lf masker or L2 target vs Lf masker. By measuring the speech reception threshold (SRT) under all these experimental conditions, and relative to the listener's language background, we could test which pattern of predictions held.
2. Method
2.1. Participants
A total of 200 French-English bilingual participants were recruited through the Prolific platform. All participants spoke either English or French as their L1 and the other language as their L2, and were between 18 and 50 years of age. A total of 72 participants were excluded for technical issues, incomplete data, inability to complete the study in L2, or not following the instructions. This resulted in 57 participants in the L1 ENGLISH group (40 women and 17 men) and 71 participants in the L1 FRENCH group (34 women, 36 men, 1 not reported). The two groups were matched in student status (43.4% students; χ2(2) = 0.2, p = .916) and employment status (53.1% employed; χ2(2) = 1.5, p = .462). Note that some participants were missing data for their student and employment status, leading to a third level (unknown status) in these two analyses.
The L1 ENGLISH and L1 FRENCH groups differed in country of residence (χ2(3) = 97.0, p < .001), which was expected as the countries from which participants were recruited have different official languages (English in the UK and USA, French in France, and both English and French in Canada). Similar statistics were found for country of birth (χ2(11) = 98.1, p < .001) and nationality (χ2(3) = 99.1, p < .001). Unintentionally, participants in the two groups differed in sex distribution (χ2(1) = 6.0, p = .014), as well as chronological age (t(126) = 3.0, p = .003). The L1 ENGLISH group had a majority (70%) of female participants and was on average (std) 31.7 (9.5) years old, while the L1 FRENCH group was more balanced in sex (49% female) and a little younger, on average (std) 27.3 (6.9) years old. This likely had a negligible effect: whether for noise maskers (section 4.1) or speech maskers (section 4.2), sex did not interact with the factors of interest (group, target language or masker language, all p-values ≥ .085). As for chronological age, it is well known that performance in speech perception tasks can degrade with age (e.g., Murphy, Daneman & Schneider, 2006; Schneider, Daneman & Pichora-Fuller, 2002; Schneider, Li & Daneman, 2007), but these effects are not expected until later in life (often >60 years of age; e.g., Schneider, Speranza & Pichora-Fuller, 1998; Bilodeau-Mercure, Lortie, Sato, Guitton & Tremblay, 2015). Curiously, the two groups also differed in the amount of time they took to complete the study (t(126) = −2.1, p = .037). Participants in the L1 ENGLISH group took a mean (std) of 73.7 (21.4) minutes to complete the study, while the L1 FRENCH group took a mean (std) of 82.1 (22.9) minutes. This observation is presumably not very important: people generally wrote more words in their L1 than their L2, and written French is known to be longer than written English (Durieux, 1990).
Most importantly, the two groups differed (as intended) in their language background. This was assessed very simply by asking for the age of acquisition (AOA), the listening and speaking proficiency (each a number from 0 to 10), and the frequency of listening and speaking use (each a number from 0 to 10). This was done for L1 and L2 (either French or English), as all participants identified themselves as French-English bilinguals with varying degrees of fluency. Of the participants, 17.2% spoke three or four languages. Additional languages were Bulgarian, Chinese, German, Italian, Luxembourgish, Moroccan Arabic, Russian, Spanish and Welsh, with an average (std) listening proficiency of 6.3 (2.7) and an average (std) speaking proficiency of 5.1 (2.4). These L3 and L4 data were ignored in this study. A mixed analysis of variance (ANOVA) was conducted for each bilingualism metric with one between-subjects factor (group) and one within-subjects factor (language: L1 or L2). The main effects and interactions are reported in Table 1 (along with post-hoc tests probing significant interactions at each level). The main effect of language was always significant, confirming that all participants acquired their L1 much earlier than their L2 (0.9 [SD = 2.1] vs 9.3 [SD = 4.4] years old) and had better listening proficiency (9.9 [SD = 0.4] vs 7.9 [SD = 1.3]), speaking proficiency (9.9 [SD = 0.4] vs 7.3 [SD = 1.6]), listening use (9.8 [SD = 0.8] vs 6.3 [SD = 2.7]) and speaking use (9.8 [SD = 0.9] vs 5.2 [SD = 2.9]) in their L1 compared to their L2. None of this is surprising, but it gives a sense of the imbalance of this sample of French-English bilinguals between their two languages. Less expected were the main effect of group and its interaction with language: both were significant for every variable except AOA. Post-hoc pairwise comparisons between the two groups were never significant in L1 (i.e., the fluency of the L1 FRENCH group in French was comparable to that of the L1 ENGLISH group in English) but were always significant in L2, namely that the L1 FRENCH group had better proficiency and more frequent use in English than the L1 ENGLISH group in French (mean difference (MD) for listening proficiency = 0.9, MD for speaking proficiency = 0.7, MD for listening use = 2.4, MD for speaking use = 1.7). This was unintended (but did not seem to have much impact – see section 4.1) and may reflect the higher global mastery of English compared to French.
We expected these bilingualism measures to be highly correlated with one another. This was the case among all L2 proficiency and use variables (all p < .001, all R2 ≥ .195), but none of them correlated with L2 AOA (all p ≥ .298, all R2 ≤ .01).
2.2. Stimuli
The English target stimuli were sourced from the Institute of Electrical and Electronics Engineers' recommended practice for speech quality measurements, often termed the Harvard sentences (IEEE, 1969). This corpus of 720 phonetically balanced, standardised English sentences was originally created to test audio quality in telephone systems but has since been widely used in psychoacoustic research. The speaker of these target stimuli was a North American male. The French target stimuli were a phonetically balanced translation of the Harvard sentences, termed the Fharvard corpus (Aubanel, Bayard, Strauß & Schwartz, 2020; openly available), produced by a French male speaker (a different individual from the North American speaker). In both corpora, the stimuli were trimmed to leave roughly 150 ms of silence before onset (and 300 ms after offset) so that targets started at roughly the same time across trials. Lists contained ten sentences each and were arranged on the basis of sentence duration so that targets were always shorter than the corresponding 2-sentence maskers.
In contrast to the target stimuli, the masker stimuli were created for the purpose of this study. All English, French and Tamil masking stimuli were recorded by a single trilingual woman, who acquired all three languages roughly simultaneously, to keep the speaking characteristics of the masker relatively constant. Her sex was chosen to lessen, to some degree, the energetic masking between the target and masking voices: as competing speech tasks are already very challenging in one's first language, let alone one's second language, we wanted to provide salient cues to direct attention to the correct voice. She first translated all English transcripts into Tamil and then recorded herself with the iPhone Voice Memo application, using the internal microphone and holding the iPhone 10–15 cm from her mouth, in a quiet room in her home. She read each sentence from a script in her natural speaking voice, with 2 seconds between productions. Recordings were broken down into 8 lists of 10 sentences, and she was instructed to leave 1 minute of silence at the start of each recording, which was subsequently used to filter out any background noise with a spectral subtraction method (Boll, 1979), conducted in Audacity version 2.1.1 (https://www.audacityteam.org/). Audio files were cut in Audacity to remove disfluencies and extended pauses. The most fluent and natural productions were then selected, ensuring that (1) they did not belong to any of the target lists and (2) they contained few pauses between syllables (ideally continuously voiced). To create a masker list (of 10 maskers, each consisting of 2 simultaneous sentences spoken in the same language), five sentences were selected and summed in pairs in all combinations. We manually shifted the timing of each sentence in a pair to optimise the pseudo-stationarity of the combined waveform, leaving relatively few temporal dips where listeners could glimpse target words (Collin & Lavandier, 2013; Leclère, Lavandier & Deroche, 2017). All maskers were finally root-mean-square equalised at the same level as the targets (i.e., a target was as intense as a 2-voice masker), as sketched below.
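As an illustration of the final levelling step, here is a minimal Python sketch of combining two masker sentences and equalising the pair to a target's RMS level. This is our paraphrase, not the authors' code; file names are hypothetical and mono recordings are assumed.

```python
import numpy as np
import soundfile as sf

# Hypothetical file names, for illustration only (mono assumed).
target, fs = sf.read("target_sentence.wav")
s1, _ = sf.read("masker_sentence_1.wav")
s2, _ = sf.read("masker_sentence_2.wav")

# Sum the pair (any manual time shift to fill temporal dips would be
# applied before this step), zero-padding the shorter sentence.
n = max(len(s1), len(s2))
masker = np.pad(s1, (0, n - len(s1))) + np.pad(s2, (0, n - len(s2)))

# Root-mean-square equalisation: scale the 2-voice masker so that it
# is as intense as the target, as described above.
rms = lambda x: np.sqrt(np.mean(x ** 2))
masker *= rms(target) / rms(masker)
sf.write("masker_pair.wav", masker, fs)
```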
2.3. Design and protocol
The competing speech task consisted of 20 blocks per participant, with 10 trials per block. Each participant began the study with two practice blocks: the first with English target sentences masked by English sentences and the second with French target sentences masked by Tamil sentences. None of the materials in the practice blocks were used in the rest of the study. Transcripts of the two masking sentences were displayed on the screen, both during the practice blocks and the test blocks, to help participants identify which voices not to listen to (the experimental interface is illustrated in Figure 1). Listeners were instructed to ignore the sentences depicted on the screen and to listen instead to the third sentence (a relatively common practice – see, e.g., Hawley, Litovsky & Culling, 2004).
The first trial of each block started at a target-to-masker ratio (TMR) of −16 dB, that is, with the target sentence much quieter than the two maskers. Participants were allowed to repeat the first trial as many times as necessary, with each repetition increasing the target level by 4 dB while the combined masker level was fixed. Participants were instructed to move on to the next trial once they were able to hear about half of the target sentence. At the end of each trial, participants were asked to type as much of the target sentence as they could. They were then presented with the correct transcript and asked to self-score the number of keywords they had correctly typed (see Supplementary Material 1 for a detailed analysis of the self-scoring accuracy). Each target sentence contained five keywords, written in capital letters. If the listener identified three or more keywords correctly, the target level decreased by 2 dB, making the next trial more difficult. If the listener identified two or fewer keywords correctly, the target level increased by 2 dB, making the next trial easier. At the end of each block, this 1-up/1-down adaptive threshold method (Plomp & Mimpen, 1979) provided one value, calculated as the mean TMR over the last eight trials, assumed to bracket the TMR required to achieve 50% intelligibility. This SRT value serves as the dependent variable in our experiment.
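The adaptive logic can be summarised in a few lines of code. The sketch below is our paraphrase of the procedure, not the experiment's actual implementation; `present_trial` is a hypothetical callback that plays a trial at a given TMR and returns the self-scored number of keywords (out of five).

```python
def run_block(present_trial, n_trials=10, start_tmr=-16.0):
    """1-up/1-down adaptive track yielding one SRT estimate per block."""
    tmr = start_tmr
    # First trial ramp: the listener could repeat it, each repetition
    # raising the target by 4 dB, until about half the sentence was
    # audible. We approximate that participant-controlled ramp here by
    # repeating until at least 3 of 5 keywords are reported.
    while present_trial(tmr) < 3:
        tmr += 4.0
    track = []
    for _ in range(n_trials):
        track.append(tmr)
        correct = present_trial(tmr)
        # 3+ keywords correct: harder (target down 2 dB); else easier.
        tmr += -2.0 if correct >= 3 else 2.0
    # SRT estimate: mean TMR over the last eight trials, assumed to
    # bracket the level needed for 50% intelligibility.
    return sum(track[-8:]) / 8.0
```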
After completing two practice blocks, participants completed 12 blocks measuring two SRTs for each of the six speech-in-speech conditions (two target languages by three masker languages). While each of the target sentences was presented to every listener in the same order, the order of the masking conditions was rotated for successive listeners, to counterbalance effects of order and material. They then completed six blocks measuring three SRTs for each of the two target languages against speech-shaped noise, where no transcript was displayed on the screen. Once again, these six blocks were counterbalanced.
2.4. Equipment
Because the experiment was delivered online during the COVID-19 pandemic, we were unable to control the audio quality presented to each participant. Instead, we asked participants to report whether they were listening through earbuds, headphones, loudspeakers or the default output of their computer. The two groups differed in the type of audio output (χ2(3) = 12.3, p = .006). In the L1 ENGLISH group, the most common audio output was the default output of the computer (36.8% of the group), while it was headphones (52.9%) in the L1 FRENCH group. This difference was unfortunate but likely negligible, since the two groups did not differ from one another in their SRTs against either noise or speech maskers (see Results). We also asked participants to rate their audio quality on a scale of 1–5, where 1 was "poor" and 5 was "excellent". The two groups did not differ in these subjective ratings (χ2(2) = 2.9, p = .232). We found no impact of audio quality on SRT performance with either noise or speech maskers (all p-values ≥ .252). Participants were instructed to set the volume of their output to a comfortable level during the practice blocks at the beginning of the task and not to touch the volume afterwards. All stimuli were presented at a sampling frequency of 44.1 kHz with 32-bit resolution. All participants provided informed consent online in accordance with the Institutional Review Board at Concordia University (ref: 30013650) and were compensated £7.50 for completing the study, or £3.75 in the case of withdrawal.
3. Data analysis
3.1. Speech-in-noise conditions
The effect of target language was first examined from the SRTs collected against speech-shaped noise maskers. A linear mixed-effects (LME) model was fitted to the DV (SRT in noise) with two fixed factors: group (L1 ENGLISH and L1 FRENCH) and target language (L1 and L2). We included random intercepts and slopes by participants and by lists. Each main effect and each interaction was tested by likelihood ratio tests, progressively adding fixed terms up to the final formula: DV ~ target*group + (1 + target | participant) + (1 + target | list).
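To make the likelihood-ratio testing concrete, below is a Python sketch using statsmodels. Since statsmodels' MixedLM handles a single grouping factor, this approximates the paper's crossed random-effects structure (participants and lists) with by-participant terms only; the full crossed model was presumably fitted with dedicated mixed-model software such as lme4 in R. Column and file names are assumptions. The speech-in-speech model of section 3.2 extends the fixed part to target*masker*group in the same way.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# One row per block, with columns: srt, target (L1/L2), group,
# participant, list. The file name is hypothetical.
df = pd.read_csv("srt_noise.csv")

# Reduced model (main effects only) vs full model (with interaction),
# both fitted by maximum likelihood so that log-likelihoods are
# comparable. By-list random terms are omitted in this sketch.
reduced = smf.mixedlm("srt ~ target + group", df, groups="participant",
                      re_formula="~target").fit(reml=False)
full = smf.mixedlm("srt ~ target * group", df, groups="participant",
                   re_formula="~target").fit(reml=False)

# Likelihood-ratio test for the target-by-group interaction (1 df).
lr = 2.0 * (full.llf - reduced.llf)
print(f"chi2(1) = {lr:.1f}, p = {stats.chi2.sf(lr, df=1):.3f}")
```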
3.2. Speech-in-speech conditions
An LME model was fitted to the SRTs obtained across the six speech-in-speech conditions, with group (L1 ENGLISH and L1 FRENCH), target language (L1 and L2) and masker language (L1, L2 and Lf) as fixed factors. We considered similar random terms as earlier, namely random intercepts and slopes (for the effect of target language) by participants and by lists. Furthermore, we also considered by-participant random slopes for the effect of masker (which improved the final model slightly further), while the model complexity could not support by-list random slopes for the effect of masker. Each main effect and each interaction was tested by likelihood ratio tests, progressively adding fixed terms up to the final formula: DV ~ target*masker*group + (1 + target+masker | participant) + (1 + target | list).
4. Results
4.1. Speech-in-noise conditions
The LME analysis (whose final output is shown in Supplementary Material 2) revealed a main effect of target language (χ2(1) = 26.4, p < .001), reflecting that SRTs were estimated at 11.1 dB lower when listening to L1 rather than L2 (as illustrated in the left-hand sides of both panels in Figure 2). There was no main effect of group (χ2(1) = 1.0, p = .311) and no interaction (χ2(1) = 0.1, p = .749). Participants performed better with L1 targets than with L2 targets, and this pattern was found equally in both groups. Given that participants in the L1 FRENCH group reported being more fluent in English than the L1 ENGLISH group was in French (section 2.1), one might have suspected a smaller L1 vs L2 SRT difference in the L1 FRENCH than in the L1 ENGLISH group (i.e., an interaction), but this was not the case. This was rather a confirmation that the two groups were relatively good mirror images in this task.
4.2. Speech-in-speech conditions
The LME analysis (whose final output is shown in Supplementary Material 2) confirmed the main effect of target language as above (χ2(1) = 48.5, p < .001), but the size of the effect was slightly reduced: SRTs were 8.7 dB lower when listening to L1 rather than L2 (Figure 2). There was also a main effect of masker language (χ2(2) = 23.6, p < .001), a key result: SRTs were respectively 0.7 and 2.3 dB lower with an L2 and an Lf masker than with an L1 masker. Importantly, this masker effect did not interact with target language (χ2(2) = 0.3, p = .846). To our knowledge, this has never been demonstrated before. There was no main effect of group (χ2(1) = 0.6, p = .426), and group did not interact with target (χ2(1) = 0.7, p = .402) or masker (χ2(2) = 1.7, p = .437), nor was there a 3-way interaction (χ2(2) = 0.3, p = .882). To summarise, participants found the task easier when attempting to listen to sentences spoken in their L1 rather than their L2, and this was true for both groups of participants (just as it was in background noise). On the other hand, participants found it most challenging to ignore the female voices speaking their L1, and least challenging when they spoke a completely foreign language. Critically, this pattern was similar whether the male target spoke in the participants' L1 or L2, and whether it was French or English, supporting the hypothesis of the listener's familiarity with the masker language.
5. General discussion
5.1. Key finding
In this study, we used bilinguals to address a research question that had only partially been answered regarding the role of the listener's language experience in cocktail party situations. Traditional descriptions of informational masking frame it in terms of target-to-masker similarity. For example, the word 'dog' spoken by a masking speaker could easily interfere with the target sentence "cats and pigs are selfish creatures" because of semantic similarity, or with the target sentence "the woodcutter searches for the missing log" because of phonetic similarity. This would occur irrespective of the additional energetic masking the word 'dog' may create at a given location on the target's spectrogram. Although not explicitly stated, this conceptualisation of informational masking disregards the listener altogether. It is supposed to make no difference whether the listener is a lumberjack or a veterinarian, or whether they spent the last year exploring the woods or dog-sitting. However, these experiential factors are likely to act like priming, that is, pushing the listener to process speech in a certain manner and cueing them to guess a word (which could be highly degraded, embedded in noise or not even spoken yet). From this perspective, one would ideally want to redefine informational masking relative to the listener's mind, not only relative to the similarity between the materials at play. This is the key message of this article, and we call for this redefinition because we showed that listeners experience the most difficulty in speech recognition with an L1 masker, not just with a masker that shares the same language as the target. That being said, we do not mean that target-to-masker similarity is irrelevant; it certainly plays a role in cocktail party situations, whether the similarity is phonetic or semantic in nature. In the current study, the two accounts were pitted against each other in a way that led to different predictions, but in real life they are not mutually exclusive and certainly act together in situations with L1 targets.
5.2. Non-native maskers are weak
All participants exhibited the weakest interference with foreign-language maskers. Our estimate of a 2.3-dB difference in performance between L1 and Lf maskers is in reasonably good agreement with previous reports in monolingual samples. For example, it is comparable to Rhebergen et al. (2005), who reported a 3.0-dB difference, and Calandruccio et al. (2016), who reported a 2.8-dB difference in adults and a 3.0-dB difference in children for SRTs against L1 or Lf maskers. To compare children to adults, the latter study used sentences from the Bamford-Kowal-Bench (BKB) Standard Sentence Test, which is based on the speech of children aged 8–15. The fact that similar effect sizes were found with very different materials (the BKB database vs the IEEE database here) and different age groups is a solid indication that the additional interference caused by the presence of native speech in the background is a reliable and replicable phenomenon.
Our findings also agree, though less directly, with Calandruccio et al. (2013) and Lecumberri and Cooke (2006). In both of these studies, results were reported in terms of the percentage of keywords that participants entered correctly, not in terms of SRT. In the first, monolingual English participants were tested on English targets under three different masking conditions: English, Dutch and Mandarin. Both Dutch and Mandarin were foreign languages, but Dutch is phonetically and grammatically similar to English, in contrast to Mandarin. Listeners obtained a 20% increase in performance for Dutch relative to English maskers, and another 14% increase for Mandarin maskers. Considering that the slope of the psychometric functions underlying performance in such tasks is generally around 10% per dB in the vicinity of the inflection point (see, for example, Deroche et al., 2017b for an illustration of these estimates), this translates to decreases in SRT of roughly 2.0 dB and 3.4 dB between L1 maskers and Dutch and Mandarin maskers, respectively. Thus, once again, languages that are phonetically and/or grammatically more different from L1 act as weaker maskers. Curiously, this opens up the possibility of assessing how foreign different languages are to one another, that is, demonstrating that Mandarin is perceptually (rather than lexically or syntactically) more foreign to English than Dutch is. Following this sort of reasoning here, we might speculate that Tamil is similarly foreign to English and French.
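The percent-to-dB conversion used here is simply the performance difference divided by the assumed psychometric slope; the following two-line sketch reproduces the arithmetic (the 10%-per-dB slope is the approximation stated above, not a measured value).

```python
def percent_to_db(delta_percent, slope_percent_per_db=10.0):
    """Approximate SRT shift implied by a performance difference,
    assuming a local psychometric slope of ~10% per dB."""
    return delta_percent / slope_percent_per_db

print(percent_to_db(20))       # Dutch vs English maskers: ~2.0 dB
print(percent_to_db(20 + 14))  # Mandarin vs English maskers: ~3.4 dB
```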
Unlike Calandruccio et al. (2013), Lecumberri and Cooke (2006) used bilingual participants in their experiment. They compared the performance of L1-English and L1-Spanish participants on English targets (consonant phonemes; the L1 of the former group and the L2 of the latter) in a variety of noise conditions, including English and Spanish speech maskers. The L1-English participants improved slightly, by around 3%, when the competing speech was Spanish (Lf) rather than English (L1), while the L1-Spanish participants barely improved (1% difference) when the competing speech was English (L2) rather than Spanish (L1). These differences between L1 and L2/Lf maskers are consistent with the direction of the present findings (and offered mirror populations, an important asset), but they were minimal in size. We speculate that this is because their materials were much simpler (consonant phonemes compared to full sentences in our study), leaving little room for informational masking to take place in the linguistic domain.
A particular note is warranted on the two studies that are perhaps closest to the current design. The observation that a situation of L1 vs Lf is easier than L1 vs L1 (experiments 1 and 3 in Brouwer et al., 2012; and experiments 1 and 2 in Mepham et al., 2022) could be explained both by a target-to-masker similarity account and by a masker language familiarity account. However, both studies also tested bilinguals with L2 targets against L1 or L2 maskers. Surprisingly, Brouwer et al. (2012) found L2 vs L2 to be more challenging than L2 vs L1, in direct contradiction to the present results, but they also acknowledged a role for the listener's familiarity with the masker language via an indirect route (a comparison of masking release across their experiments). Mepham et al.'s (2022) results were not straightforward (since the linguistic interference was captured using a difference between forward and time-reversed maskers, compared across groups), but they also acknowledged a role for the listener's familiarity with the masker language. It is somewhat surprising that we found such a clear pattern in support of the familiarity account, while neither Brouwer et al. (2012) nor Mepham et al. (2022) observed it as clearly as we did. Aside from methodological aspects, we suspect that the gender difference between target and maskers in our study was a critical parameter. Voice pitch (like spatial location) is a powerful cue to group pieces of a target utterance into a coherent stream. Without it (Brouwer et al. used the same speakers for targets and maskers) or when it is too subtle (Mepham et al. used different female speakers, but their F0s were close), listeners may be too confused about which voice to direct their attention to, leaving little room for the listener's familiarity with a masker language to play its role. In other words, the task may be overloaded by the failure of grouping mechanisms based on F0 and vocal-tract length. If this interpretation is correct, it implies that the phenomenon of masker language familiarity is easier to observe when strong grouping cues are present, making it all the more relevant from an ecological perspective.
5.3. Future directions
Like many of the articles discussed above, our study examined the performance of adult participants. In the vein of Calandruccio et al. (2016), replicating our experimental design in children would be valuable both for developmental purposes (to better understand the role of bilingualism in development) and because children are generally known to be more prone to informational masking (e.g., Wightman, Kistler & O'Bryan, 2010), so the masker language familiarity effect might be exacerbated in paediatric populations. This poses important challenges, though, because such an endeavour would require modifications of the task and materials to make them accessible to children (e.g., a simpler task, a closed set, a smaller vocabulary). Unfortunately, doing so would reduce informational masking (which is generally more involved with complex speech materials and open-set tasks). Also, replicating this study in person is warranted to make sure that these findings obtained online are generalisable. Our comparison with in-person experiments (traits of self-scoring accuracy in Supplementary Material 1, from prior studies) at least supports the idea that the method remained generally valid and that this online dataset was of decent quality. However, samples of bilinguals found online vary in a number of other factors, typically the amount of music training (see, e.g., Neumann et al., 2023), uncontrolled here, which could have an impact on the inhibitory processes involved in speech perception. Another avenue would be to replicate this experiment with bilinguals whose two languages are more distinct. Though English is a Germanic language and French is a Romance language, French has had a large influence on English as a result of the Norman Conquest of England in the 11th century (Britannica, 2021), and both are Indo-European. Perhaps we would find more dramatic results by recruiting bilinguals whose two languages are less related, such as English and Mandarin.
6. Conclusion
Our results indicate that background babble is the most disruptive masker when spoken in the listener's L1, and the least disruptive when spoken in a language foreign to the listener. Critically, this finding was observed irrespective of the target language and was not tied to the language's identity (be it French or English). These results call for redefining the concept of informational masking in relation to the listener's linguistic profile, not just in terms of target-to-masker similarity.
Supplementary material
To view supplementary material for this article, please visit http://doi.org/10.1017/S1366728924000944.
Data availability
The data that support the findings (and additional exploratory analyses raising the possibility of qualitative differences between balanced versus unbalanced bilinguals) are openly available in OSF at https://osf.io/2x653/.
Acknowledgements
This research was supported by a pilot grant awarded to M.D. and K.B.H. from the Center for Research on Brain, Language and Music (Research Incubator Award, reference FRQ-NT RS-203287). We wish to thank Ms. Ramiya Veluppillai for her help in making the trilingual recordings used in this study.
Competing interest
The authors declare none.