Introduction
Mismatch Negativity (MMN), a component of auditory Event-Related Potentials (ERP), indexes sound-change detection. In a typical oddball paradigm, where listeners hear a series of “standard” stimuli, the brain generates a short-term memory trace of the stimuli and uses it to predict upcoming ones. If listeners hear a sound that differs from their prediction (i.e., a sound change from a standard to a deviant stimulus), a prediction error occurs, as reflected by the MMN (Luck, 2005; Näätänen et al., 2019). The MMN amplitude is affected by the frequency of standard and deviant stimuli that listeners hear in the experiment (Garrido et al., 2009; Imada et al., 1993; Javitt et al., 1998; May et al., 1999; Näätänen, 1992; Sams et al., 1983), the acoustic difference between standards and deviants (Lang et al., 1990; Näätänen et al., 1993; Tervaniemi et al., 1997), the prototypicality of standard and deviant stimuli (e.g., native vs. nonnative phones: Grimaldi et al., 2014; Näätänen et al., 1997; Peltola et al., 2003; allophonic variants: Bühler et al., 2017), and the discriminability between standards and deviants (Dehaene-Lambertz & Baillet, 1998; Lovio et al., 2009; Näätänen, 2001; Näätänen et al., 1997; Pakarinen et al., 2009; Rinker et al., 2010; Yang Zhang et al., 2005). We address how the MMN is modulated by the conscious discriminability between standard and deviant stimuli and by the prototypicality of those stimuli in listeners’ native language.
One factor that affects the MMN amplitude is conscious discriminability between standard and deviant stimuli (Amenedo & Escera, 2000; Lang et al., 1990; Näätänen, 2001; Näätänen et al., 2007; Näätänen & Alho, 1997; Pakarinen et al., 2007). Conscious discriminability refers to whether listeners can behaviorally detect an auditory difference between stimuli; for example, people can often discriminate native vowels better than nonnative ones. When perceiving nonnative phones, listeners tend to assimilate them into the most articulatorily similar native-language (L1) phonemes. According to the Perceptual Assimilation Model (PAM; Best, 1994a, 1994b, 1995), when two nonnative phones are assimilated into two different L1 phonemes (Two Category assimilation), discrimination of the two nonnative phones is excellent. When two nonnative phones are assimilated into a single L1 phoneme and are equally good exemplars of it (Single Category assimilation), discrimination is poor. Finally, when two nonnative phones are assimilated into a single L1 phoneme, but one is a better exemplar than the other (Category-Goodness difference assimilation), discrimination is moderate to good. Previous studies have shown that the behavioral discriminability predicted by these assimilation patterns correlates with the MMN amplitude (Grimaldi et al., 2014; Näätänen et al., 1997; Peltola et al., 2003). Nonnative phonetic contrasts belonging to the Category-Goodness difference assimilation type are relatively easy to discriminate and elicit a larger MMN than do the less discriminable contrasts belonging to the Single Category assimilation type (Grimaldi et al., 2014).
Discriminability and prototypicality of speech sounds are often correlated. For example, two separate native phonemes are more prototypical (i.e., more frequently used) and more discriminable than are two nonnative phones (e.g., less prototypical allophonic variants of a native phoneme). However, under the standard-deviant oddball paradigm in which the MMN is elicited, the prototypicality of the standard stimuli affects the MMN amplitude regardless of the discriminability between standard and deviant stimuli. When listeners hear a series of standard stimuli, they generate a short-term memory trace of them and predict upcoming ones. Because this involves mapping the standards onto a listener’s phonological category (Näätänen et al., 1997; Phillips et al., 2000), the phonological categories and phonetic representations in listeners’ long-term memory affect the generation of the short-term memory trace of the standard stimuli (Näätänen et al., 2007; Shafer et al., 2021). Shafer et al. (2004) found that a larger MMN is elicited when people hear a more prototypical consonant as standard and a less prototypical consonant as deviant, compared to the reversed order. Regarding vowels, Shafer et al. (2021) showed that a larger MMN is elicited when the standard stimulus is a vowel sharing common phonetic features of a native vowel and the deviant stimulus is a vowel with less common phonetic features, compared to the reversed order. Shafer et al. (2004, 2021) interpreted these results as potentially reflecting the strength of the short-term memory traces constructed for standard stimuli on the basis of listeners’ long-term phonological representations. That is, prototypical phones used as standard stimuli lead to more stable memory traces that generate more precise predictions.
Based on the theoretical account of the MMN (Näätänen et al., 2007) and the prototypicality effects (Shafer et al., 2004, 2021), we claim that the prototypicality of a standard stimulus modulates the MMN amplitude more robustly than does the discriminability between standard and deviant stimuli. If standard stimuli are good exemplars of the listener’s L1 phoneme, a more stable short-term memory trace will be formed, supporting more precise predictions about upcoming stimuli. As a result of this neural processing, when listeners perceive a deviant stimulus, a large prediction error occurs, leading to a large MMN. In contrast, if standard stimuli are not prototypical for listeners, the process of generating a short-term memory trace will be less certain, leading to less precise predictions about upcoming stimuli and a smaller MMN (Jacobsen et al., 2004; Peltola et al., 2003; see predictive coding: Baldeweg, 2006; Friston, 2002, 2005; Garrido et al., 2009; Wacongne et al., 2012; Winkler & Czigler, 2012). We hypothesized that this prototypicality effect occurs even when both standard and deviant stimuli are categorized as a single native phoneme.
This study examined which of the two factors, standard-stimulus prototypicality or the discriminability between standards and deviants, modulates the MMN response more. Japanese speakers’ English vowel perception was tested in a previous behavioral study (Shinohara et al., 2019), and their MMN responses were measured in the present study. Table 1 presents the predictions from the two opposing accounts. Although the English /æ/, /ʌ/, and /ɑ/ are all categorized as the single Japanese phoneme /a/, it is more difficult for Japanese speakers to discriminate between the English /ɑ/ and /ʌ/ stimuli than between the English /æ/ and /ʌ/, because the former are equally good exemplars of the Japanese /a/ (English /ɑ/-/ʌ/: Single Category assimilation), whereas the English /æ/ is a worse exemplar of the Japanese /a/ than is the English /ʌ/ (English /æ/-/ʌ/: Category-Goodness difference assimilation) (Lengeris, 2009; Shinohara et al., 2019; Strange et al., 1998). The discriminability account predicts that a larger MMN response would be observed when the English /æ/ is used as standard and the English /ʌ/ as deviant, compared to when the English /ɑ/ is used as standard and the English /ʌ/ as deviant. In contrast, according to the prototypicality account, a larger MMN is expected when Japanese speakers hear the English /ɑ/ as standard and the English /ʌ/ as deviant. Under the condition in which Japanese speakers hear a series of English /ɑ/ as standard stimuli, given its prototypicality as the Japanese /a/ (Lengeris, 2009; Shinohara et al., 2019; Strange et al., 1998), they easily form a short-term memory trace while hearing the standards, which results in a stronger prediction. Thus, a deviant stimulus /ʌ/ would elicit a larger MMN. However, if they hear the English /æ/ as standard, owing to its poor fit to the Japanese /a/ (Lengeris, 2009; Shinohara et al., 2019; Strange et al., 1998), a less robust memory trace will be generated. The prototypicality account therefore predicts a smaller MMN amplitude for the /æ/-/ʌ/ contrast.
When examining the discriminability of natural recordings with MMN responses, care must be taken to control for confounding acoustic differences. The discrimination accuracy of the /æ/-/ʌ/ contrast is higher than that of the /ɑ/-/ʌ/ contrast for Japanese speakers (Lengeris, 2009; Strange et al., 1998), but this result is often attributed to both perceptual assimilation and acoustic distance: the English /æ/ and /ʌ/ are acoustically more distant from each other than are the English /ɑ/ and /ʌ/ (Hillenbrand et al., 1995). In this study, we controlled for this confound by using resynthesized stimuli of the English /æ/, /ʌ/, and /ɑ/. The study used two English vowel contrasts (/æ/-/ʌ/, /ɑ/-/ʌ/) with equalized acoustic distances (see the “Method” section for details) and measured the MMN amplitudes of the /æ/-/ʌ/ and /ɑ/-/ʌ/ contrasts in Japanese speakers. Because the MMN is elicited by a regularity violation, we predicted that the prototypicality of the standard stimuli that generate the regularity (i.e., the short-term memory trace) would modulate the MMN amplitude more than the discriminability between standard and deviant stimuli. Native English speakers were also recruited as a control group: the two MMN amplitudes should be about the same for English speakers, who have three separate phonological categories for /æ/, /ʌ/, and /ɑ/. The results describe how the MMN mechanism is phonetically driven and identify factors that need to be considered when measuring the MMN amplitude.
Method
Participants
This study was approved by the ethics review boards at Waseda University (Tokyo, Japan) and the University of Delaware (Delaware, US); all participants signed informed consent forms. A total of 56 people participated in the electroencephalography (EEG) recording sessions at two different laboratories. All participants at the University of Delaware were native monolingual speakers of American English, and all participants at Waseda University were native monolingual speakers of Japanese. Participants reported no history of speech or hearing impairments, had not lived outside their home country for more than 4 months, and spoke only their native language in their daily lives. In addition, each participant’s parents were native speakers of the participant’s native language. Table 2 shows the age, gender, and handedness data of the participants. Although gender was not balanced between the language groups, age was nearly the same.
Stimuli
Using linear predictive coding (LPC) analysis and resynthesis in Praat (Boersma & Weenink, 2017), three stimuli were generated. LPC is often used in signal processing to control acoustic cues such as formant frequencies (i.e., one of the acoustic cues that people use for identifying vowels). In this study, a neutral LPC residual (i.e., a female voice with formant frequencies cancelled out) was filtered using a spectral envelope with F1 to F4 information. Shinohara et al. (2019) examined categorization with a goodness-rating test for 28 English and 30 Japanese speakers, using a stimulus continuum varying only in F2. The F1, F3, and F4 were set at 979 Hz, 2,886 Hz, and 4,151 Hz, respectively. The stimuli with F2 at 2,017 Hz, 1,755 Hz, and 1,493 Hz were most frequently identified as /æ/, /ʌ/, and /ɑ/, respectively, by English speakers, whereas the same stimuli were all categorized as Japanese /a/ by Japanese speakers. The goodness-rating results for the 30 Japanese speakers demonstrated that the /æ/ stimulus was a significantly worse exemplar of the Japanese /a/ than the /ʌ/ stimulus, whereas there was no significant difference in goodness rating between the /ɑ/ and /ʌ/ stimuli. Thus, it was confirmed that the English /æ/-/ʌ/ contrast belongs to the Category-Goodness difference assimilation type, whereas the English /ɑ/-/ʌ/ contrast belongs to the Single Category assimilation type for Japanese speakers. Shinohara et al. (2019) also examined the discriminability of the English /æ/-/ʌ/ and /ɑ/-/ʌ/ contrasts for Japanese speakers and found that the /ɑ/-/ʌ/ contrast was significantly more difficult to discriminate than the /æ/-/ʌ/ contrast, even though the acoustic distance between the sounds in each contrast was the same in Hertz.
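To make the source-filter logic of this manipulation concrete, the following Python sketch illustrates LPC analysis and resynthesis. It is a minimal analogue, not the authors’ Praat procedure: the file name, model order, and reuse of the estimated envelope are placeholder assumptions.

```python
import librosa
from scipy.signal import lfilter

# Load a recorded vowel (mono). "vowel.wav" is a placeholder.
y, sr = librosa.load("vowel.wav", sr=None, mono=True)

# Estimate an all-pole (LPC) model of the vocal-tract envelope.
# A common rule of thumb is sr/1000 + 2 coefficients.
order = int(sr / 1000) + 2
a = librosa.lpc(y, order=order)

# Inverse-filter to obtain the residual: the source signal with the
# original formant structure cancelled out (analogous to the neutral
# LPC residual described above).
residual = lfilter(a, [1.0], y)

# Re-filter the residual through an all-pole envelope to impose formant
# frequencies. Here the estimated envelope is simply reused; in practice
# one would design a new envelope with the target F1-F4 values.
resynthesized = lfilter([1.0], a, residual)
```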
Table 3 shows the acoustic information of the stimuli resynthesized for the present auditory ERP experiment. Figure 1 displays the F1 and F2 frequencies of those stimuli on the Bark scale (i.e., a frequency scale corresponding to human auditory perception). To minimize the discrepancy between acoustic and perceptual distance, the F2 frequencies of the three English /æ/, /ʌ/, and /ɑ/ stimuli used in Shinohara et al. (2019) were adjusted to equal steps on the Bark scale for the auditory ERP experiment in this study. As depicted by the red circles in Figure 1, the F2 of the English /æ/ was set at 2,027 Hz (13.1 in Bark), that of the English /ʌ/ at 1,746 Hz (12.1 in Bark), and that of the English /ɑ/ at 1,502 Hz (11.1 in Bark). Another eight stimuli (Random 1–8 in Table 3) were created and are represented by the blue dots in Figure 1. F1 was set at 378 Hz (3.8 in Bark) for the three close vowels (Random 6, 7, and 8 in Table 3), 644 Hz (6.1 in Bark) for the three mid vowels (Random 3, 4, and 5), and 979 Hz (8.4 in Bark) for the two open vowels (Random 1 and 2). F2 was set at 2,463 Hz (14.4 in Bark) for the three front vowels (Random 1, 3, and 6), 1,746 Hz (12.1 in Bark) for the two central vowels (Random 4 and 7), and 1,229 Hz (9.8 in Bark) for the three back vowels (Random 2, 5, and 8). Other acoustic cues (F3, F4, bandwidth of each formant, duration) were the same as those in the behavioral auditory discrimination test of Shinohara et al. (2019). The sound intensity was normalized across the 11 stimuli by the root mean square method in Praat (Boersma & Weenink, 2019).
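The Hz-to-Bark pairs reported above are consistent with Traunmüller’s (1990) approximation. The paper does not name the conversion formula it used, so the following sketch is offered only as a plausible way to verify the equal one-Bark spacing of the F2 values.

```python
def hz_to_bark(f):
    """Traunmueller's (1990) Hz-to-Bark approximation."""
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(z):
    """Inverse of the approximation above."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

# The stimulus F2 values reported above fall at equal 1-Bark steps:
# 2,027 Hz -> 13.1 Bark, 1,746 Hz -> 12.1 Bark, 1,502 Hz -> 11.1 Bark.
for f2 in (2027, 1746, 1502):
    print(f"{f2} Hz = {hz_to_bark(f2):.1f} Bark")
```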
Procedure
EEG Recording
A total of 56 people (26 American English and 30 Japanese speakers) participated in the EEG recording sessions, conducted at two locations (United States, Japan). One English-speaking participant’s data were not saved because of a technical problem, which resulted in the data of 55 participants being analyzed. The participants were tested in a soundproof booth, where they were seated in a comfortable reclining chair. For English speakers, continuous EEG was recorded from 128 carbon fiber core/silver-coated electrodes in an elastic electrode net (HydroCel Geodesic Sensor Net) at the University of Delaware. The continuous EEG was digitized with the EGI Net Station software v. 4.5 at a sampling rate of 250 Hz. Before data acquisition, electrode impedances were lowered to below 50 kΩ. Participants’ electroocular activity was recorded from four bipolar channels. The vertical electrooculogram (EOG) was recorded with the supraorbital and infraorbital electrodes of both eyes; the horizontal EOG was recorded with the electrodes located at the outer canthi of both eyes. Channel E129 (corresponding to the Cz electrode in the 10-10 system), placed at the center point of the scalp, was used as the online reference site. The data of English speakers were passed through a 0.3 Hz FIR high-pass filter after recording, and the channel set was later remapped to match the 32 channels of the BrainAmp (Brain Products GmbH) used in the other laboratory in Japan, so that both datasets could be analyzed together.
For Japanese speakers, the continuous EEG was recorded at Waseda University from 32 sintered Ag/AgCl passive electrodes connected to a BrainAmp amplifier and mounted on the scalp with an EEG recording cap (Easy Cap 40 Asian cut, Montage No. 24). Of the 32 channels, one was used for recording horizontal eye movement (HEOG) and placed next to the outer canthus of the right eye. One channel (otherwise used as AFz) was used for grounding, and one was used as the online reference electrode, placed at FCz (i.e., the fronto-central point of the head). The remaining 29 channels were mounted onto the cap according to the 10/20 system, with the electrode adaptor A06, using high-chloride abrasive electrolyte gel. The impedance level for all electrodes was below 5 kΩ. The analog signal was digitized at 250 Hz. As the data had passed through an online 0.016 Hz high-pass filter in the Brain Vision Recorder software as a default setting, no offline filter was needed for the Japanese speakers’ data.
There were three blocks in the EEG recording session. The first was the random-standard control condition (Horváth et al., 2008), which comprised a randomized sequence of eight resynthesized vowel sounds (as illustrated in Figure 1) and the /ʌ/ stimulus that served as the deviant in the following two blocks. Because the eight random standard stimuli did not form a single category, participants could not predict the upcoming stimuli. Therefore, the /ʌ/ stimulus should not elicit an MMN in this condition, and the Auditory Evoked Potential (AEP) it generates should simply be the ERP response to that particular sound, unaffected by MMN modulation. It therefore serves as a control condition to show that the attenuation of the AEP in the two MMN-testing conditions is due to the MMN mechanism. Strictly speaking, the /ʌ/ stimulus in the random-standard control condition was not a deviant, because it simply appeared among the eight random standards; however, it is labeled “deviant” in the statistical analysis so that the AEP change from standards to deviant can be compared with that in the two MMN-testing conditions.
In the second and third blocks, the participants heard either /æ/ or /ɑ/ as the standard (or frequent) stimulus, whereas /ʌ/ was always the deviant. The block order was counterbalanced between the participants in each language group: if the second block presented /æ/ as standard and /ʌ/ as deviant, the third block presented /ɑ/ as standard and /ʌ/ as deviant, or vice versa. In both blocks, because the standard stimulus is of a single kind, participants can predict upcoming stimuli while hearing a repetition of the standard. When they hear a deviant, an MMN is expected to be elicited (i.e., the MMN-testing conditions).
In summary, two native-language groups (English and Japanese) heard two stimulus types (standard and deviant) in three vowel contrast conditions (control, front, and back). Each of the three blocks had 900 tokens (800 standards and 100 deviants), resulting in a total of 2,400 standards and 300 deviants. A continuous sequence of standards and deviants was presented through ER1 insert earphones (Etymotic Research) in Japan, but through two analog speakers in the United States. At both sites, the stimuli were played at 70 dB on average, and the interstimulus interval varied randomly around 717 ms (median = 716 ms, SD = 50 ms). Each block lasted about 11 minutes, and the entire EEG recording took about 45 minutes, including breaks. During the EEG recording, the subjects were instructed to ignore the auditory input and watch a movie, Wall-E (Stanton, 2008), with no sound, because the MMN is elicited even in the absence of attention (Atienza & Cantero, 2001; Atienza et al., 1997, 2000, 2001, 2004, 2005; Nashida et al., 2000; Sallinen et al., 1994, 1996; Sculthorpe et al., 2009).
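As an illustration of this block structure, the following Python sketch generates one pseudo-randomized block of 800 standards and 100 deviants with a jittered interstimulus interval. The non-adjacency constraint on deviants and the Gaussian jitter are assumptions made for the sketch; the paper does not specify its randomization scheme.

```python
import random

def make_block(standards, deviant, n_std=800, n_dev=100, seed=0):
    """One oddball block: n_std standards and n_dev deviants, with
    deviants placed at non-adjacent positions (an assumed constraint)."""
    rng = random.Random(seed)
    n = n_std + n_dev
    # Classic bijection: k non-adjacent positions among n slots
    # correspond to k positions among n - k + 1 slots.
    slots = sorted(rng.sample(range(n - n_dev + 1), n_dev))
    dev_positions = {s + i for i, s in enumerate(slots)}
    return [deviant if i in dev_positions else rng.choice(standards)
            for i in range(n)]

def jittered_isi(rng):
    """ISI jittered around ~717 ms (median 716 ms, SD 50 ms in the
    paper); the Gaussian shape is an assumption."""
    return rng.gauss(0.717, 0.050)

# Control block: /ʌ/ among the eight random standards; MMN-testing
# blocks: a single standard vowel with /ʌ/ as deviant.
control = make_block([f"random{i}" for i in range(1, 9)], "ʌ")
front   = make_block(["æ"], "ʌ", seed=1)
back    = make_block(["ɑ"], "ʌ", seed=2)
```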
Table 4 presents the six cells resulting from the three testing conditions crossed with the two stimulus types: (1) random standards in the control condition, (2) /ʌ/ in the random-standards control condition, (3) standard /æ/ in the front /æ/-/ʌ/ vowel condition, (4) deviant /ʌ/ in the front /æ/-/ʌ/ vowel condition, (5) standard /ɑ/ in the back /ɑ/-/ʌ/ vowel condition, and (6) deviant /ʌ/ in the back /ɑ/-/ʌ/ vowel condition. We hypothesized that, for Japanese speakers, the MMN elicited when the sound changes from (5) standard /ɑ/ to (6) deviant /ʌ/ would be larger than the MMN elicited when the sound changes from (3) standard /æ/ to (4) deviant /ʌ/, whereas English speakers would show no such difference between the two conditions.
EEG Signal Processing
Segmentation and Artifact Correction. After recording, the raw continuous EEG data were imported into the ERP PCA Toolkit v. 2.77 (Dien, 2010) running on MATLAB R2019b (Delorme & Makeig, 2004). The continuous EEG was first segmented into epochs from –200 ms to 800 ms relative to stimulus onset. The segmented data were baseline-corrected by subtracting the mean of the 200 ms baseline period (i.e., –200 ms to 0 ms from stimulus onset) from the whole segment. The data were then submitted to an automatic eyeblink-subtraction process using Independent Component Analysis (ICA). An eyeblink template was automatically generated for each subject, and a component was marked as an eyeblink and subtracted from the data if it correlated at r = 0.9 or greater with the template. Next, channels were marked bad if their best absolute correlation with their neighboring channels fell below 0.4 across all time points; these bad channels were replaced using spline interpolation from neighboring good channels. A channel was declared globally bad if it was bad on more than 20% of the trials, and a trial was marked bad and zeroed out if it contained more than 10% bad channels. All channels were rereferenced to the average of the two mastoid electrodes. Finally, the remaining amplitude data from all tokens were assigned to the six cells in Table 4 and averaged for each participant. If a participant had more than 10% globally bad channels or fewer than 15 good trials in any of the six cells, that participant’s data were excluded from the statistical analysis. No American English speakers’ data were excluded on this criterion, whereas four Japanese speakers’ data were excluded.
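For readers working in Python, a rough MNE-Python analogue of this epoching pipeline is sketched below. The study itself used the ERP PCA Toolkit in MATLAB; the file name, event codes, ICA component count, and mastoid channel names here are assumptions, and MNE’s EOG-based component matching stands in for the toolkit’s template-correlation rule.

```python
import mne

# Placeholder file and trigger codes; real data would come from the
# EGI or BrainAmp recordings described above.
raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)
events = mne.find_events(raw)
event_id = {"standard": 1, "deviant": 2}  # assumed trigger codes

# Epoch from -200 to 800 ms around stimulus onset; subtract the mean
# of the -200..0 ms window as baseline.
epochs = mne.Epochs(raw, events, event_id=event_id, tmin=-0.2, tmax=0.8,
                    baseline=(-0.2, 0.0), preload=True)

# ICA-based eyeblink removal. The toolkit subtracted components that
# correlated with an eyeblink template at r >= 0.9; MNE instead scores
# components against the recorded EOG channels.
ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(epochs)
eog_inds, _ = ica.find_bads_eog(epochs)
ica.exclude = eog_inds
ica.apply(epochs)

# Interpolate bad channels and rereference to the mastoid average
# (channel names depend on the montage).
epochs.interpolate_bads()
epochs.set_eeg_reference(["M1", "M2"])

# Per-participant averages for the standard and deviant cells.
evoked_std = epochs["standard"].average()
evoked_dev = epochs["deviant"].average()
```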
PCA Preprocessing and Selection of the Time Windows and Electrode Regions
We took two steps to measure the MMN amplitudes. First, the time windows and electrode regions were objectively selected using a sequential temporospatial Principal Component Analysis (PCA), with a Promax rotation for the temporal PCA and an Infomax rotation for the spatial PCA. PCA decomposes the temporal and spatial dimensions into a linear combination of a smaller set of abstract ERP factors based on covariance patterns among time points and electrode sets. Before conducting the PCA, the electrode montage used to collect the English speakers’ data was remapped to the montage used for the Japanese speakers in the ERP PCA Toolkit 2.93 (Dien, 2010), so that the data could be combined into a single dataset with the speaker’s language as a between-subject variable. Using the combined data of the Japanese and English speakers, three difference waves were calculated by subtracting the absolute waveforms of the standard stimuli (i.e., random standards, /æ/, /ɑ/) from those of the deviant ones (i.e., /ʌ/) in the three vowel conditions (control, front, and back). The PCA was then conducted on the three difference waves to identify the temporal and spatial distribution of the MMN. The temporal PCA generated 22 temporal factors, accounting for 83% of the total variance, and the spatial PCA identified four spatial factors for each of the temporal factors, accounting for 69% of the total variance. Temporal factors that accounted for less than 5% of the total variance were excluded, leaving four temporal factors (TF1 = 16.7%, TF2 = 12.8%, TF3 = 7.3%, TF4 = 6.8%).
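The sequential decomposition can be summarized in a simplified, unrotated Python sketch. The ERP PCA Toolkit applies Promax and Infomax rotations, which scikit-learn does not provide, so the code below (with made-up array shapes) only illustrates the two-step, temporal-then-spatial factoring.

```python
import numpy as np
from sklearn.decomposition import PCA

# diff_waves: observations x electrodes x timepoints. Random numbers
# stand in for data; shapes loosely mirror the study (participants x
# conditions, 29 scalp channels, 250 samples per 1-s epoch).
rng = np.random.default_rng(0)
diff_waves = rng.standard_normal((55 * 3, 29, 250))
n_obs, n_elec, n_time = diff_waves.shape

# Step 1: temporal PCA. Time points are the variables; every
# observation-electrode combination is a case.
X_t = diff_waves.reshape(n_obs * n_elec, n_time)
t_pca = PCA(n_components=4)
t_scores = t_pca.fit_transform(X_t)  # (n_obs * n_elec, 4)

# Step 2: spatial PCA on each temporal factor's scores. Electrodes are
# the variables; observations are the cases.
spatial_loadings = []
for tf in range(t_scores.shape[1]):
    scores = t_scores[:, tf].reshape(n_obs, n_elec)
    s_pca = PCA(n_components=4).fit(scores)
    spatial_loadings.append(s_pca.components_)  # loadings over electrodes
```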
Figure 2 displays the difference waves reconstructed as voltage based on the temporospatial factor loadings. The PCA identified the temporospatial factors that showed attenuation of the AEP between standard and deviant stimuli in the MMN-testing conditions (front and back) relative to the control condition. Four temporospatial factors (TF1SF1, TF2SF1, TF3SF1, and TF4SF1) showed negative responses in the front and back conditions, but not in the control condition, and their electrode regions also corresponded to the MMN. However, one factor (TF2SF1), which peaked at 744 ms within the 632–796 ms time window, was excluded because of its late response (Garrido et al., 2009; Luck, 2005; Yun Zhang et al., 2018). Thus, the three temporospatial factors displayed in Figure 2 (TF1SF1, TF3SF1, and TF4SF1) were selected for further analysis.
Table 5 describes the selected temporospatial factors. Their time windows carry factor loading scores of more than 0.6, and their electrodes carry factor loading scores of more than 0.9. In the final step, using the time windows and electrode regions identified by the PCA, the amplitude of the absolute waveform for each token (2,400 standards and 300 deviants) was computed and assigned to one of the six cells (random standards in the control condition, /ʌ/ in the control condition, /æ/ standards in the front /æ/-/ʌ/ condition, /ʌ/ deviants in the /æ/-/ʌ/ condition, /ɑ/ standards in the back /ɑ/-/ʌ/ condition, and /ʌ/ deviants in the /ɑ/-/ʌ/ condition). Finally, we averaged the tokens in each cell for each participant for the statistical analyses.
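A minimal sketch of this final measurement step is given below, assuming a single token’s epoch stored as an electrodes-by-timepoints array; the window and electrode arguments stand for the PCA-selected regions in Table 5, and the channel names in the usage comment are hypothetical.

```python
import numpy as np

def mean_amplitude(epoch, ch_names, times, window, electrodes):
    """Mean voltage of one token over a time window (in seconds) and
    an electrode set; epoch is an electrodes x timepoints array."""
    t_idx = np.where((times >= window[0]) & (times <= window[1]))[0]
    e_idx = [ch_names.index(ch) for ch in electrodes]
    return epoch[np.ix_(e_idx, t_idx)].mean()

# Hypothetical usage with a 150-250 ms window over fronto-central sites:
# amp = mean_amplitude(epoch, ch_names, times, (0.150, 0.250),
#                      ["Fz", "FCz", "Cz"])
```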
Results
Figure 3 displays the voltage amplitudes of the absolute waveforms of standards and deviants and their difference waves (deviant minus standard) in the control (random standards vs. /ʌ/), front (standard /æ/ vs. deviant /ʌ/), and back (standard /ɑ/ vs. deviant /ʌ/) vowel conditions. The boxplots show the voltage amplitudes of the difference waves collected from the time windows and electrode regions of the temporospatial factors reflecting the MMN. The difference wave was calculated by subtracting the voltage amplitude of the standard stimuli from that of the deviant stimuli in each vowel condition (control, front, and back). A linear mixed-effects model was used for the statistical analysis, with the difference wave as the dependent variable. The best-fitting model was selected through a top-down approach (i.e., excluding unnecessary fixed and random factors from a model that included all potential factors), using the R package lme4 (Bates et al., 2015). The linear mixed-effects model included the fixed factors of language group (English, Japanese), vowel contrast condition (control: random vowels-/ʌ/, front: /æ/-/ʌ/, back: /ɑ/-/ʌ/), and their interactions. Orthogonal contrasts were set for each fixed factor. The random factors were the crossed intercepts of participant and temporospatial factor.
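The analysis was run in R with lme4; for illustration only, a simplified Python analogue using statsmodels is sketched below. It keeps the fixed-effects structure but, for brevity, fits only a random intercept per participant rather than the crossed participant and temporospatial-factor intercepts, and the file and column names are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder long-format table: one row per participant x condition x
# temporospatial factor, with the difference-wave amplitude as outcome.
df = pd.read_csv("difference_waves.csv")

model = smf.mixedlm(
    "diff_wave ~ C(language) * C(condition)",  # fixed effects + interaction
    data=df,
    groups="participant",                      # random intercept only
)
result = model.fit()
print(result.summary())
```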
Table 6 presents the results of the planned contrast analyses of the linear mixed-effects model. The significant effect of vowel condition (control vs. the front and back MMN-testing conditions), β = –0.48, SE = 0.03, t = –15.56, p < .001, indicates that the attenuation of the absolute waveforms from standard to deviant stimuli in the MMN-testing (i.e., front and back) conditions was significantly larger than that in the control condition. Another significant vowel condition contrast (front vs. back), β = –0.24, SE = 0.05, t = –4.49, p < .001, indicates that the MMN effect was larger in the back condition than in the front condition across language groups. However, there was a significant two-way interaction of language group (English vs. Japanese) and vowel condition (front vs. back), β = –0.22, SE = 0.05, t = –4.07, p < .001, indicating that the front-back difference in the MMN effect differed between English and Japanese speakers.
Further analyses were conducted for each language group using separate linear mixed-effects models. For Japanese speakers, the MMN amplitude was larger in the back (i.e., standard /ɑ/ vs. deviant /ʌ/) than in the front (i.e., standard /æ/ vs. deviant /ʌ/) condition, as shown by a significant vowel condition contrast (front vs. back), β = –0.45, SE = 0.08, t = –5.58, p < .001. The other vowel condition contrast (control vs. front and back) was also significant, β = –0.72, SE = 0.05, t = –15.39, p < .001, indicating a significant MMN effect in the MMN-testing conditions relative to the control condition.
For the English speakers, in contrast, no significant MMN amplitude difference between the front (i.e., standard /æ/ vs. deviant /ʌ/) and back (i.e., standard /ɑ/ vs. deviant /ʌ/) conditions was observed, β = –0.02, SE = 0.07, t = –0.33, p > .05. That is, the MMN effect did not differ between the two conditions, although it was significant in the MMN-testing conditions, as demonstrated by a significant vowel condition contrast (control vs. front and back), β = –0.23, SE = 0.04, t = –5.97, p < .001.
These results show that the deviant /ʌ/ in a series of standard /ɑ/ (the back condition) elicited a larger MMN than the deviant /ʌ/ in a series of standard /æ/ (the front condition) for Japanese speakers, but there was no such vowel condition effect on MMN for English speakers.
Discussion
The present study examined the opposing effects of the prototypicality and discriminability of standard and deviant stimuli and investigated which of the two modulates the MMN amplitude more. According to the discriminability account, the front condition (i.e., /ʌ/ deviants in a series of /æ/ standards) should elicit a larger MMN than the back condition (i.e., /ʌ/ deviants in a series of /ɑ/ standards), as the front contrast (/æ/ vs. /ʌ/) is more discriminable than the back one (/ɑ/ vs. /ʌ/) for Japanese speakers (Shinohara et al., 2019). However, we hypothesized the opposite, following the prototypicality account, whereby the back condition elicits a larger MMN than the front one. When Japanese speakers hear the English /ɑ/ as standard stimuli in an oddball paradigm, the prototypical phonetic status of the English /ɑ/ as Japanese /a/ readily generates a short-term memory trace, and their prediction of upcoming stimuli becomes robust. When they then hear a deviant /ʌ/, a stronger prediction error occurs, resulting in a larger MMN, compared to when they hear a deviant /ʌ/ in a series of standard /æ/, which is less prototypical for Japanese speakers. The results supported this prototypicality account, indicating that the prototypicality of standard stimuli modulates the MMN amplitude more than does the discriminability between standard and deviant stimuli.
The results of this study showed that generating a short-term memory trace from a series of standard stimuli that are easily mapped onto a listener’s L1 phonological category matters more for eliciting a large MMN than does the discriminability between standard and deviant stimuli. This interpretation is supported by previous findings. For example, Shafer et al. (2004) found that no MMN was elicited when English speakers heard a nonnative phone as standard and a native phone as deviant, although an MMN was observed when they heard a native phone as standard and a nonnative phone as deviant. This is because hearing a nonnative phone repeatedly does not generate a robust short-term memory trace, whereas hearing repetitions of a native phone does. Thus, the phonetic status of the standard in listeners’ phonological representations, namely its prototypicality as an L1 phoneme, affects the prediction of upcoming stimuli, modulating the MMN amplitude.
A more recent study testing Japanese speakers’ perception of English vowels demonstrated similar results. Shafer et al. (2021) found that a larger MMN was elicited when a prototype stimulus (e.g., a nonnative stimulus sharing the phonetic features of a Japanese phoneme) was used as standard and a nonprototype stimulus as deviant, compared to the reversed order. In Shafer et al. (2021), naturally recorded stimuli varying in both spectral and durational cues were used, and their prototypicality in L1 phonology was determined by a feature-based analysis. The current study controlled the stimuli more carefully, with the English vowel stimuli differing only in F2 frequency. Japanese speakers’ goodness-rating scores for the English stimuli were measured in Shinohara et al. (2019) and were statistically compared to confirm the perceptual assimilation patterns and the stimuli’s prototypicality as the Japanese vowel /a/. Even after this careful stimulus control, the current study found similar results. Although there are contradictory results in the literature (e.g., Aaltonen et al., 1997), given that Shafer et al. (2021) and the present study obtained similar results, it is plausible to conclude that the phonetic status of the standard stimuli (prototypicality/familiarity) in an oddball paradigm has a significant effect on the MMN amplitude, at least for the perception of the English /æ/, /ʌ/, and /ɑ/ by Japanese speakers.
The findings of the present study show that the MMN amplitude can be used as an index of the phonetic status of sounds in listeners’ phonological representations. Newborns statistically learn the acoustic-phonetic features of the ambient language; they become better at discriminating frequently perceived speech sounds (i.e., native phonemes) and worse at discriminating infrequently perceived ones (i.e., nonnative phones) (Kuhl, 2010; Kuhl et al., 2006, 2008). When such frequently perceived speech sounds (e.g., two native phonemes) are used as standards and deviants in an oddball paradigm, a large MMN is elicited (Näätänen et al., 1997; Peltola et al., 2003). In addition, a phonetic training study demonstrated that when two sounds that have been learned in a bimodal distribution are used as standard and deviant stimuli, a larger MMN is elicited than when two sounds learned in a unimodal distribution are used (Wanrooij et al., 2014). Together, the current and previous results suggest that the size of the MMN amplitude reflects statistical phonetic learning: if the standard stimulus in an oddball paradigm has been learned as a prototype in a phoneme distribution, a larger MMN is elicited when listeners hear a deviant; if the standard is not a prototype, a smaller MMN is elicited. Thus, the MMN amplitude can serve as an index of the phonetic status of sounds in listeners’ phonological representations.
MMN asymmetry is affected by many factors, as perceptual salience arises from both language-universal biases (e.g., favoring focal vowels: Masapollo et al., 2015, 2017; Polka & Bohn, 2003, 2011; Schwartz et al., 2005) and language-specific perception biases (e.g., favoring native phoneme prototypes: Iverson & Kuhl, 1995; Kuhl, 1991; Kuhl et al., 1992; and underspecification: Eulitz & Lahiri, 2004; Hestvik & Durvasula, 2016). Although the current study demonstrated that the prototypicality of the standard stimuli in listeners’ phonological representations modulates the MMN amplitude more robustly than does the conscious discriminability between standard and deviant stimuli, it was not possible to isolate the prototypicality and discriminability effects from other factors or to confirm that the same result holds for the perception of other nonnative phones. Future studies should test additional predictions with other pairs of stimuli and investigate how these factors interact with one another and how the MMN amplitude is affected by their interaction.
In conclusion, the auditory ERP experiment showed that MMN is not a mere reflection of the discriminability of speech sounds. The prototypicality of the standard stimulus modulates the MMN amplitude more than the discriminability.