1. Introduction
It is estimated that approximately 60% of the world's population speaks two or more languages. Indeed, with global economic development and cultural exchanges at home and abroad, the demand for mastering more than one language continues to rise. Research on bi/multilingualism has shown, however, that foreign language learning can present challenges for many individuals, particularly for adults. While some assumptions emphasize a critical period of language learning (Johnson & Newport, 1989), other theories and empirical evidence point out that learning contexts play a key role (Kuhl et al., 2003; Laufer & Hulstijn, 2001; Legault et al., 2019a; Li & Jeong, 2020). For example, compared to traditional classroom settings, immersive study abroad settings offer a contextualized, real-life experience that can enhance foreign language learning outcomes (Jackson & Schwieter, 2019; Klassen et al., 2021; Linck et al., 2009). With the widespread promotion and development of innovative technologies, immersive virtual reality (VR) contexts may have similar effects given their ability to simulate language-enriching experiences through exposure to multiple types of sensory and motor information. However, very little is known about how VR contributes to foreign language learning, especially the early stage of lexical form acquisition. Accordingly, in the present study, we analyze behavioral and electrophysiological data to investigate the effects of immersive VR on lexical form acquisition in a new foreign language, and compare these effects to a traditional picture-word (PW) association context.
1.1 Novel word learning in a multisensory context
Current evidence and theories propose that, compared to a unisensory context, the presence of complementary information across multiple sensory modalities during learning (e.g., flash cards, video) is beneficial for learning performance (Mayer et al., 2015; von Kriegstein & Giraud, 2006). In the domain of novel word learning, learning context is also crucial, such that rich sensory information and body movements (e.g., gestures) can facilitate the learning process. In a recent study, Jeong et al. (2021) asked Japanese first language (L1) speakers to learn novel words in Korean in either a multisensory or a translation context. Participants in the translation context learned the novel words by hearing them and seeing their written L1 translation. In the multisensory context, which offered a rich sensory experience, participants watched video clips in which novel words were used in real-life communicative situations. Results revealed that novel word learning in the multisensory context largely recruited brain regions typically associated with social and perception-action-related processing. In addition, word learning with a rich sensory experience was positively correlated with subsequent lexical retrieval performance, suggesting that a learning context involving social, multisensory, action-perception processing may lead to more efficient encoding, retention, and retrieval of novel words (see also Jeong et al., 2010). One of the theoretical motivations for investigating the role of a rich sensory experience in novel word learning stems from embodied cognition theory, which emphasizes that whole-body interactions with an environment shape experience and knowledge (Barsalou, 2008). Mayer et al. (2015) compared novel word learning in a verbal learning condition, a viewed picture condition, and a self-performed gesture condition. The results showed that the gesture condition with multiple sensory information enhanced learning by engaging visual and motor brain regions.
It is widely accepted that novel word learning in multisensory contexts provides the opportunity for learners to make direct links between novel words and the concepts they represent (Jeong et al., 2010; Lan et al., 2015; Mayer et al., 2015). According to social L2 learning theory (SL2; Li & Jeong, 2020), in traditional classroom contexts with very limited sensory experiences, individuals often learn novel words through picture-word or word-word associations, relying heavily on the L1 translation equivalents of those words (Kroll & Stewart, 1994). In multisensory contexts that include rich sensory and motor experiences, by contrast, learning novel words is reinforced by perceiving and interacting with the environment. Consequently, learners establish an L1-like lexical-semantic representation of the novel words that contains rich sensory experiences, such as perceptual, spatial, and motor features (Li & Jeong, 2020).
1.2 Immersive virtual reality and novel word learning
Although novel word learning in multisensory contexts has clear benefits, such contexts may not be feasible for or accessible to every learner. Nonetheless, recent advances in technology allow researchers to create an immersive virtual reality (VR) environment that can be used to explore novel word learning in simulated real-life environments with rich sensory and motor information (Fuhrman et al., 2021; Legault et al., 2019a; Li & Lan, 2021). Immersive VR technology can provide a high degree of immersion through an interactive environment that is similar to real-life situations (Legault et al., 2019a). Using a head-mounted display, individuals are placed in a 360° environment in which they can turn and look in any direction by moving their eyes and body.
Based on evidence from studies that used immersive VR technology to examine the role of a rich sensory experience in novel word learning, the prevailing view is that immersive VR can be an effective tool for novel word learning (for a review, see Li & Lan, 2021). In one study, Legault et al. (2019a) asked English monolinguals to learn two sets of L2 Chinese words in either a word-word association context or an immersive VR context. Behavioral data collected immediately after the treatment revealed a positive effect of the immersive VR context on L2 word learning. The authors argued that when learners are exposed to rich sensory and motor information in VR contexts, they perform better at allocating attentional resources and inhibiting irrelevant information, which has a positive effect on learning performance. Fuhrman et al. (2021) also reported significant effects when participants learned novel words in VR contexts that incorporated motor enactment (e.g., hand gestures and object manipulation).
However, recent studies have revealed inconsistent findings, showing that the higher level of immersion achieved by VR technology may not necessarily lead to better novel word learning (Papin & Kaplan-Rakowski, 2022). One potential explanation of these divergent results is that when immersed in a VR context, learners bear an excessive cognitive load and their level of distraction is high due to rich complementary information (Papin & Kaplan-Rakowski, 2022; Sweller et al., 2011). The Cognitive Affective Model of Immersive Learning (CAMIL) points out that the increased visual field in an immersive VR context brings an abundance of subtle details that may not be necessary for learning but are nevertheless processed, leading to increased processing load and inhibition demands (Makransky et al., 2021). Therefore, individual differences in inhibitory control may be associated with novel word learning in a VR context. Specifically, learners with more efficient inhibitory control may allocate more attention to learning targets and be less affected by distractions (Kapa & Colombo, 2014).
Moreover, some studies have examined the role of prior language experience in novel word learning and note that learning performance may be influenced by prior learning contexts (Bogulski et al., 2019; Hirosh & Degani, 2018). According to the transfer hypothesis, for the bilingual learners in the present study, if the learning context of a novel language (e.g., a classroom) resembles the context in which their L2 was learned (i.e., also a classroom), this shared context may facilitate learning of the novel language (Nair et al., 2016). Given that all participants of the present study learned their L2 in a classroom context without immersion experience, we selected the age of L2 acquisition (L2 AoA) of bilinguals as reflective of their prior language learning experience. To some extent, learners who acquired their L2 at a later age, and thus have less classroom learning experience, are expected to adapt better to an immersive context, whereas an earlier L2 AoA may work against immersion effects. Although preliminary findings on novel word learning in a VR context suggest similar sensitivity to individual differences (Legault et al., 2019a), more research is needed to fully understand these effects by considering inhibitory control and prior language experience.
1.3 Present study
The present study used EEG technology to investigate how an immersive VR context affects novel word learning, particularly early lexical form acquisition, and compared these effects with a picture-word (PW) association learning context. A group of Chinese–English bilinguals received three days of training in which they learned novel words in German in both a PW association context and an immersive VR context. After the training, participants completed a recognition task to test whether the target words had been learned. During the task, we measured participants’ electrophysiological activity using EEG. Moreover, before training, we administered a language background questionnaire and a modified Flanker task to collect data on potential individual differences in L2 AoA and inhibitory control ability (Jiao et al., 2019, 2022).
The high temporal resolution of EEG data allows us to investigate the exact time-course of novel word learning. Based on previous work and our research objectives, we focus on mean amplitudes of three negative-going ERP components: N100, N200, and N400 (Basirat et al., 2018; Liu & van Hell, 2020). The N100 occurs just after stimulus onset and has been associated with visually selective attention and sensory/perceptual processing (Biau et al., 2018; Vogel & Luck, 2000). The N200 component has a scalp distribution across the fronto-central electrode sites and is widely defined as a marker of early lexical selection and phonological processing (Connolly & Phillips, 1994; Friedrich & Friederici, 2008; van den Brink et al., 2001), as well as conflict monitoring (Mathalon et al., 2003; van Veen & Carter, 2002). The N400 component, which peaks around 400 ms after stimulus onset (Kutas & Federmeier, 2011), is said to reflect lexical-semantic access of words (Liu & van Hell, 2020).
The present study tests three hypotheses. First, for behavioral performance, we expect that the multisensory learning context achieved by VR technology will enhance learning outcomes in the recognition task compared with the PW context. Second, with respect to electrophysiological activity, we expect that superior performance for VR-learned words will emerge in the N100 and N200 components, but not in the N400 component. This is because the recognition task in the present study mainly measures the initial formation of lexical representations, whereas the N400 component is widely associated with semantic access of lexical representations. Third, based on the CAMIL model and the transfer hypothesis, we hypothesize that individual differences in inhibitory control and prior language experience will be related to learning outcomes for VR-learned words. We anticipate that a smaller flanker effect will be associated with better performance because efficient inhibitory control will decrease distractions in the VR context. Moreover, we expect that an early L2 AoA may hinder learning performance because individuals who have more prior language learning experience in classroom settings may find it harder to adapt to an immersive learning context.
2. Method
2.1 Participants
Thirty-five Chinese learners of L2 English were recruited to take part in the study. Five participants were excluded because of excessive EEG artifacts, leaving data from 30 participants (22 females, 8 males) to form part of the statistical analyses. All participants were right-handed adults (mean age = 20, range = 18–23) with normal or corrected-to-normal vision. Before beginning the formal experiment, the participants were asked to complete a language background questionnaire in which they also rated their L1 and L2 proficiency levels. The participants reported having no experience living abroad and that they had begun learning L2 English on average at 8.76 (SD = 1.35) years. The self-ratings of language proficiency were based on a 7-point scale (1 = very poor, 7 = excellent) and revealed that L1 Chinese proficiency (mean = 6.18, SD = .66) was significantly higher than L2 English (mean = 3.93, SD = .79), t = 10.72, p < .01. Moreover, all participants reported no prior knowledge of German, the language in which novel words were to be learned in the experiment. The local ethics committee approved the study, and all individuals provided written consent prior to participating in the experiment and received a modest payment for their participation.
2.2 Materials
A total of 40 German words were auditorily presented to participants across two learning conditions (20 words learned in an immersive VR condition and 20 words learned in a PW condition). All 40 words denoted common concepts that could be found in a home setting (e.g., Schüssel ‘bowl’, Messer ‘knife’). The words were recorded by a highly proficient Chinese-German female speaker in a sound-proof room. We asked a control group of 21 L1 Chinese speakers with a similar L2 English proficiency level as the participants to assess whether the German words sounded like any words they knew in Chinese or English. Their judgements on a 5-point scale (1 = very dissimilar, 5 = very similar) showed that all German words were considered dissimilar to Chinese or English. The rationale for choosing German as the language of the to-be-learned words is that all participants reported no prior knowledge of or experience with German. Moreover, previous work examining Chinese–English bilinguals has also selected German as the target language (Jiao et al., 2021, 2022).
Stimuli for the PW condition consisted of line drawings from Snodgrass and Vanderwart's (1980) standardized picture database (Zhang & Yang, 2003), and stimuli for the VR condition included a fully immersive environment and colored three-dimensional (3D) objects selected from a standardized database (Peeters, 2018). The immersive VR condition simulated an apartment consisting of a living room, bedroom, and kitchen. This was presented and edited using the software Unity (https://unity.com). To identify which 3D objects from the standardized database (Peeters, 2018) to include as experimental materials, we first recruited a group of 108 L1 Chinese speakers from the same population, who did not participate in the formal experiment, to rate each 3D object's image-name agreement, familiarity, and visual complexity on a 5-point scale. We chose the target objects considering these ratings and the appropriateness of their fit in the virtual environment of the present study (i.e., an apartment).
2.3 Procedure and measures
Participants completed the experimental procedure over four days. Days 1–3 involved learning sessions and Day 4 was the testing session on which a recognition task was administered. Also on Day 1, participants completed a language background questionnaire and a modified Flanker task.
2.3.1 Learning sessions
During the learning sessions, participants were instructed to learn a set of 20 German words in a VR condition equipped with headgear and handsets and an additional 20 German words in a PW condition on a desktop computer. On each of the three days, participants performed one learning session in the VR condition for 15 minutes and one session in the PW condition for 15 minutes. Participants were given a brief break between the two learning conditions to prevent fatigue. The order of learning conditions was counterbalanced across participants (i.e., half of the participants performed the PW condition first and the other half performed the VR condition first). Moreover, the sets of words presented in the two conditions were counterbalanced across participants (i.e., half of the participants learned one set of words in the PW condition, while the other half learned that same set in the VR condition).
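The two counterbalancing factors described above (condition order and word-set assignment) can be crossed in a simple rotation. The following is a minimal sketch only: the function name, the set labels "A" and "B", and the rule mapping participant index to cell are illustrative assumptions, not the assignment procedure actually used in the study.

```python
def assign_conditions(participant_index):
    """Return (first_condition, pw_word_set, vr_word_set) for a 0-based index.

    Hypothetical rotation: condition order alternates on every participant,
    word-set assignment alternates on every second participant, so each run
    of four consecutive participants covers all four order-by-set cells.
    """
    first_condition = "PW" if participant_index % 2 == 0 else "VR"
    if (participant_index // 2) % 2 == 0:
        pw_word_set, vr_word_set = "A", "B"
    else:
        pw_word_set, vr_word_set = "B", "A"
    return first_condition, pw_word_set, vr_word_set


# Assignments for a sample of 30 participants (the analyzed sample size).
assignments = [assign_conditions(i) for i in range(30)]
```

Under this rotation, condition order is split evenly for any even sample size, whereas the full order-by-set crossing is exactly balanced only when the sample size is a multiple of four.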
In the PW learning condition, individual 2D line drawings were displayed in the center of a computer screen and were accompanied by their spoken name in German. These recordings were played through headphones. Participants then were required to press the response key to move to the next word. During the 15 minutes, participants could repeat through the PWs as many times as they desired. Before starting the PW learning session, participants were administered a practice block of 5 trials in Chinese so that they could familiarize themselves with the experimental procedure.
In the immersive VR learning condition, HTC VIVE headgear and handsets delivered high quality visual fidelity and an engaging experience. Prior to learning the German words, the participants were shown how to use and interact with the VR equipment. During this practice, participants saw the same virtual apartment setting as in the experiment, but when selecting objects, the words were played in Chinese. After they were familiar with the equipment, the 15-minute learning session began in which they physically moved throughout the virtual apartment and used the handset to laser point to 3D objects. Upon selecting the objects, they heard the corresponding German words through the headphones.
After the self-paced learning sessions on each day, all participants completed a word recall task as practice. The recall task was cued by pictures of the target words, and participants were asked to name them in German. Their responses were recorded and accuracy was assessed by two experimenters. Data from two participants were not recorded due to equipment malfunction. The results of the remaining 28 participants showed a gradual increase in accuracy across the three learning sessions (Day 1: PW = 41%, VR = 43%; Day 2: PW = 71%, VR = 75%; Day 3: PW = 86%, VR = 90%).
2.3.2 Testing session
On Day 4, the participants performed a recognition task in which they determined whether auditorily presented German words had been among those included in the learning sessions. During the task, participants’ behavioral performance and electrophysiological activity were measured. The task consisted of three types of trials (i.e., PW-learned words, VR-learned words, and not-learned words). Not-learned words were real German words recorded by the same speaker that did not appear in the learning sessions. The entire task consisted of 2 blocks, with each block consisting of 60 trials presented in a random order. Each participant performed 40 PW-learned trials, 40 VR-learned trials, and 40 not-learned trials.
Trials began with a fixation cross which was presented at the center of a computer screen for 400 ms. After a blank interval of 200 ms, a target word was played through headphones and participants were required to identify whether it had been learned or not by pressing the left or right response keys, respectively. The response keys were counterbalanced across participants. Once a response was given or after a maximum duration of 8000 ms, a blank screen was presented for 1000 ms prior to the next trial. Before the formal experiment, the participants were presented with 10 practice trials to become familiar with the procedure.
2.3.3 Cognitive task
Inhibitory control was measured by a modified Flanker task (Fan et al., 2002; Legault et al., 2019a). In the task, arrows or lines were presented on the screen and participants were instructed to respond to the direction of the center arrow by pressing the left or right response key. The task consisted of three blocks, with 48 trials in each block presented in random order. There were three types of trials, which were presented equally often throughout the task: congruent trials, incongruent trials, and neutral trials. In congruent trials, the flanker arrows pointed in the same direction as the center arrow. In incongruent trials, the four flanker arrows pointed in the opposite direction to the center arrow. In neutral trials, the center arrow was surrounded by lines without direction information. In our analyses, we used the flanker effect (i.e., the performance difference between incongruent and congruent trials) to index inhibitory control.
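As an illustration of how a flanker effect of this kind can be computed, the sketch below takes the difference between mean response times on incongruent and congruent trials. Two points are assumptions made for the example rather than details reported here: that performance is indexed by response time, and that only correct trials enter the means. The trial dictionary keys are likewise hypothetical.

```python
from statistics import mean


def flanker_effect(trials):
    """Mean incongruent RT minus mean congruent RT, in ms.

    `trials` is a list of dicts with hypothetical keys 'condition'
    ('congruent', 'incongruent', or 'neutral'), 'rt' (ms), and 'correct'
    (bool). Neutral trials and error trials are ignored (an assumption).
    """
    rts = {"congruent": [], "incongruent": []}
    for trial in trials:
        if trial["correct"] and trial["condition"] in rts:
            rts[trial["condition"]].append(trial["rt"])
    return mean(rts["incongruent"]) - mean(rts["congruent"])
```

A smaller value indicates less interference from the flanking arrows, which is conventionally read as more efficient inhibitory control.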
2.4 Electrophysiological recordings and preprocessing
Electrophysiological data were recorded using 64 Ag/AgCl electrodes placed according to the extended 10–20 positioning system and were referenced online to the FCz electrode. All channels were amplified with a band pass of .05–100 Hz and a sampling rate of 1000 Hz. Electrode impedance was kept below 5 kΩ. EEG preprocessing and analyses were conducted using the EEGLAB toolbox. The signal was band-pass filtered at 1–40 Hz and re-referenced offline to the averaged left and right mastoids (TP9 and TP10). Eye blinks and other artefacts in the signal were corrected for each subject using independent component analysis (ICA). Epochs from 200 ms before to 800 ms after the onset of the target word were extracted. Baseline correction was performed in reference to pre-stimulus activity (Liu et al., 2022).
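The epoching and baseline-correction steps can be made concrete with a short sketch for a single channel. This is not the EEGLAB pipeline used in the study; it is a plain-Python illustration that assumes the baseline is the mean of the full 200 ms pre-stimulus interval and that the epoch excludes the sample at +800 ms. The function and parameter names are hypothetical.

```python
def extract_epoch(signal, onset_index, sfreq=1000, tmin_ms=-200, tmax_ms=800):
    """Cut one epoch around a word onset and subtract the pre-stimulus mean.

    `signal` is a list of voltages for one channel, `onset_index` is the
    sample at which the target word starts. The baseline window (assumed
    here) runs from tmin_ms up to, but not including, the onset sample.
    """
    start = onset_index + tmin_ms * sfreq // 1000
    stop = onset_index + tmax_ms * sfreq // 1000
    if start < 0 or stop > len(signal):
        raise ValueError("epoch extends beyond the recording")
    epoch = signal[start:stop]
    n_baseline = onset_index - start  # number of pre-stimulus samples
    baseline = sum(epoch[:n_baseline]) / n_baseline
    return [value - baseline for value in epoch]
```

At the 1000 Hz sampling rate reported above, each epoch holds 1000 samples, with the first 200 forming the baseline.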
2.5 Data analyses
Both behavioral and ERP data from the recognition task were analyzed with linear mixed-effects models in R using the lme4 package (Bates et al., 2014). Each model included a three-level variable of learning type as a fixed effect, with subject and item as random effects. The learning type variable (PW-learned, VR-learned, and not-learned) was coded with orthogonal contrasts in which the first contrast compared not-learned words to learned words (i.e., the average of PW- and VR-learned words) and the second contrast compared PW-learned words to VR-learned words, consistent with our research objectives. We started with a full model including the fixed effect, random intercepts for subjects and items, and random slopes for all predictors (Barr et al., 2013). When the full models failed to converge, we followed a backward-fitting procedure to identify a model that would converge.
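One way to write down a pair of orthogonal contrasts of the kind described is shown below. The specific weights and their signs are assumptions chosen for illustration (the exact coding is not reported here); what matters is that each contrast sums to zero and that the two contrasts have a zero dot product.

```python
from fractions import Fraction

# Hypothetical weights: (contrast 1, contrast 2) for each level.
# Contrast 1: not-learned vs. the average of the two learned types.
# Contrast 2: VR-learned vs. PW-learned (not-learned weighted zero).
CONTRASTS = {
    "not_learned": (Fraction(2, 3), Fraction(0)),
    "pw_learned": (Fraction(-1, 3), Fraction(-1, 2)),
    "vr_learned": (Fraction(-1, 3), Fraction(1, 2)),
}


def is_orthogonal(contrasts):
    """Check that both contrasts sum to zero and are mutually orthogonal."""
    first = [weights[0] for weights in contrasts.values()]
    second = [weights[1] for weights in contrasts.values()]
    dot_product = sum(a * b for a, b in zip(first, second))
    return sum(first) == 0 and sum(second) == 0 and dot_product == 0
```

With weights scaled this way, each coefficient corresponds to a difference between means: not-learned minus the learned average for the first contrast, and VR minus PW for the second.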
Response times (RTs) were measured from the onset of target words. Trials with error responses (2.94%), RTs higher than 3000 ms (3.89%), and RTs more than 2.5 standard deviations (SD) from the mean (3.14%) were removed from the analyses. For the ERP data, we examined the mean amplitude of waveforms across the selected time-windows of N100 (100–150 ms), N200 (250–350 ms), and N400 (400–600 ms) from the recognition task. Based on previous studies examining foreign language learning (Biau et al., 2018; Kutas & Federmeier, 2011), the N100 and N200 components were analyzed at frontocentral electrode sites (F1, Fz, F2, FC1, FCz, FC2, C1, Cz, C2), and the N400 component was analyzed at central-parietal sites (C1, Cz, C2, CP1, CPz, CP2, P1, Pz, P2). Finally, in order to reveal the effect of individual differences, we calculated the correlation between the measures from the recognition task (RT and ERP indicators) and the individual differences indicators (flanker effect and L2 AoA).
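The trial-exclusion rules and the windowed mean amplitudes can be sketched as two small helpers. Several details are assumptions made for the example: the exclusions are applied in the order listed, the 2.5 SD criterion uses the grand mean of the remaining trials (rather than per-participant or per-condition means), and each time window is half-open. Dictionary keys and function names are hypothetical.

```python
from statistics import mean, stdev

# Time windows (ms after word onset) as reported in the text.
WINDOWS_MS = {"N100": (100, 150), "N200": (250, 350), "N400": (400, 600)}


def trim_rts(trials, max_rt=3000, sd_cutoff=2.5):
    """Drop error trials, RTs above max_rt, then RTs beyond sd_cutoff SDs."""
    kept = [t for t in trials if t["correct"] and t["rt"] <= max_rt]
    rts = [t["rt"] for t in kept]
    centre, spread = mean(rts), stdev(rts)
    return [t for t in kept if abs(t["rt"] - centre) <= sd_cutoff * spread]


def mean_amplitude(epoch, component, sfreq=1000, tmin_ms=-200):
    """Mean voltage of a baseline-corrected epoch within a component window.

    `epoch` is a list of samples whose first sample lies at tmin_ms.
    """
    start_ms, stop_ms = WINDOWS_MS[component]
    start = (start_ms - tmin_ms) * sfreq // 1000
    stop = (stop_ms - tmin_ms) * sfreq // 1000
    return mean(epoch[start:stop])
```

In practice the amplitude would be averaged across the electrode sites of the relevant region before entering the statistical models.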
3. Results
3.1 Behavioral results
Figure 1 presents the mean accuracy (ACC, left) and RTs (right) from the recognition task. For ACC, we fit a logistic mixed-effects model with learning type as a fixed effect and by-subject and by-item intercepts as random effects. Learning type was coded with the orthogonal contrasts described in Section 2.5. The results showed that the contrast between not-learned words (M = 95%, SD = 22) and all learned words (i.e., the average ACC of VR-learned [M = 97%, SD = 17] and PW-learned words [M = 99%, SD = 9]) was not significant, Estimate = .13, SE = .43, z = .30, p = .76. Moreover, the difference between VR-learned words and PW-learned words was not significant, Estimate = −.43, SE = .42, z = −1.03, p = .30.
For the analyses on RTs, the linear mixed-effects model included learning type as a fixed effect and included the by-subject random intercept and the by-item random slope for learning type as random effects. The results showed that there was no significant difference between not-learned words (M = 1557 ms, SD = 302) and all learned words, Estimate = 4.64, SE = 19.75, t = .23, p = .81. However, the comparison between VR-learned and PW-learned words reached significance, with faster responses for VR-learned words (M = 1542 ms, SD = 329) compared to PW-learned words (M = 1563 ms, SD = 341), Estimate = −29.63, SE = 14.14, t = −2.09, p = .04. Further, considering not-learned words as a baseline, the results showed that neither the comparison between not-learned and PW-learned words (Estimate = 19.45, SE = 20.10, t = .97, p = .34), nor the comparison between not-learned and VR-learned words (Estimate = -10.18, SE = 21.82, t = −.47, p = .64) reached significance. Overall, in behavioral performance, participants reacted to VR-learned words more quickly as compared to PW-learned words while there was no difference in accuracy.
3.2 ERP results
Figure 2 shows the grand average ERP waveforms elicited during the recognition task. The mixed-effect model for N100 amplitude included the fixed effect of learning type, and the random effects of by-subject and the by-item intercepts. As in the analyses on behavioral data, the three-level variable of learning type was the orthogonal contrast. The results of N100 amplitude showed that the VR-learned words elicited more negative waveforms than PW-learned words, Estimate = −.94, SE = .36, t = −2.58, p = .01. But the comparison between not-learned words and all learned words was not significant, Estimate = −.19, SE = .35, t = −.53, p = .60. For N200 amplitude, the model included learning type as a fixed effect with by-subject and by-item intercepts as random effects. The results of N200 amplitude showed that the not-learned words elicited significantly more negative waveforms than learned words, Estimate = .74, SE = .34, t = 2.16, p = .03. Moreover, the comparison between the two learning contexts showed more negative waveforms for VR-learned words than PW-learned words, Estimate = −.74, SE = .37, t = −2.01, p = .04. However, the model for the N400 component, with the same fixed effect and random effects as in the N200 analyses, revealed no significant differences between all words learned and not-learned words, Estimate = .27, SE = .28, t = .97, p = .33, nor between PW and VR, Estimate = −.13, SE = .30, t = −.42, p = .67. Overall, the differences between PW- and VR-learned words began at the N100 time window, while significant differences between not-learned and learned words emerged at the N200 time window.
3.3 Correlation results
To examine whether the effects of an immersive VR environment on novel word learning were related to individual differences, we conducted Pearson correlation analyses between performance on the recognition task and participants’ inhibitory control (as indexed by the flanker effect) and L2 AoA. We ran the analyses for both behavioral (RTs) and neurophysiological data (N100 and N200 components, but not N400 amplitude as there was no effect of learning context on this component). Results showed no significant correlations for RTs (flanker effect: r = .11, p = .57; L2 AoA: r = .03, p = .86) or for the N100 component (flanker effect: r = −.07, p = .72; L2 AoA: r = −.10, p = .60). For N200 amplitude in the VR condition, the correlation with the flanker effect was not significant (r = −.33, p = .07), but a significant correlation was observed with L2 AoA (r = −.46, p = .01). These effects can be seen in Figure 3.
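For reference, the Pearson coefficient used in these analyses can be computed from two equal-length lists of per-participant values, as in the minimal sketch below. The function name is hypothetical, and the sketch returns only r; the significance tests reported above are not reproduced.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)
```

Here `x` might hold each participant's L2 AoA and `y` the corresponding mean N200 amplitude for VR-learned words.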
4. Discussion
Combining behavioral performance and electrophysiological activity, the present study examined how an immersive VR context influences novel word learning and compared it with a picture-word (PW) association context. Recent research suggests that rich sensory experiences (e.g., video, VR technology) offer benefits for the lexical meaning access of novel words; however, less attention has been given to early lexical form acquisition (Legault et al., 2019a, 2019b; Li & Jeong, 2020; Makransky & Petersen, 2021). Using the high temporal resolution offered by EEG technology, we compared novel word learning in an immersive VR context and a PW association context among a group of Chinese–English speakers. Overall, our findings on a recognition task showed a positive effect of the rich sensory experience of an immersive VR context on novel word learning. We will elaborate on these findings in the next subsections.
One main finding of the present study concerns how the rich sensory experience of a VR context affects lexical form acquisition of novel words relative to a unisensory PW context. The behavioral performance demonstrated that responses to words learned in the immersive VR context were faster than responses to words learned in the PW context. This finding is consistent with previous studies demonstrating that enriched perceptual and sensorimotor experiences benefit L2 word learning (Ibáñez et al., 2011; Johnson-Glenberg et al., 2014; Legault et al., 2019a, 2019b). For example, Legault et al. (2019a) asked English speakers to learn Mandarin Chinese words through immersive VR and word-word association. Semantic access of novel words was examined by an alternative forced-choice recognition task, and results revealed a beneficial effect of the VR learning context compared to the word-word association context. Unlike the study of Legault et al. (2019a), the task in the present study was to recognize the pronunciation of words and only judge whether they had been learned or not. This allowed us to examine lexical form acquisition, rather than semantic acquisition, of novel words.
The comparison between VR and PW in the electrophysiological data showed enhanced N100 and N200 amplitudes for words learned in the VR context, but no corresponding enhancement in N400 amplitude. Compelling evidence for the effect of the VR context on novel words comes from the N100 component: the amplitude for words learned via immersive VR was significantly more negative than for words learned in the PW context. Given that the N100 component has been linked to early acoustic processing (Biau et al., 2018; Vogel & Luck, 2000), our results suggest that VR-learned and PW-learned words differ in the early stage of phonetic processing and recognition. The results for the N200 component, which reflects early lexical selection and phonological processing (Connolly & Phillips, 1994; van den Brink et al., 2001), revealed that word-form recognition during initial processing was distinct between VR-learned and PW-learned words. For words learned in the VR context, their sound information could be linked both to the corresponding 3D object through visual perception and to the surrounding environment through physical interaction. For words learned in the PW context, by contrast, their sound information was linked only to visual perception (i.e., the corresponding picture). Hence, these results suggest that the rich sensory experience afforded by VR affects lexical form acquisition of novel words. This interpretation is particularly supported by the differences in the N100 effect between the VR and PW contexts.
In addition, given that word-form recognition processes can only occur for learned words, it is less likely that the pronounced N200 changes we observed for the not-learned words reflect phonological processes. Therefore, we speculate that the enhanced N200 effect elicited by not-learned words originated from cognitive rather than phonological processes. Previous studies have shown that N200 effects are sensitive to conflict monitoring and attention control (Mathalon et al., 2003; van Veen & Carter, 2002). Drawing on this background, the N200 changes we found for not-learned words might imply that this condition placed different demands on monitoring and controlling for conflicting information than learned words did. The involvement of the N200 component during lexical form acquisition and discrimination is worthy of a more comprehensive investigation.
The theoretical support for the effect of the VR context is closely related to embodied cognition theory (Barsalou, 2008; Li & Jeong, 2020). This theory argues that the interaction between a learner's body and their learning environment plays a key role in defining their experience and the extent to which knowledge is acquired (Barsalou, 2008). To some extent, in traditional picture-word/word-word paired association, learners rely on rote memorization in which they are presented with novel words alongside their L1 translation equivalents. In contrast, an immersive learning context supported by VR technology provides a multisensory experience that allows learners to move around in and interact with a contextualized environment, thus leading to superior learning outcomes. Overall, the rich sensory experience involved in VR contexts enables learners to better connect novel words with their lexical forms, pronunciation, and conceptual representations, and thus improves their recognition (Malt et al., 2015; Zinszer et al., 2014).
In our analyses, we also explored potential modulating effects of L2 AoA. We observed a relationship between L2 AoA and N200 amplitude, but not with inhibitory control. Based on the transfer hypothesis and our objectives, we selected L2 AoA as an indicator of prior L2 learning experience because it reflects the amount of language learning exposure in traditional classroom contexts. To some extent, an earlier L2 AoA among participants in the present study represents an accumulation of language learning experiences in classroom contexts (i.e., non-immersive environments). We speculate that such accumulated experience may benefit novel word learning through similar learning methods (e.g., picture-word/word-word paired association), but not through immersive contexts. This was reflected in the negative relationship between L2 AoA and N200 amplitude. Given that few EEG studies have examined word learning in VR contexts, we approached the potential effect of L2 AoA from an exploratory perspective. A limitation of the present study is that we examined only one individual difference (i.e., L2 AoA). Future studies should consider other aspects of prior language experience, such as language proficiency, language dominance, and other linguistic skills.
We acknowledge that there were some unexpected findings and limitations in the present study. First, we failed to observe a pronounced difference between not-learned and novel-learned words in the behavioral performance. Because novel word learning is a gradual process, the link between novel words and the lexical network is not fully established (Liu & van Hell, 2020); thus, distinguishing novel-learned words from not-learned words is likely not automatic, and the difference may go undetected by the less sensitive measure of RTs. Second, the gender makeup of our sample was unbalanced, with a female majority. Although some previous studies have included a similar distribution in their samples, there is no direct evidence revealing whether gender may affect learning outcomes in a VR context. Therefore, future studies focusing on immersive VR learning should also consider the potential role of demographic factors such as gender and age.
5. Conclusion
In the present study, Chinese–English bilinguals learned novel words in German, a language with which they had no prior experience, in an immersive VR context, and we compared their learning outcomes with those in a PW context. The results of a recognition task revealed faster responses and enhanced N100 and N200 amplitudes for words learned in the VR context compared to words learned in the PW context, suggesting that learning context affects early lexical form acquisition. Moreover, we found that L2 AoA was related to the N200 amplitude of VR-learned words. The findings of the present study provide the first evidence that an immersive VR context with a rich sensory experience can have facilitative effects on early lexical form acquisition. Our study also provides neural evidence for embodied cognition theory by considering immersive learning. It will be highly beneficial for our understanding of novel word learning if future studies continue to systematically investigate the effects of multisensory learning environments.
Acknowledgments
This study was supported by the National Natural Science Foundation of China (62107024), the Natural Science Foundation of Shandong (ZR2021QF012), the Project of Humanities and Social Science of the Ministry of Education (22YJC190015), and the Institute of Psychology, CAS (GJ202004).