
Better early than late: the temporal dynamics of pointing cues during cross-situational word learning

Published online by Cambridge University Press:  08 October 2024

Rachael W. Cheung*
Affiliation:
Department of Health Sciences, University of York, Heslington, York YO10 5DD, UK; Department of Psychology, Fylde College, Lancaster University, Bailrigg, Lancaster LA1 4FY, UK
Calum Hartley
Affiliation:
Department of Psychology, Fylde College, Lancaster University, Bailrigg, Lancaster LA1 4FY, UK
Padraic Monaghan
Affiliation:
Department of Psychology, Fylde College, Lancaster University, Bailrigg, Lancaster LA1 4FY, UK
Corresponding author: Rachael W. Cheung; Email: rachael.cheung@york.ac.uk

Abstract

Learning the meaning of a word is a difficult task due to the variety of possible referents present in the environment. Visual cues such as gestures frequently accompany speech and have the potential to reduce referential uncertainty and promote learning, but the dynamics of pointing cues and speech integration are not yet known. If word learning is influenced by when, as well as whether, a learner is directed correctly to a target, then this would suggest temporal integration of visual and speech information can affect the strength of association of word–referent mappings. Across two pre-registered studies, we tested the conditions under which pointing cues promote learning. In a cross-situational word learning paradigm, we showed that the benefit of a pointing cue was greatest when the cue preceded the speech label, rather than following the label (Study 1). In an eye-tracking study (Study 2), the early cue advantage was due to participants’ attention being directed to the referent during label utterance, and this advantage was apparent even at initial exposures of word–referent pairs. Pointing cues promote time-coupled integration of visual and auditory information that aids encoding of word–referent pairs, demonstrating the cognitive benefits of pointing cues occurring prior to speech.

This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Introduction

The environment surrounding the language learner is busy and multifaceted, with many sources of information that convey meaning (Holler & Levinson, Reference Holler and Levinson2019), such as auditory cues (e.g. sound-based information in speech) and visual cues (e.g. facial expressions and body movements). How does the language learner navigate this complexity of information to aid their learning? In this study, we investigate how the temporal production of two such information sources – words and gestures – is combined by the adult language learner to disambiguate and retain novel word–referent relationships.

Learning new vocabulary involves determining how unfamiliar words relate to aspects of the environment (referent selection) and then encoding these pairings for later retrieval (retention).

Even when restricted to learning only associations between nouns and objects, there are multiple possible mappings between words and the correct object (‘referent’) available to the learner (Yu & Ballard, Reference Yu and Ballard2007). Consequently, constraints that have been proposed to address how to correctly pair words and referents have tended to focus on biases internal to the learner that guide their referent selection, such as mutual exclusivity (Halberda, Reference Halberda2006; Markman & Wachtel, Reference Markman and Wachtel1988) or assuming a novel label refers to a novel object (Carey & Bartlett, Reference Carey and Bartlett1978; Golinkoff et al., Reference Golinkoff, Hirsh-Pasek, Bailey and Wenger1992). However, these strategies cannot be applied by learners in situations where all potential referents are novel.

An alternative approach is to consider how information from the wider environment can contribute to general learning processes, such as cross-situational statistics (Siskind, Reference Siskind1996). Cross-situational statistics refers to the aggregation of information and commonalities across several, rather than single, learning instances (Yu & Smith, Reference Yu and Smith2007). Thus, a learner can acquire novel label–object pairs by tracking the co-occurrence of words and objects across multiple exposures (e.g. Fitneva & Christiansen, Reference Fitneva and Christiansen2011; Monaghan & Mattock, Reference Monaghan and Mattock2012; Roembke & McMurray, Reference Roembke and McMurray2016; Smith et al., Reference Smith, Smith and Blythe2011; Yu & Smith, Reference Yu and Smith2007; Yurovsky et al., Reference Yurovsky, Yu and Smith2013).

However, cross-situational statistics represent only one source of environmental information that a learner can utilise when faced with multiple unknown referents. Other environmental cues, such as gaze direction, prosody and gesture cues (e.g. Hollich et al., Reference Hollich, Hirsh-Pasek, Golinkoff, Brand and Brown2000), might be combined with cross-situational word learning to facilitate mapping of word–referent pairs (Dunn et al., Reference Dunn, Frost and Monaghan2024; Hartley et al., Reference Hartley, Bird and Monaghan2020; Monaghan et al., Reference Monaghan2017; Yu & Ballard, Reference Yu and Ballard2007). For instance, pointing cues (e.g. deictic gestures or gaze direction) may modulate the degree of referential ambiguity by directing learners towards the intended referent, reducing the formation of spurious word–object associations (MacDonald et al., Reference MacDonald, Yurovsky and Frank2017). In a cross-situational word learning study, Dunn et al. (Reference Dunn, Frost and Monaghan2024) found that including reliable gaze direction as a cue to target referents for novel words increased looks to targets over foil objects compared to when gaze was less reliably coordinated with cross-situational statistics.

In adult cross-situational word learning, the presence of a visual gesture cue (implemented as an arrow pointing to the intended referent) resulted in higher accuracy (Monaghan et al., Reference Monaghan2017), showing that learners are able to combine information from speech and gesture to constrain their formation of novel word–referent associations. In more naturalistic learning situations, deictic pointing cues in parent–infant communication have been shown to support a high degree of accuracy in identifying a word’s intended referent when adults watch recordings of the interactions (Cartmill et al., Reference Cartmill, Armstrong, Gleitman, Goldin-Meadow, Medina and Trueswell2013; Frank et al., Reference Frank, Tenenbaum and Fernald2013). Taken together, these studies show that auditory and gesture information can be combined to reduce referential uncertainty and support word learning.

Co-occurrence of gesture and speech is prevalent in communication, both in terms of deictic gestures indicating the place and iconic gestures indicating the form of referents (Goldin-Meadow, Reference Goldin-Meadow2003; Kita, Reference Kita2009; McNeill, Reference McNeill2000). Furthermore, gestures tend to precede referential speech in production (Beun & Cremers, Reference Beun and Cremers1998; Levelt et al., Reference Levelt, Richardson and La Heij1985; McNeill, Reference McNeill1985), with gesture onset tightly linked in timing to the production of the referring word rather than constrained by the production requirements of the utterance (Chu & Hagoort, Reference Chu and Hagoort2014). In a multimodal corpus study of a large number of utterances coded for co-occurring gestures, Donnellan et al. (Reference Donnellan, Özder, Man, Grzyb, Gu and Vigliocco2022) found that deictic gesture onset to a referent tended to occur approximately 370ms prior to the onset of a referential word.

Despite the numerous studies of temporal arrangement of gesture and speech production, the utility of this gesture–speech sequencing has not been studied in detail. In the Human Simulation Paradigm (HSP; Gillette et al., Reference Gillette, Gleitman, Gleitman and Lederer1999), adult participants guess ‘missing’ words from parent–child interaction videos, where the target word is obscured by an auditory ‘beep’ (e.g. ‘where’s the [obscured target word]?’). Scoring participants’ accuracy of guess provides a measure of how informative any surrounding cues are when identifying the target word. Trueswell et al. (Reference Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow and Gleitman2016) found that timing of gestures made by parents within parent–child interaction videos predicted the accuracy of other adult participants’ guesses regarding the intended referent. Shifting the obscuring ‘beep’ 2–4 seconds away from actual word occurrence significantly reduced guessers’ accuracy in identifying the target referent.

Nirme et al. (Reference Nirme, Haake, Gulz and Gullberg2020) investigated how timing of deictic gesture and speech affected judgements of naturalness for communicative acts. They found that gestures occurring 500ms before or after labelling an object resulted in no effect, except when the gesture coincided with a pause in speech which reduced naturalness ratings. Habets et al. (Reference Habets, Kita, Shao, Özyurek and Hagoort2011) manipulated timing of iconic gestures and referential naming and found that gesture preceding word onset by 360ms resulted in effective semantic integration of gesture and speech information, as measured by EEG N400 signals, whereas gesture preceding words by 180ms or simultaneous occurrence resulted in less efficient integration (Habets et al., Reference Habets, Kita, Shao, Özyurek and Hagoort2011). Furthermore, Cavicchio and Busà (Reference Cavicchio and Busà2023) found that moving an iconic gesture from co-occurring with a verb reference in English to the beginning of the sentence containing the verb resulted in slower identification of the action by English additional language learners, though there was no significant difference for first language English speakers. These results indicate that not only the presence but also the timing of gestural cues relative to speech may be critical for supporting word–referent mapping (Trueswell et al., Reference Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow and Gleitman2016), though research has yet to directly demonstrate this effect in word learning.

Gesture occurring before speech, to orient attention to the intended referent, is consistent with studies of cued attention (e.g. Hauer and Macleod Reference Hauer and Macleod2006). Such attentional cue studies distinguish endogenous cues (e.g. arrows or eye gaze), where attention is directed voluntarily to a target, from exogenous cues (e.g. flashing lights), where attention is directed automatically due to sudden salient stimuli (Jonides Reference Jonides1981; Posner, Reference Posner1981). Naturalistic social cues during word learning, such as pointing cues, likely act as endogenous cues similar to those that are examined during attention shifting experiments (Brignani et al., Reference Brignani, Guzzon, Marzi and Miniussi2009). There appears to be temporal sensitivity to the role of these cues in adults; whereas exogenous cues quickly shift focused attention between a cue and a target at 50ms, shifts of focused attention due to endogenous cues may take up to 500ms (Berger et al., Reference Berger, Henik and Rafal2005; Shepherd & Müller, Reference Shepherd and Müller1989).

Therefore, the timing of a cue in relation to label utterance could be crucial to successful word–referent mapping and how attention is directed. Focusing attention on a referent shortly after it is labelled may be significantly less optimal for learning than focusing attention during label utterance following early cues prior to the naming event. Such an effect would suggest that the occurrence of gesture before naming in naturalistic communication (e.g. Donnellan et al., Reference Donnellan, Özder, Man, Grzyb, Gu and Vigliocco2022) may be optimal for language learning due to the (endogenous) attentional shift that it precipitates. However, these predictions from observational studies about the importance of gesture timing to word learning have not yet been tested in controlled studies. Observational studies are unable to systematically control the distribution of cues, their timing or other potential sources of information that may interact with gesture and speech.

In particular, we do not yet know whether a pointing cue to an intended referent occurring immediately before (versus after) speech may be critical for learning, nor whether the time window of sensitivity might be less than the 2s observed in Trueswell et al. (Reference Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow and Gleitman2016) and is perhaps closer to the 360ms asynchrony investigated by Habets et al. (Reference Habets, Kita, Shao, Özyurek and Hagoort2011) in their study of iconic gestures. Furthermore, although multiple sources of information may aid accurate referent selection, disambiguation of meaning does not necessarily reflect long-term learning. Accurate referent selection under referential ambiguity may reflect ‘fast’ in-moment problem-solving by the learner, whereas retention of novel words may occur as a ‘slow’ and gradual process, during which multiple exposures strengthen or weaken word–referent pairs over time (McMurray et al., Reference McMurray, Horst and Samuelson2012).

In these respects, investigating the timing of pointing cues is critical for refining models of word learning. If word learning is influenced by when, in addition to whether, the learner is directed to the intended referent, then this would suggest that strength of associations when acquiring word–referent mappings is influenced by the quality (and not only the quantity) of integration of visual and speech information (Bhat et al., Reference Bhat, Spencer and Samuelson2022). Such findings would signify the need to refine standard associative learning models where temporal contiguity has not been considered (McMurray et al, Reference McMurray, Horst and Samuelson2012; Yu & Smith, Reference Yu and Smith2012) and would provide evidence that the temporal relation found between gesture and speech production also has an effect on language learning. An alternative perspective is that the relative timing is more of an accident of production constraints (e.g. Chu & Hagoort, Reference Chu and Hagoort2014) and has no impact on word learning.

The current study

Across two studies, we examine how adult learners identify word–referent pairings by using environmental cues to reduce referential ambiguity, and how this might affect their subsequent retention of novel words. Our research addresses three novel questions: (1) What are the effects on learning accuracy of pointing cues that occur before, versus after, a referent is labelled? (2) Do any facilitative effects of pointing cues on referent selection accuracy also apply to longer-term retention of words? (3) What temporal dynamics of looking behaviour reflect learning from pointing cues presented before, versus after, labelling the referent?

Studies 1 and 2 investigated the temporal process of how pointing cues are integrated with auditory and visual information to support accurate cross-situational word learning. We manipulated the timing of a pointing cue (Study 1) and employed an eye-tracker to uncover how the dynamics of visual attention are affected by pointing cue timing (Study 2). In each study, we tested how our manipulations affected both immediate recall and retention (after a delay) of novel word–referent mappings. Given that Nirme et al. (Reference Nirme, Haake, Gulz and Gullberg2020) found some evidence that gesture–speech timing affected judgements of naturalness of the communicative situation, we also measured the extent to which participants were aware of the variation in timing between gesture and speech. We used a static photograph of a finger and hand as a pointing cue. This stimulus was chosen to limit additional visual information, such as oromotor movements associated with speech or eye gaze. Previous studies of pointing gestures and speech in human–machine interaction have sometimes used a virtual avatar (e.g. Kranstedt et al., Reference Kranstedt, Lücking, Pfeiffer, Rieser and Wachsmuth2006; Nirme et al. Reference Nirme, Haake, Gulz and Gullberg2020) or recorded human gestures (e.g. Cavicchio & Busà, Reference Cavicchio and Busà2023; Habets et al., Reference Habets, Kita, Shao, Özyurek and Hagoort2011). However, naturalistic gestures extend over a few hundred milliseconds (Donnellan et al., Reference Donnellan, Özder, Man, Grzyb, Gu and Vigliocco2022) and determining when a deictic gesture begins to provide referential information is imprecise. In this study, we aimed to investigate the close temporal relation of speech and pointing relative to word learning with control over the precise timing of the gesture cue. Furthermore, previous research has identified that operationalisation of gesture as a pointing hand is effective as a cue to learning in cross-situational word learning (Monaghan et al., Reference Monaghan2017). We reflect on potentially using more naturalistic gestures in future investigations in the General Discussion. All pre-registrations, data, experimental stimuli and tasks, and code for all analyses in this study are available on the Open Science Framework (OSF): https://osf.io/2m9pe/?view_only=9d64688d03d84704aa5f2e8f8eb34dc9.Footnote 1

Study 1: When are pointing cues in word learning most useful?

Study 1 investigated whether cue timing effects apply to adults’ use of pointing cues in cross-situational word learning. As endogenous cues appear to induce slower attention shifts than exogenous cues (Shepherd & Müller, Reference Shepherd and Müller1989), pointing cues that occur sometime before, rather than after, a label may be critical to encoding robust label–target associations and minimising spurious label–foil associations. We manipulated the timing of pointing cues relative to label utterance across two conditions: pointing appeared before or after the verbal label. In the HSP, Trueswell et al. (Reference Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow and Gleitman2016) found that shifting an obscured word 2 seconds earlier than the word’s original position was sufficient to reduce the accuracy score of those guessing the missing word from ~ 60% to ~ 43%. Furthermore, if the obscuring ‘beep’ was moved too early, guessers did not relate the visual event to the missing word, as the two were perceived as too temporally discontinuous. However, shifting attention between potential referents during word learning can happen very quickly (e.g. within 225ms, Halberda, Reference Halberda2006), and Habets et al. (Reference Habets, Kita, Shao, Özyurek and Hagoort2011) found that semantic integration changed between 360ms and 180ms asynchronies for iconic gestures and word naming. We therefore assessed whether sensitivity to cue timing can be observed in a smaller temporal window than tested by Trueswell et al. (Reference Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow and Gleitman2016) in the HSP by presenting pointing cues just 1 second before and after a novel label, at a point in between the parameters of Trueswell et al.’s (Reference Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow and Gleitman2016), Nirme et al.’s (Reference Nirme, Haake, Gulz and Gullberg2020) and Habets et al.’s (Reference Habets, Kita, Shao, Özyurek and Hagoort2011) studies.

We hypothesised that participants would respond more accurately on both immediate and retention trials when tested on words trained in the early pointing condition compared to the late condition. Early pointing cues may support cross-situational word learning by highlighting the target prior to (or at) label utterance, reducing spurious associations between the label and non-target foils. Late pointing cues may be less useful for word–referent mappings as any attentional shift that occurs due to the pointing cue will be after the crucial information (the label) has been uttered, reducing the chance to bind the auditory label and the visual referent together and robustly encode the association.

Method

Participants were twenty monolingual English-speaking adults without any sensory deficits (age M = 20.9 years, SD = 5.16, range = 18.0 – 39.0; 5 male, 15 female), as specified in the pre-registration. They were recruited via leaflets and the *** University research participation system, which allows all members of the University community to partake in research. Informed, written consent was obtained from all individuals prior to participation. Participants were either paid £3.50 or received course credit for taking part. The number of participants was specified in the pre-registration and based on previous studies that test cross-situational word learning using a similar paradigm (e.g. Monaghan et al., Reference Monaghan, Mattock, Davies and Smith2015; Monaghan & Mattock, Reference Monaghan and Mattock2012).

Materials All stimuli used can be found on OSF. Thirty-two novel objects and 32 novel two-syllable words were taken from the NOUN database (Horst & Hout, Reference Horst and Hout2016). Sound files for each word were made using the Serena system voice (Macintosh computer, OS 10.13). Each object and word were paired randomly for each participant to produce 32 word–object mappings. Pictures and audio were presented on a Macintosh computer (OS 10.13, 21.5-inch monitor, 1920 × 1080 resolution) using PsychoPy3 (Pierce & MacAskill, Reference Pierce and MacAskill2018). Participants used closed cup headphones.

Procedure Testing took place in a quiet room. Both studies included two training and test conditions and were run using a similar procedure. Participants first completed a warm-up with two familiar objects and words presented as they would be during training. The order of conditions was counterbalanced across all participants. During the first condition, participants were administered the first training block with one set of 16 word–referent pairs, followed by an immediate testing block, then a 5-minute distractor task (colouring in a geometric picture), before completing a retention testing block. They then repeated this process with another set of 16 word–referent pairs for the second condition.

Each correct word–referent pairing appeared four times per training condition, with 16 word–referent pairings to be learnt per condition. Screen position of the objects was pseudo-randomised so that the target appeared an equal number of times on the left and on the right. The order of trials within training blocks was pseudo-randomised with the constraint that referents appeared no more than twice in a row. Target objects also acted as foils for their non-associated words and were pseudo-randomised with the constraint of appearing an equal number of times across all trials. To ensure that participants could disambiguate words and referents based on cross-situational information, co-occurrences of the same targets and foils were minimised across trials.
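For readers who want a concrete picture of these constraints, the sketch below shows one possible way to generate such a training order in R. It is an illustration only (the actual trial lists are on OSF), and it simplifies the design by handling only the run-length and left/right constraints for targets, not foil assignment or co-occurrence minimisation; all names in it are hypothetical.

```r
set.seed(1)

n_pairs <- 16   # word-referent pairs per condition
n_reps  <- 4    # exposures per pair

make_order <- function() {
  trials <- data.frame(
    target = rep(seq_len(n_pairs), each = n_reps),
    side   = rep(c("left", "right"), times = n_pairs * n_reps / 2)  # 2 left, 2 right per pair
  )
  repeat {
    shuffled <- trials[sample(nrow(trials)), ]
    # resample until no target appears more than twice in a row
    if (max(rle(shuffled$target)$lengths) <= 2) return(shuffled)
  }
}

training_order <- make_order()
head(training_order)
```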

Training blocks Participants completed two cue conditions, an ‘early’ and a ‘late’ pointing cue condition, in counterbalanced order. These cues were blocked, which enabled us to probe participants’ awareness of cue timing differences at debrief, without the need for leading questions about the asynchrony. At all times, participants saw two novel objects on screen with a pointing cue – a picture of a hand pointing to the target appeared simultaneously with the referent. The target word in both conditions was played 500 milliseconds after referent presentation. In the early condition, participants saw the pointing cue 1 second before word utterance (Figure 1a). In the late condition, the pointing cue appeared 1 second after word utterance. In both conditions, the two referents appeared for the duration of the trial (3 seconds), label utterance occurred at the same time at the 2 second mark after the referents had first appeared, and the cue lasted for 1 second (Figure 1b). The timing of the pointing cue with the novel label was adjusted to ensure an equal amount of time before and after label utterance in both conditions.

Figure 1. Studies 1 and 2: Training trials, a) early pointing cue, b) late pointing cue condition.

Testing blocks In order to test learning accuracy for the word–referent pairs, participants were administered two testing blocks: immediate, which occurred immediately after training, and retention, which occurred after a 5-minute distractor task (colouring in a complex picture). Each word was tested on one immediate trial and on one retention trial. During test trials, all 16 referent objects were presented simultaneously on screen, and the learner was asked to click on the correct referent for each target word, requested in a random order (‘which is the [target word]?’; chance level = 0.0625; Figure 2). The on-screen positions of the referents differed for immediate and retention trials. Participants were asked at debrief after the study had finished if they had noticed any difference between the two training blocks and their response was recorded.

Figure 2. Studies 1 and 2 testing trial example: participants see all 16 referents for the given condition and are asked to click on the corresponding object for each novel word.

Statistical analysis

As pre-registered, accuracy of correct word–referent pairs was scored as 1 (correct) or 0 (incorrect) and entered into general linear mixed effects models (GLMEs), using glmer from the lme4 package (v.1.1-20, Bates et al., Reference Bates, Mächler, Bolker and Walker2015) in R Studio [v1.1.463; R v.3.6.3]. Separate analyses were conducted for immediate testing blocks, retention testing blocks and all testing blocks combined (i.e. immediate and retention testing blocks). This enabled direct comparison between trial types, reflecting the discrete processes that may underlie immediate referent selection and retention of novel words after a delay. All model fitting sequences began with a baseline model that contained only random effects. Subsequent models were then built progressively by adding individual fixed effects and comparing each model to the previously best-fitting model using log-likelihood comparisons (Barr et al., Reference Barr, Levy, Scheepers and Tily2013), selecting the more complex model if it was a significantly better fit. A frequentist approach was utilised, where comparisons p <.05 were classed as statistically significant.

For all models, we used sum-to-zero coding. For models predicting immediate testing accuracy, a fixed effect of pointing cue condition (‘1’ = early, ‘−1’ = late) was included. For models predicting retention accuracy, a fixed effect of pointing cue condition (‘1’ = early, ‘−1’ = late) was tested, followed by a fixed effect of accuracy for each word on immediate testing trials (‘1’ = correct, ‘−1’ = incorrect) and then for presence of interactions between condition and immediate accuracy. For models predicting overall accuracy across trial types, we first tested the fixed effects of pointing cue condition (‘1’ = early, ‘−1’ = late), then the effect of trial type (‘1’ = immediate, ‘−1’ = retention) and then for the presence of interactions between condition and trial type. For all models, random effects of participant, target word and target object and test order (early or late condition first) were included, and random slopes of condition were fitted for each random intercept unless this prevented the model from converging.
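As a concrete illustration of this model-building procedure, the sketch below shows how the immediate-test models might be fitted and compared in R with lme4, as named above. The data frame and column names (dat, accuracy, condition, participant, word, object, test_order) are placeholders for illustration, not the authors' analysis script (which is on OSF).

```r
library(lme4)

# sum-to-zero coding of the pointing cue condition ('1' = early, '-1' = late)
dat$condition <- ifelse(dat$condition == "early", 1, -1)

# baseline model: random effects only
m0 <- glmer(accuracy ~ 1 +
              (1 + condition | participant) + (1 + condition | word) +
              (1 | object) + (1 | test_order),
            data = dat, family = binomial)

# add the fixed effect of pointing cue condition
m1 <- update(m0, . ~ . + condition)

# log-likelihood (chi-square) comparison; keep m1 if it fits significantly better
anova(m0, m1)
```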

Results and discussion

The final best-fitting models and results for all three analyses are presented in Table 1 and Figure 3. Participants performed statistically above chance in both conditions on immediate and retention trials.
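The paper does not specify how the above-chance comparison was computed; the line below is simply one way such a check could be run in R against the 1-in-16 chance level, ignoring the repeated-measures structure of the data (a mixed model with an offset of qlogis(1/16) would respect the clustering by participant).

```r
# exact binomial test of overall accuracy against chance (1/16 = 0.0625); illustrative only
binom.test(sum(dat$accuracy == 1), n = nrow(dat), p = 1/16, alternative = "greater")
```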

Table 1. Study 1: Best-fitting general linear model results predicting trial accuracy by pointing cue condition

Figure 3. Study 1: The effect of pointing cue timing on behavioural response mean accuracy across testing trials with standard error bars of all participants, grouped by pointing cue condition and trial type.

The final random-effects structure included by-participant and by-target word random intercepts with slopes for condition and random intercepts of target object and test order, across all models. Slopes of condition for target object and test order did not converge despite using allFit() procedures; these had the lowest variance so were removed (Barr et al., Reference Barr, Levy, Scheepers and Tily2013). The best-fitting model for immediate testing trials demonstrated a fixed effect of pointing cue condition (χ2(1) = 4.35, p =.037). Participants were significantly more likely to respond accurately in the early pointing condition compared to the late pointing condition (p =.028). The best-fitting model for retention trials included fixed effects of immediate accuracy and condition (χ2(1) = 146.1, p <.001), although there was no significant effect of condition once immediate accuracy was also included (p =.062). Participants were significantly more likely to respond correctly on retention trials if they responded correctly on immediate trials for the same word (p <.001).

Overall, participants had higher accuracy (immediate and retention test trials) in the early condition (M = 0.69) compared to the late condition (M = 0.60). For overall accuracy, the best-fitting model contained fixed effects of both pointing cue condition and trial type (χ2(1) = 4.50, p =.034), indicating that participants were more likely to respond correctly when tested on words learnt in the early pointing cue condition compared to the late pointing cue condition (p =.006) and were more likely to respond correctly in retention than immediate test trials overall (p =.032).

At debrief, only four of the 20 participants reported noticing a difference between conditions related to the pointing cue. This was unexpected, as the conditions were split into two distinct training blocks and the timing differences between words and pointing spanned a 1-second interval, which we expected to be easily detectable.

The results of Study 1 indicate that temporal ordering of cues with word utterance is important when initially establishing word–referent pairs, consistent with the cued attention literature (Hauer and Macleod Reference Hauer and Macleod2006; Yoshida and Burling, Reference Yoshida and Burling2012). Our results not only confirm the importance of cue timing to referent selection (Trueswell et al., Reference Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow and Gleitman2016) but also indicate that the effect of temporal co-occurrence is more fine-grained than −2 to + 2 seconds. Pointing cues during training that occur just 1 second before label utterance significantly improved accuracy at test when compared to those that occurred 1 second after word utterance. Whether gestures occurring even closer to naming, as in the 500ms used in Nirme et al. (Reference Nirme, Haake, Gulz and Gullberg2020) or the 360ms used in Habets et al. (Reference Habets, Kita, Shao, Özyurek and Hagoort2011), may boost learning further is an open question that we revisit in the General Discussion. Further, the effect of temporal synchrony of pointing and spoken label during referent selection also influenced retention accuracy in our cross-situational paradigm.

Interestingly, only four of the 20 participants reported noticing that the pointing cue appeared at different time points within trials across the two conditions. This suggests that the temporal synchrony of pointing cue and spoken information was not explicitly available to the majority of participants, indicating that strategic use of information was likely not driving performance and that differences in test accuracy between conditions were not due to conscious manipulation of attention by learners. These results, however, do not yet indicate how learners’ attention to objects is affected by the timing of a pointing cue and what pattern of visual attention relates to learning – we therefore examine this using an eye-tracker in Study 2.

Study 2: How do early pointing cues support more accurate word learning than late pointing cues?

We hypothesised that the advantage of early pointing cues over late cues was due to where attention was allocated during, rather than following, label utterance. Early pointing may benefit learning by endogenously cuing orientation of visual attention to the target referent before the word is named (Hauer and Macleod, Reference Hauer and Macleod2006), thus strengthening the link between word and referent. That is, participants may have learned more effectively in the early condition because they were already looking at the target object when they heard the referring label. We therefore repeated the procedure of Study 1 to replicate the behavioural effects, but also monitored participants’ gaze during training trials using an eye-tracker, allowing us to pinpoint where their attention was directed during label utterance. We made two additional predictions relating to the temporal dynamics of multiple cue integration during word learning: (1) if the early pointing cue promotes attention to the target over the foil, participants would have increased overall relative looking time to the target compared to the foil during training trials in the early condition (relative to the late pointing cue condition), and (2) if the early pointing cue advantage for learning is due to where attention is located when the word is spoken, then greater accuracy would be predicted by fixations to the target during and immediately after the spoken label, but not prior to the spoken label.

Method

Participants were twenty monolingual English-speaking adults without any sensory deficits who had not taken part in Study 1 (age M = 19.9 years, SD = 4.15, range = 18.0 – 37.0; 5 male, 15 female), as specified in the pre-registration. They were recruited and reimbursed as per the procedures outlined in Study 1.

Materials The materials remained the same as in Study 1, with the following exceptions: a Tobii Pro X3-120 eye-tracker was used (sampling rate 120Hz) in conjunction with a Windows computer (17-inch monitor, screen resolution 1600 × 900) to track binocular participant gaze throughout training trials. Participants were seated approximately 60cm away from the eye-tracker.

Procedure Participants’ eye positions were calibrated using the Tobii Eye-Tracker Manager five-point calibration system before the experiment. The rest of the procedure followed that of Study 1.

An average of binocular data from the left and right eye was taken to give a single (x, y) coordinate for each gaze point. Where data from one eye were missing, data from the other eye were taken. If data from both eyes were missing, linear interpolation within participant and within trial was used to smooth the data.
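A minimal sketch of this preprocessing in R is given below, assuming hypothetical column names (left_x, left_y, right_x, right_y, participant, trial, time); it is an illustration of the steps just described, not the authors' processing code, which is available on OSF.

```r
library(dplyr)
library(zoo)   # na.approx() for linear interpolation

gaze <- gaze %>%
  mutate(
    # average the two eyes; if one eye is missing, na.rm keeps the other
    x = rowMeans(cbind(left_x, right_x), na.rm = TRUE),
    y = rowMeans(cbind(left_y, right_y), na.rm = TRUE)
  ) %>%
  group_by(participant, trial) %>%
  arrange(time, .by_group = TRUE) %>%
  mutate(
    # samples missing in both eyes become NaN above; interpolate them linearly within trial
    x = na.approx(ifelse(is.nan(x), NA, x), na.rm = FALSE),
    y = na.approx(ifelse(is.nan(y), NA, y), na.rm = FALSE)
  ) %>%
  ungroup()
```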

The data were split into time bins of 250 milliseconds, and three distinct areas of interest (AOIs) were identified: cue, foil and target object. Fixations within these AOIs were detected using the saccades package (von der Malsburg, Reference Von der Malsburg2015) in R [v1.1.463], allowing for isolation of fixations whilst disregarding artefacts such as blinks. All processing code is available on OSF.
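The sketch below illustrates these two steps with the saccades package named above; the AOI boundaries and column names are hypothetical placeholders rather than the actual screen coordinates used in the study.

```r
library(saccades)
library(dplyr)

# detect.fixations() expects a samples data frame with columns: time, x, y, trial
fixations <- detect.fixations(gaze[, c("time", "x", "y", "trial")])

fixations <- fixations %>%
  mutate(
    aoi = case_when(                 # placeholder AOI boundaries
      x < 600  ~ "left_object",
      x > 1000 ~ "right_object",
      TRUE     ~ "cue_or_other"
    ),
    bin = floor(start / 250) * 250   # assign each fixation to a 250 ms time bin by onset
  )
```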

Statistical analysis

We first constructed GLME analyses in the same way as for Study 1 with behavioural response data at test only (Analysis 1). We then examined the effect of pointing cue timing on the learning process during training, first descriptively and then by employing growth curve analysis (GCA) to analyse target fixation proportion across conditions (Analysis 2). GCA allows for modelling of differences between participants whilst allowing for within-participant differences across time (Mirman et al., Reference Mirman, Dixon and Magnuson2008). We used the best-fitting orthogonal polynomials for the time form function, testing up to cubic polynomials. GCAs were fitted according to Mirman (Reference Mirman2014) using lme4 in R Studio. A baseline model was constructed that predicted mean fixation proportion to target with fixed effects of all time terms, random slopes of all time terms per participant and random slopes of time terms for each participant per condition. These models failed to converge despite applying techniques to retain maximal random-effects structure (Barr et al., Reference Barr, Levy, Scheepers and Tily2013; Mirman, Reference Mirman2014), resulting in a baseline model of all time terms with random effects of all time terms per participant. Subsequent models were then built up by adding a fixed effect of pointing cue timing condition (early or late) to the intercept only and then adding a fixed effect of pointing cue timing condition to all time terms. Each model was compared to a baseline model, or previous best-fitting model, using log-likelihood comparisons. For all models, the early pointing cue training condition was used as the reference level. We then conducted post hoc t-tests to compare mean target fixation proportions between time bins.
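A minimal sketch of such a growth curve model in R, following the general approach described in Mirman (2014), is shown below; the data frame bins and its columns (fix_prop, bin, condition, participant) are placeholders, and the model structure mirrors the description above rather than reproducing the authors' code.

```r
library(lme4)

# orthogonal linear, quadratic and cubic time terms over the ordered 250 ms bins
bin_levels <- sort(unique(bins$bin))
ot <- poly(seq_along(bin_levels), degree = 3)
bins[, c("ot1", "ot2", "ot3")] <- ot[match(bins$bin, bin_levels), ]

# baseline: all time terms, with random time terms per participant
gca0 <- lmer(fix_prop ~ ot1 + ot2 + ot3 +
               (ot1 + ot2 + ot3 | participant),
             data = bins, REML = FALSE)

gca1 <- update(gca0, . ~ . + condition)                     # condition on the intercept only
gca2 <- update(gca1, . ~ . + condition:(ot1 + ot2 + ot3))   # condition on all time terms

anova(gca0, gca1, gca2)   # log-likelihood comparisons between successive models
```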

Analysis 3 identified when looking behaviour during training trials had the biggest effect on accuracy at test. Target fixation data were split into three distinct training phases, each comprising four time bins (Figure 6):

a) Phase 1: before the verbal label in both conditions and after cue occurrence in the early pointing cue condition (−1000 to 0 milliseconds)

b) Phase 2: after the verbal label in both conditions (0 to 1000 milliseconds)

c) Phase 3: after the occurrence of the pointing cue in the late condition (1000 to 2000 milliseconds)

GLMEs were constructed with fixed effects of eye-tracking behaviour per phase and built in the same format as for all other analyses. Only the fixed effects differed; instead of a fixed effect of condition, average fixation proportion to target for each of the training phases (per word and per participant; coded as Phase 1, Phase 2 and Phase 3) was used. An added fixed effect of condition was not included due to a high variance inflation factor between condition and target fixation proportion (>3; Zuur et al., Reference Zuur, Ieno and Elphick2010). Interactions between time periods were not tested due to high VIF values within interaction models.
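The sketch below illustrates the form of these Analysis 3 models in R; phase1, phase2 and phase3 are placeholder columns holding each word's mean target fixation proportion in the corresponding training phase, and the VIF check is shown on a simple fixed-effects-only glm purely to illustrate the collinearity diagnostic mentioned above.

```r
library(lme4)
library(car)   # vif()

m_base   <- glmer(accuracy ~ 1 + (1 + condition | participant) + (1 | word),
                  data = test_dat, family = binomial)

m_phase2 <- update(m_base, . ~ . + phase2)   # analogous models test phase1 and phase3
anova(m_base, m_phase2)

# collinearity between condition and Phase 2 fixation proportion
# (predictors with VIF > 3 were not entered together, as described above)
vif(glm(accuracy ~ condition + phase2, data = test_dat, family = binomial))
```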

To further understand our results, we also conducted an additional post hoc analysis, Analysis 4, that was not pre-registered. This was split into two models: Analysis 4a identified the effect of word–referent exposure on average fixation proportion during the most crucial phase of training as determined by Analysis 3 using a linear mixed-effects model. Analysis 4b identified the effect of the interaction between fixation proportion during the most crucial phase of training, as determined by Analysis 3, and word–referent exposure on test accuracy using a general linear mixed-effects model. Models were constructed using the same processes as described previously, with the following exceptions. Analysis 4a tested fixed effects of word–referent exposure (number of occurrences as a continuous variable, 1 to 4) and condition (‘1’ = early, ‘−1’ = late). Analysis 4b tested fixed effects of an interaction between word–referent exposure (number of occurrences as a continuous variable, 1 to 4) and average target fixation proportion during training, immediate accuracy (‘1’ = correct, ‘−1’ = incorrect) for retention trial analysis and trial type (‘1’ = immediate, ‘−1’ = retention) for overall accuracy. In addition, for Analysis 4b, a random effect of test trial number was included.
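As an illustration of the Analysis 4 model structure, the sketch below shows one plausible realisation in R; exposure (1 to 4) and phase2_fix (mean target fixation proportion during Phase 2) are placeholder column names, and the random-effects structures are simplified.

```r
library(lme4)

# Analysis 4a: exposure and condition predicting Phase 2 target fixation proportion
m4a <- lmer(phase2_fix ~ exposure * condition +
              (1 | participant) + (1 + condition | word) + (1 | object),
            data = train_dat, REML = FALSE)

# Analysis 4b: exposure-by-fixation interaction predicting immediate test accuracy
m4b <- glmer(accuracy ~ exposure * phase2_fix +
               (1 | participant) + (1 | word) + (1 | object) + (1 | test_trial),
             data = test_dat, family = binomial)
```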

Results and discussion

Analysis 1: The effect of pointing cue timing on behavioural response The final random-effects structure included by-participant random intercepts with slopes for condition and by-target word random intercepts across all models. Random intercepts by-target object and by-test order did not converge, and random slopes of condition for all other random intercepts also did not converge despite using allFit(); these had the lowest variance so were removed. The results, presented in Table 2 and Figure 4, replicated those of Study 1. Participants again performed above chance in all conditions. Participants were more accurate at test on words learnt when the pointing cue occurred 1 second before label utterance (rather than 1 second after) across immediate trials (model fit: χ2(1) = 4.28, p =.038). Study 2 also demonstrated an additional effect of condition on retention trials where Study 1 did not: a model that included fixed effects of condition and immediate accuracy provided the best fit for retention test trial data (model fit: χ2(1) = 111.18, p <.001). Participants were more likely to respond accurately on retention trials for words learned in the early pointing cue condition (p =.006) and, as per Study 1, more likely to respond accurately for words that were correctly disambiguated in immediate test trials (p <.001).

Table 2. Study 2, Analysis 1: The effect of pointing cue timing on behavioural response – best-fitting general linear model results predicting trial accuracy by pointing cue condition

Figure 4. Study 2, Analysis 1: The effect of pointing cue timing on behavioural response – average accuracy across testing trials with standard error bars of all participants, grouped by pointing cue condition and trial type.

The best-fitting model predicting overall accuracy included fixed effects of pointing cue condition, trial type and an interaction between pointing cue condition and trial type (model fit: χ2(1) = 4.85, p =.028). This model showed that participants were more likely to respond accurately in immediate trials than retention trials (p =.003), more likely to respond accurately in the early cue condition than the late cue condition (p =.002), and the interaction demonstrated that learners were more likely to respond accurately in retention trials for words learnt in the early compared to late pointing cue condition (p =.026). Only three of the 20 participants reported noticing a difference between pointing cue conditions at debrief – again suggesting that the difference in performance appeared to be independent of any conscious manipulation of attention.

Analysis 2: Target fixation proportion during training using GCA Figure 5 shows how mean fixation proportion to target, foil and cue alters across trial time by condition (using geom_smooth in the ggplot2 package, local polynomial regression fitting, Wickham, Reference Wickham2016) in R [v1.1.463]. In the early pointing condition, participants looked predominantly at the target with a peak around word utterance, but began to look at the foil towards the end of the trial. In the late pointing condition, fixations at the beginning of the trial were split roughly equally between target and foil, but participants began to discriminate between target and foil after word utterance, with fixation to target rising after the pointing cue.

Figure 5. Study 2, Analysis 2: Target fixation proportion during training using GCA – mean fixation proportion (aggregated across all participants and trials) during training in 250ms time bins by pointing cue condition. Grey shaded areas indicate 95% confidence intervals. Phase 1 = after pointing cue in early condition and before word occurrence in both conditions; Phase 2 = after word onset; and Phase 3 = after pointing cue in late condition. Note that as this figure shows aggregated mean fixation proportion across participants and trials per condition, looks to cue in the late pointing condition prior to word occurrence likely stem from participants expecting the cue to appear from previous within-condition trials.
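A figure along the lines of Figure 5 could be produced as in the sketch below (not the authors' plotting code); the long-format data frame fix_long and its columns (time, fix_prop, aoi, condition) are placeholders.

```r
library(ggplot2)

ggplot(fix_long, aes(x = time, y = fix_prop, colour = aoi)) +
  geom_smooth(method = "loess") +    # local polynomial regression fitting, as noted in the text
  facet_wrap(~ condition) +
  labs(x = "Time from trial onset (ms)", y = "Mean fixation proportion")
```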

The GCA model and data fits are shown in Figure 6,Footnote 2 with Table 3 showing fixed-effect parameter estimates and standard error (p-values estimated using normal approximation for t-values). The overall time course of mean target fixations was best captured with a third-order (cubic) orthogonal polynomial (model fit: χ2(1) = 20.22, p <.001). The effect of condition improved model fit on the intercept and all time terms (all p <.001). The GCA analysis indicated that target fixation proportion was significantly different between the two conditions, with participants exhibiting a mirrored effect (Figure 6): participants in the early pointing cue condition looked longer at the target at the beginning of trials and decreased their fixation over the duration of trials, whilst participants in the late condition looked less at the target at the beginning of trials and increased their fixation over the duration of trials. To further test where differences between the early and late condition were significant, a series of post hoc independent samples two-tailed t-tests for each time bin were carried out. These reflected the same pattern as the GCAs; the t-tests demonstrated a significant difference at almost all time bins (8 out of 11 time bin differences were p <.001; Table 4).

Figure 6. Study 2, Analysis 2: Target fixation proportion during training using GCA – GCA showing mean fixation proportion to target in 250ms time bins, by pointing cue condition. Data points indicate mean and standard error bars for target fixation proportion, aggregated across all participants and trials. Lines indicate model fit.

Table 3. Study 2, Analysis 2: Target fixation proportion during training using GCA – results of GCA of mean target fixation proportion: estimates of time terms by pointing cue condition and model comparison of the best-fitting model

Table 4. Study 2, Analysis 2: Target fixation proportion during training using GCA – post hoc t-tests comparing mean target fixation proportion at 250 ms time bins by pointing cue condition
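The per-bin comparisons reported in Table 4 could be computed along the lines of the sketch below, assuming per-participant mean target fixation proportions per 250 ms bin; this mirrors the independent-samples two-tailed tests described above and is illustrative rather than the authors' code.

```r
library(dplyr)

bin_tests <- bins %>%
  group_by(bin) %>%
  summarise(p_value = t.test(fix_prop[condition == "early"],
                             fix_prop[condition == "late"])$p.value)
```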

In line with our hypothesis, participants were more likely to fixate on the target before and during word utterance in the early compared to the late condition. However, the increase in target fixation prior to cue onset over trials in the late pointing cue condition demonstrates that, over multiple exposures to word–referent pairs, participants could identify the correct target prior to the cue’s appearance. The cue in the late pointing condition thus appeared to act as a confirmation of a referent, whereas in the early pointing condition, the cue appeared to act as a predictor of the referent prior to label occurrence. We then assessed how these patterns during training might have affected participants’ performance at test.

Analysis 3: When does target fixation during training predict word learning accuracy? The final random-effects structure included by-participant random intercepts with slopes for condition and by-target word random intercepts, across all models. Random intercepts by target object and by test order did not converge, and random slopes of condition for all other random intercepts also did not converge despite using allFit(); these had the lowest variance so were removed. Results of the models are reported in Table 5. There was a significant effect of fixation proportion to target in Phase 2 (after verbal label in both conditions) on immediate trial accuracy (model fit: χ2(1) = 6.10, p =.014). There was also a significant effect of fixation proportion in Phase 2 and immediate test trial accuracy when testing retention trial accuracy (model fit: χ2(1) = 109.12, p <.001). Finally, there was a significant effect of fixation proportion to target in Phase 2 (after verbal label in both conditions) and trial type when testing overall accuracy (model fit: χ2(1) = 10.19, p =.001). GLMEs fitted for Phases 1 (before verbal label in both conditions, after cue in early condition) and 3 (after cue in late condition) did not identify significant effects of average fixation proportion to target on accuracy in any of the test trials, indicating that looking behaviour during training before the word occurred and after the cue occurred in the late pointing cue condition did not influence performance at test. An additional analysis testing total fixation proportion during the trial across all time periods (see OSF) did not yield any significant predictive effects on accuracy. Thus, fixation to target during Phase 2 (after verbal label in both conditions) immediately after word utterance was the crucial time period for accurate learning.

Table 5. Study 2, Analysis 3: When does target fixation during training predict word learning accuracy? Best-fitting general linear model results predicting trial accuracy with fixed effects of target fixation proportion during training

Note: Only fixation proportion during training Phase 2 (after the label utterance) was a significant predictor of accuracy.

Analysis 4a: Does word–referent exposure influence fixation to target? Next, we analysed fixation proportion to target during Phase 2 (after verbal label in both conditions), taking into account the number of times participants had been exposed to the word–referent pair. Each word–referent pairing had four exposures during training, and the expectation of cross-situational word learning is that participants successfully learn word–referent pairs after multiple exposures.

The final model had by-participant random intercepts, by-target word random intercepts with slopes of condition, and a random intercept of target object. A slope of condition did not converge for target object despite using allFit() and was removed. For average fixation proportion during Phase 2 (after verbal label), the best-fitting model included significant fixed effects of word–referent exposure, condition, and an interaction between the two (χ2(1) = 62.41, p <.001; Table 6). This indicated that mean target fixation proportion increased with exposure for the late pointing condition but decreased for the early condition. Figure 7 illustrates how participants in the early pointing cue condition looked less at the target during label utterance as word–referent exposure increased, whereas participants in the late pointing cue condition exhibited the opposite pattern, looking more at the target during label utterance after multiple exposures. These profiles likely reflect different learning strategies over time between the two conditions.

Table 6. Study 2, Analysis 4a: Does word–referent exposure influence fixation to target? Best-fitting linear model results predicting target fixation proportion by pointing cue condition and word–referent exposure

Figure 7. Study 2, Analysis 4a: Does word–referent exposure influence fixation to target? Mean target fixation proportion (aggregated across all participants, all words, and all trials) and standard error bars during label utterance (Phase 2 [after verbal label in both conditions]; Figure 5) by word–referent exposure and pointing cue condition.

Analysis 4b: Does the interaction between word–referent exposure and target fixation proportion during training affect accuracy at test? All models had random intercepts of participant, target word, target object and test trial number; test order did not converge despite using allFit() and had the lowest variance so was removed. For immediate accuracy, the interaction between word–referent exposure and average target fixation proportion during Phase 2 of training (model fit: χ2(1) = 3.94, p =.047; Table 7) indicated that participants were more accurate if they fixated longer on the target with increasing word–referent exposures (p =.045).

Table 7. Study 2, Analysis 4b: Does average target fixation proportion by word–referent exposure during training affect accuracy? General linear model results showing interaction between average target fixation proportion during Phase 2 and word–referent exposure on accuracy at test

For retention data, the model contained a fixed effect of immediate accuracy at test and the interaction between word–referent exposure and average target fixation proportion during Phase 2 of training (model fit: χ2(1) = 362.21, p <.001; Table 7). This indicated that participants were more likely to respond accurately in retention test trials if they had responded correctly on the corresponding immediate test trial (p <.001), and they were more likely to respond accurately overall if they fixated longer on the target with increasing word–referent exposures during training (p =.015).

For target fixation data predicting overall accuracy, fixed effects of trial type and average target fixation proportion during first word–referent exposures were found (model fit: χ2(1) = 54.90, p <.001, Table 7). Participants were more likely to respond accurately in immediate trials than retention trials (p <.001) and were more likely to respond accurately overall if they fixated longer on the target with increasing word–referent exposures during training (p <.001).

Together with the GCA (Analysis 2), Analyses 4a and 4b indicate that participants learned words more accurately when the pointing cue occurred 1 second before the word, rather than 1 second after, primarily because they exhibited higher target fixation during the period surrounding label utterance. Furthermore, from the first exposures to word–referent pairs, participants already demonstrated higher target fixation proportion during label utterance in the early pointing cue condition (Figure 7), which predicted higher accuracy at test.

General discussion

The contribution of cross-situational statistics to word learning is well documented, but the mechanisms through which environmental cues facilitate cross-situational word learning are not well understood. In this study, we showed how studies of pointing cue use in word learning can align with the long-standing tradition of studies exploring visual attentional cueing. We highlighted how the effectiveness of pointing cues in language learning is determined by the timing of endogenous cue reorientation, potentially tailored to exploit the coordination of attention at the moment of labelling to optimise word learning.

Study 1 demonstrated that early pointing cues under referential ambiguity yield superior learning to late pointing cues, indicating that when cues occur in relation to label utterance has a direct influence on word–object mapping accuracy. Study 2 replicated these results and confirmed that this superior learning was due to the early cue directing visual attention to the target referent during label utterance. Both studies demonstrated that immediate referent selection accuracy was a predictor of later retention accuracy and that this effect was a stronger predictor of retention than any manipulation of pointing cue condition – indicating that the dynamics of referent selection are vital to subsequent retention (McMurray et al. Reference McMurray, Horst and Samuelson2012; Yu & Smith, Reference Yu and Smith2012). These results are consistent with studies that examine the time course of how, and when, endogenous cues orient attention to objects (Berger et al., Reference Berger, Henik and Rafal2005; Yoshida & Burling, Reference Yoshida and Burling2012). However, these effects have not previously been merged with word learning, and our study investigating cross-situational word learning with different temporal arrangements of pointing cues provides an example of how endogenous cueing during similar word learning tasks may interplay with speech to support learning.

Studies that examine pointing cues under naturalistic settings have also indicated different effects of temporal order for cued attention during word learning. In naturalistic settings, gestures appear more frequently before, rather than after, speech (Bergmann et al., Reference Bergmann, Aksu and Kopp2011; Donnellan et al., Reference Donnellan, Özder, Man, Grzyb, Gu and Vigliocco2022). Frank et al. (Reference Frank, Tenenbaum and Fernald2013) found that pointing gestures were used to introduce new topics and tended to be used at the beginning of discourses about objects during semi-naturalistic mother–infant interactions. Regarding language acquisition, children also looked at an object less as it was talked about more, mirroring the pattern of target fixation behaviour in the early pointing condition (Figure 7). Furthermore, novel words are learnt by infants most accurately when they are centred in view and largest in size during label utterance (Pereira et al., Reference Pereira, Smith and Yu2014), and children’s attention to referents is highest during, and just after, label utterance in naturalistic mother–infant interaction videos (Trueswell et al., Reference Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow and Gleitman2016).

In adult communication, adjusting the timing of gesture and naming has been shown to affect processing. Nirme et al. (Reference Nirme, Haake, Gulz and Gullberg2020) found that delaying the gesture affected judgements of the naturalness of communicative situations when the gesture overlapped with a pause, Habets et al. (Reference Habets, Kita, Shao, Özyurek and Hagoort2011) found enhanced semantic integration when iconic gestures occurred before rather than simultaneously with naming, and Cavicchio and Busà (Reference Cavicchio and Busà2023) found that iconic gestures resulted in quicker identification of an action for non-native speakers. Thus, gesture–naming timing is established as critical for effective processing of potential word learning situations, and in our study, we showed how manipulating this timing can affect both referent selection and retention of novel words. It appears that gesturing before speaking is beneficial for learning, and so the temporal ordering may not merely be a consequence of production constraints, but may instead meet the contingent need of the learner in acquiring new words (Holler & Levinson, Reference Holler and Levinson2019). Overall, the benefit of endogenous cues to cross-situational word learning appears to be mediated by quality rather than quantity: when a learner fixates upon a target referent may matter more than how much they fixate on a target referent. As Study 2 demonstrated, simply looking at a target prior to label utterance is not sufficient to improve learning. Our analyses showed that target fixation prior to word occurrence during training (Study 2, Phase 1, before verbal label in both conditions, after cue in early condition) did not predict accuracy at test, despite participants in the early pointing cue condition having more time to fixate on the target before label utterance. Rather, the value of early pointing cues lies in leading the learner to fixate upon the correct referent at the moment of label utterance, from the very first exposures to novel words, which may make the resulting word–referent mappings more resilient. This difference is apparent even when varying the relative timing of the pointing cue to label utterance by only 1 second, as participants performed significantly less accurately in the late pointing cue condition across both studies. Consistent with these findings, MacDonald et al. (Reference MacDonald, Yurovsky and Frank2017) found that adult learners still tracked a single hypothesis and spent less time on alternative word–referent pairs when a gaze cue to a target object was present (as opposed to absent), even though both conditions allowed the same amount of time to visually inspect the objects during cross-situational training. The authors suggested this was because gaze increased the opportunity to maintain attention on the target referent.

When examining adult cross-situational word learning, Yu et al. (Reference Yu, Zhong and Fricker2012) found that strong and weak learners exhibited a pattern of looking behaviour that only began to differ around the middle stages of their training, likely due to gradual aggregation of statistical co-occurrences over time. This is consistent with our results in Study 2, where participants in the late pointing cue condition increasingly fixated on the target over trials with increased word–referent exposure (Figure 7). However, during the early pointing cue condition, participants began trials by fixating upon the target because they were cued towards it. In Yu et al. (Reference Yu, Zhong and Fricker2012), strong learners had increased attention to the referent towards the end of trials, rather than the beginning. With an early pointing cue, learners in Studies 1 and 2 may have been provided with a shortcut that enabled them to direct their attention towards the target from the very first exposure, resulting in more accurate performance at test. This is in line with the eye-tracking data showing that fixations to target in the first exposures to word–referent pairs, rather than the last exposures, were predictive of word learning accuracy.

Increased looking and attention to the referent when an unfamiliar label is uttered may benefit learning by increasing the initial strength of association between label and target, which then builds up gradually over multiple situations. The Study 2 finding of high target fixation during the first exposures to words in the early cue condition supports this interpretation. Reducing attention to foil objects may also decrease the likelihood of forming spurious word–object associations, supporting learning of the precise word–referent mappings intended by the speaker (e.g. Yu & Ballard, Reference Yu and Ballard2007; McMurray et al., Reference McMurray, Horst and Samuelson2012). Associative models of word learning (MacWhinney, Reference MacWhinney2005; McMurray et al., Reference McMurray, Horst and Samuelson2012; Yu & Smith, Reference Yu and Smith2012) contend that a learner builds up weights on associations between labels and foils, as well as targets. We show that directing attention to the target with a pointing cue prior to the word being spoken may prevent the learner from making false associations between a foil and the label, limiting the formation of competing associations. However, cues that occur after the word is spoken do not appear to prevent some competing false label–foil associations from being formed, resulting in reduced accuracy at test relative to the early pointing cue condition. Applying a cue to indicate the target referent after the label has been spoken does not provide the same quality of information as when attention is already drawn to the target referent prior to the label being spoken. Therefore, the presence of cues is not the only factor that promotes optimal word learning: the temporal contiguity of those cues with labelling must also be effective.
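To illustrate this account (and not as a model implemented in this paper), the sketch below shows a simple co-occurrence learner in R in which associative weight accrues between a heard label and whichever objects are attended at the moment the label is uttered; all object and label names are hypothetical.

# Illustrative sketch only: associations accumulate between an uttered label and
# the objects attended at that moment.
labels  <- c("tove", "bim")
objects <- c("target_tove", "target_bim", "foil")
W <- matrix(0, nrow = length(labels), ncol = length(objects),
            dimnames = list(labels, objects))

update_weights <- function(W, label, attended, rate = 0.1) {
  # Every attended object gains associative strength with the uttered label.
  W[label, attended] <- W[label, attended] + rate
  W
}

# Early cue: attention is already on the correct referent when "tove" is uttered,
# so only the correct label-target association is strengthened.
W <- update_weights(W, "tove", "target_tove")

# Late cue: attention may still be split between target and foil at utterance,
# so a spurious label-foil association is also strengthened.
W <- update_weights(W, "tove", c("target_tove", "foil"), rate = 0.05)
round(W, 2)

Over many such trials, the early-cue regime concentrates weight on the correct mapping, whereas the late-cue regime allows competing label–foil weights to accumulate, which is the pattern this account predicts.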

Another benefit for learning conferred by pointing cues preceding labelling concerns prediction. Ramscar et al. (Reference Ramscar, Yarlett, Dye, Denny and Thorpe2010) manipulated the ordering of objects and labels during word learning in adults and found that learning was more accurate when objects were presented prior to labels, rather than when labels preceded objects. This may be due to differences in the informativeness of labels and objects as conditioning cues; when objects occur prior to labels, learners must process several object features as distinctive cues that compete for relevance when predicting the label. However, when learners are exposed to the label first, this provides a far more constrained source of information to predict objects from. Consistent with this, learners in our study appeared to use the early pointing cue as a predictor of the referent, whereas in the late pointing condition, the cue may have simply confirmed the participant’s assumption, resulting in a weaker prediction for the learner.
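The same intuition can be sketched with an error-driven (Rescorla–Wagner-style) update in which attended objects act as cues that compete to predict the label; again, this is an illustration under stated assumptions rather than a model from our analyses, and all names are hypothetical.

# Illustrative Rescorla-Wagner-style sketch: attended objects compete to predict
# the upcoming label, and prediction error drives learning.
rw_update <- function(v, present_cues, outcome_present, alpha = 0.2, lambda = 1) {
  prediction <- sum(v[present_cues])
  error <- (if (outcome_present) lambda else 0) - prediction
  v[present_cues] <- v[present_cues] + alpha * error
  v
}

# Early pointing cue: only the target is attended before the label, so it alone
# absorbs the prediction error and its association grows quickly.
v_early <- c(target = 0, foil = 0)
for (trial in 1:10) v_early <- rw_update(v_early, "target", outcome_present = TRUE)

# Late pointing cue: attention is split before the label, so target and foil
# share the prediction error and the target-label association grows more slowly.
v_late <- c(target = 0, foil = 0)
for (trial in 1:10) v_late <- rw_update(v_late, c("target", "foil"), outcome_present = TRUE)

round(rbind(early = v_early, late = v_late), 2)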

An alternative explanation for why early pointing cues facilitate more accurate word learning is that participants are more familiar with this ordering of gesture and speech, assuming that the majority of gestures precede naming of referents in naturalistic communication. Under this view, early gesturing is an accidental property of the communicative environment, and learners become attuned to it. Though this is a possibility, and one that applies equally to the interpretation of previous studies adjusting gesture and speech ordering (e.g. Cavicchio and Busà, Reference Cavicchio and Busà2023; Habets et al., Reference Habets, Kita, Shao, Özyurek and Hagoort2011; Nirme et al., Reference Nirme, Haake, Gulz and Gullberg2020; Trueswell et al., Reference Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow and Gleitman2016), we believe it is less likely than our favoured interpretation: that the ordering effect arises from cognitive mechanisms integrating speech and visual information. This is because the eye-tracking data demonstrate how the learner explores the scene, providing evidence not only of a learning boost but also of how looking to the target at the point of naming benefits learning and is precipitated by the gesture. Being used to the relative order of speech and gesture would not necessarily result in these subtle patterns of looking. As our results are consistent with the general attentional cueing literature (Berger et al., Reference Berger, Henik and Rafal2005; Brignani et al., Reference Brignani, Guzzon, Marzi and Miniussi2009; Hauer & Macleod, Reference Hauer and Macleod2006; Shepherd & Müller, Reference Shepherd and Müller1989), rather than being specific to situations involving gesture and naming production, we contend that it is preferable to explain the results with broader cognitive theories rather than more specific ones. Also, as only seven of 40 participants noticed a difference in cue timing between conditions, any violation of ordering familiarity did not influence conscious processing for the majority of our sample. In Nirme et al. (Reference Nirme, Haake, Gulz and Gullberg2020), judgements of naturalness were only affected when iconic gestures overlapped with a pause, and manipulations of 500ms before or after naming did not otherwise influence judgements. In our study, moving the gesture to 1s before or after naming similarly resulted in no difference in participants’ perceptions. This suggests that learners have little meta-awareness of the context and process surrounding word learning itself and likely have little explicit control over how cues, labels and referents are sequenced in communicative situations. Nonetheless, in naturalistic settings, speakers use gesture and speech in ways that are beneficial to learning, even though they are likely unaware of their temporal contiguity.

Limitations

This study has a number of limitations. Firstly, our use of a finger and hand as a pointing cue may raise the question of whether an arrow might yield the same results. However, the advantage of using a finger pointing cue is simply that such cues play a more prominent role in naturalistic language learning than arrows do. Whether visual attention grabbers such as lights and arrows outweigh social cues, such as head turn and eye gaze, has been addressed elsewhere (e.g. see Axelsson et al., Reference Axelsson, Churchley and Horst2012; Hartley et al., Reference Hartley, Bird and Monaghan2020; Wu & Kirkham, Reference Wu and Kirkham2010). A similar limitation concerns the naturalism of a static photograph of a human hand with an index-finger point as a pointing cue; although easily recognisable by human learners, this was not as naturalistic as having an actor point at different objects. As described previously (see ‘Current Study’ section), this static cue was chosen to afford more control over the precise timing of informative value during word–referent mapping, compared with dynamic video stimuli, where informative value unfolds over time (Donnellan et al., Reference Donnellan, Özder, Man, Grzyb, Gu and Vigliocco2022). Although we believe it is unlikely that video stimuli would produce vastly different results, we recommend that future studies examine the role of more naturalistic social and non-social cues under the same conditions of referential ambiguity, or even weigh different types of social cues against one another.

Secondly, although pointing cues are reliable indicators of referents, they occur in relatively few naturalistic learning situations, such as during language acquisition. In their semi-naturalistic mother–infant video corpus, Frank et al. (Reference Frank, Tenenbaum and Fernald2013) report that pointing cues had a recall value of 10%, whereas maternal eye gaze had a recall value of 36%. In Trueswell et al. (Reference Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow and Gleitman2016), highly informative vignettes that contained maternal gestures were rare, and in Iverson et al. (Reference Iverson, Capirci, Longobardi and Caselli1999), mothers only used pointing cues during word learning 15% of the time. Pointing cues are therefore likely to be only one of several cues that can support cross-situational word learning.

Thirdly, we did not intermix early and late cues within the same block, and instead attempted to minimise variation by blocking the task by cue timing. This blocking may have introduced additional bias into the results if learners were distinctly aware of the difference; however, as noted, the majority of participants failed to notice a difference between the conditions.

A final limitation, and an opportunity for further exploration, is that we investigated only two gesture–naming offsets: 1s before and 1s after the label, an asynchrony situated between the 2s of Trueswell et al. (Reference Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow and Gleitman2016) and the 500ms and 360ms intervals of Nirme et al. (Reference Nirme, Haake, Gulz and Gullberg2020) and Habets et al. (Reference Habets, Kita, Shao, Özyurek and Hagoort2011), respectively. Testing multiple asynchronies would allow the optimal gesture–naming offset for supporting learning to be determined, potentially closer to the 370ms interval between gesture and naming found in naturalistic discourse (Donnellan et al., Reference Donnellan, Özder, Man, Grzyb, Gu and Vigliocco2022). Evidence for a quantitative effect of gesture–naming ordering would enable the fine-grained learning mechanisms that apply in learning novel words to be specified more fully.

Conclusion

These studies offer multiple insights into how pointing cues can facilitate disambiguation of meaning when a learner is faced with referential ambiguity. The value of pointing cues appears to lie in compensating for referential ambiguity by providing accurate information about referents. Cues are particularly useful when they highlight referents prior to labels: when a perfectly disambiguating pointing cue occurs before a novel word is spoken, it benefits the learner more than a pointing cue that occurs after the novel word. These temporal effects are consistent with how pointing cues operate alongside speech in naturalistic studies and show that the attention literature on endogenous cues is also applicable to cross-situational word learning. The studies presented here provide a controlled setting that demonstrates how and when pointing can support cross-situational statistical learning, and translate well-investigated attention and memory phenomena into effects of cueing during word learning.

Data availability statement

All data and code supporting this study are available on OSF at https://osf.io/2m9pe/?view_only=9d64688d03d84704aa5f2e8f8eb34dc9.

Acknowledgments

This research was funded by the Leverhulme Trust (RWC, Leverhulme Trust Doctoral Scholar [DS-2014-014]) and the International Centre for Language and Communicative Development (LuCiD) at Lancaster University funded by the Economic and Social Research Council UK (PM and CH [ES/L008955/1]).

Competing interest

There are no competing interests to declare. All pre-registrations, data, code and additional outputs can be found on the OSF via this link: https://osf.io/2m9pe/?view_only=9d64688d03d84704aa5f2e8f8eb34dc9.

Footnotes

1 Please note that two additional experiments were pre-registered alongside those reported in this manuscript; their results are reported on OSF for full transparency.

2 Due to technical issues, some data at the beginning of the trial were lost. The drop in fixation proportion to target at time bin 8 (2000 ms) in the late condition was likely due to cue appearance, but this was not captured by a quartic orthogonal polynomial.

References

Axelsson, E. L., Churchley, K., & Horst, J. S. (2012). The right thing at the right time: Why ostensive naming facilitates word learning. Frontiers in Psychology, 3, 1–8. https://doi.org/10.3389/fpsyg.2012.00088
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. https://doi.org/10.1016/j.jml.2012.11.001
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
Berger, A., Henik, A., & Rafal, R. (2005). Competition between endogenous and exogenous orienting of visual attention. Journal of Experimental Psychology: General, 134(2), 207–221. https://doi.org/10.1037/0096-3445.134.2.207
Bergmann, K., Aksu, V., & Kopp, S. (2011). The relation of speech and gestures: Temporal synchrony follows semantic synchrony. In Proceedings of the 2nd Workshop on Gesture and Speech in Interaction (GeSpIn 2011). https://pub.uni-bielefeld.de/record/2392953
Beun, R.-J., & Cremers, A. H. M. (1998). Object reference in a shared domain of conversation. Pragmatics and Cognition, 6(1–2), 121–152.
Bhat, A. A., Spencer, J. P., & Samuelson, L. K. (2022). Word-Object Learning via Visual Exploration in Space (WOLVES): A neural process model of cross-situational word learning. Psychological Review, 129(4), 640–695. https://doi.org/10.1037/rev0000313
Brignani, D., Guzzon, D., Marzi, C. A., & Miniussi, C. (2009). Attentional orienting induced by arrows and eye-gaze compared with an endogenous cue. Neuropsychologia, 47(2), 373–381. https://doi.org/10.1016/j.neuropsychologia.2008.09.011
Carey, S., & Bartlett, E. (1978). Acquiring a single new word. Papers and Reports on Child Language Development, 15, 17–29.
Cartmill, E. A., Armstrong, B. F., Gleitman, L. R., Goldin-Meadow, S., Medina, T. N., & Trueswell, J. C. (2013). Quality of early parent input predicts child vocabulary 3 years later. Proceedings of the National Academy of Sciences, 110(28), 11278–11283. https://doi.org/10.1073/pnas.1309518110
Cavicchio, F., & Busà, M. G. (2023). The role of representational gestures and speech synchronicity in auditory input by L2 and L1 speakers. Journal of Psycholinguistic Research, 52, 1721–1735.
Chu, M., & Hagoort, P. (2014). Synchronization of speech and gesture: Evidence for interaction in action. Journal of Experimental Psychology: General, 143(4), 1726–1741. https://doi.org/10.1037/a0036281
Donnellan, E., Özder, L. E., Man, H., Grzyb, B., Gu, Y., & Vigliocco, G. (2022). Timing relationships between representational gestures and speech: A corpus based investigation. Proceedings of the 44th Annual Meeting of the Cognitive Science Society, 2052–2058.
Dunn, K. J., Frost, R. L. A., & Monaghan, P. (2024). Infants’ attention during cross-situational word learning: Environmental variability promotes novelty preference. Journal of Experimental Child Psychology, 241, 105859.
Fitneva, S. A., & Christiansen, M. H. (2011). Looking in the wrong direction correlates with more accurate word learning. Cognitive Science, 35(2), 367–380. https://doi.org/10.1111/j.1551-6709.2010.01156.x
Frank, M. C., Tenenbaum, J. B., & Fernald, A. (2013). Social and discourse contributions to the determination of reference in cross-situational word learning. Language Learning and Development, 9(1), 1–24. https://doi.org/10.1080/15475441.2012.707101
Gillette, J., Gleitman, H., Gleitman, L., & Lederer, A. (1999). Human simulations of vocabulary learning. Cognition, 73(2), 135–176. https://doi.org/10.1016/S0010-0277(99)00036-0
Goldin-Meadow, S. (2003). Hearing gesture: How our hands help us think. Harvard University Press.
Golinkoff, R. M., Hirsh-Pasek, K., Bailey, L. M., & Wenger, N. R. (1992). Young children and adults use lexical principles to learn new nouns. Developmental Psychology, 28(1), 99–108. https://doi.org/10.1037/0012-1649.28.1.99
Habets, B., Kita, S., Shao, Z., Özyurek, A., & Hagoort, P. (2011). The role of synchrony and ambiguity in speech–gesture integration during comprehension. Journal of Cognitive Neuroscience, 23(8), 1845–1854. https://doi.org/10.1162/jocn.2010.21462
Halberda, J. (2006). Is this a dax which I see before me? Use of the logical argument disjunctive syllogism supports word-learning in children and adults. Cognitive Psychology, 53(4), 310–344. https://doi.org/10.1016/j.cogpsych.2006.04.003
Hartley, C., Bird, L.-A., & Monaghan, P. (2020). Comparing cross-situational word learning, retention, and generalisation in children with autism and typical development. Cognition, 187, 104265. https://doi.org/10.1016/j.cognition.2019.03.001
Hauer, B. J. A., & Macleod, C. M. (2006). Endogenous versus exogenous attentional cuing effects on memory. Acta Psychologica, 122(3), 305–320. https://doi.org/10.1016/j.actpsy.2005.12.008
Holler, J., & Levinson, S. (2019). Multimodal language processing in human communication. Trends in Cognitive Sciences, 23(8), 639–652.
Hollich, G. J., Hirsh-Pasek, K., Golinkoff, R., Brand, R. J., & Brown, E. (2000). Breaking the language barrier: An Emergentist Coalition Model for the origins of word learning. Monographs of the Society for Research in Child Development, 65(3), I–135.
Horst, J. S., & Hout, M. C. (2016). The Novel Object and Unusual Name (NOUN) database: A collection of novel images for use in experimental research. Behavior Research Methods, 48(4), 1393–1409. https://doi.org/10.3758/s13428-015-0647-3
Iverson, J. M., Capirci, O., Longobardi, E., & Caselli, M. C. (1999). Gesturing in mother–child interactions. Cognitive Development, 14(1), 57–75. https://doi.org/10.1016/S0885-2014(99)80018-5
Jonides, J. (1981). Towards a model of the mind’s eye’s movement. Canadian Journal of Psychology, 34(2), 103–112. https://doi.org/10.1037/h0081031
Kita, S. (2009). Cross-cultural variation of speech-accompanying gesture: A review. Language and Cognitive Processes, 24, 145–167.
Kranstedt, A., Lücking, A., Pfeiffer, T., Rieser, H., & Wachsmuth, I. (2006). Deictic object reference in task-oriented dialogue. Situated Communication, 166, 135–189. https://doi.org/10.1515/9783110197747.155
Levelt, W. J., Richardson, G., & La Heij, W. (1985). Pointing and voicing in deictic expressions. Journal of Memory and Language, 24(2), 133–164.
MacDonald, K., Yurovsky, D., & Frank, M. C. (2017). Social cues modulate the representations underlying cross-situational learning. Cognitive Psychology, 94, 67–84. https://doi.org/10.1016/j.cogpsych.2017.02.003
MacWhinney, B. (2005). Extending the competition model. International Journal of Bilingualism, 9(1), 69–84. https://doi.org/10.1177/13670069050090010501
Markman, E. M., & Wachtel, G. F. (1988). Children’s use of mutual exclusivity to constrain the meaning of words. Cognitive Psychology, 20(2), 121–157. https://doi.org/10.1016/0010-0285(88)90017-5
McMurray, B., Horst, J. S., & Samuelson, L. K. (2012). Word learning emerges from the interaction of online referent selection and slow associative learning. Psychological Review, 119(4), 831–877. https://doi.org/10.1037/a0029872
McNeill, D. (1985). So you think gestures are nonverbal? Psychological Review, 92, 350–371.
McNeill, D. (2000). Language and gesture. Cambridge University Press.
Mirman, D. (2014). Growth curve analysis and visualization using R. CRC Press, Taylor & Francis Group.
Mirman, D., Dixon, J. A., & Magnuson, J. S. (2008). Statistical and computational models of the visual world paradigm: Growth curves and individual differences. Journal of Memory and Language, 59(4), 475–494. https://doi.org/10.1016/j.jml.2007.11.006
Monaghan, P. (2017). Canalization of language structure from environmental constraints: A computational model of word learning from multiple cues. Topics in Cognitive Science, 9(1), 21–34. https://doi.org/10.1111/tops.12239
Monaghan, P., Brand, J., Frost, R. L. A., & Taylor, G. (2017). Multiple variable cues in the environment promote accurate and robust word learning. Proceedings of the 39th Annual Conference of the Cognitive Science Society, 817–822.
Monaghan, P., & Mattock, K. (2012). Integrating constraints for learning word-referent mappings. Cognition, 123(1), 133–143. https://doi.org/10.1016/j.cognition.2011.12.010
Monaghan, P., Mattock, K., Davies, R. A. I., & Smith, A. C. (2015). Gavagai is as gavagai does: Learning nouns and verbs from cross-situational statistics. Cognitive Science, 39(5), 1099–1112. https://doi.org/10.1111/cogs.12186
Nirme, J., Haake, M., Gulz, A., & Gullberg, M. (2020). Motion capture-based animated characters for the study of speech–gesture integration. Behavior Research Methods, 52, 1339–1354.
Pereira, A. F., Smith, L. B., & Yu, C. (2014). A bottom-up view of toddler word learning. Psychonomic Bulletin & Review, 21(1), 178–185. https://doi.org/10.3758/s13423-013-0466-4
Pierce, J. W., & MacAskill, M. R. (2018). Building experiments in PsychoPy. Sage.
Posner, M. I. (1981). Orienting of attention. The Quarterly Journal of Experimental Psychology, 32(1), 3–25. https://doi.org/10.1080/00335558008248231
Ramscar, M., Yarlett, D., Dye, M., Denny, K., & Thorpe, K. (2010). The effects of feature-label-order and their implications for symbolic learning. Cognitive Science, 34(6), 909–957. https://doi.org/10.1111/j.1551-6709.2009.01092.x
Roembke, T., & McMurray, B. (2016). Observational word learning: Beyond propose-but-verify and associative bean counting. Journal of Memory and Language, 87, 105–127. https://doi.org/10.1016/j.jml.2015.09.005
Shepherd, M., & Müller, H. J. (1989). Movement versus focusing of visual attention. Perception & Psychophysics, 46(2), 146–154. https://doi.org/10.3758/BF03204974
Siskind, J. M. (1996). A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61(1–2), 39–91.
Smith, K., Smith, A. D. M., & Blythe, R. A. (2011). Cross-situational learning: An experimental study of word-learning mechanisms. Cognitive Science, 35(3), 480–498. https://doi.org/10.1111/j.1551-6709.2010.01158.x
Trueswell, J. C., Lin, Y., Armstrong, B., Cartmill, E. A., Goldin-Meadow, S., & Gleitman, L. R. (2016). Perceiving referential intent: Dynamics of reference in natural parent–child interactions. Cognition, 148, 117–135. https://doi.org/10.1016/j.cognition.2015.11.002
Von der Malsburg, T. (2015). saccades: Detection of fixations in eye-tracking data. https://cran.r-project.org/package=saccades
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag. https://ggplot2.tidyverse.org
Wu, R., & Kirkham, N. Z. (2010). No two cues are alike: Depth of learning during infancy is dependent on what orients attention. Journal of Experimental Child Psychology, 107(2), 118–136. https://doi.org/10.1016/j.jecp.2010.04.014
Yoshida, H., & Burling, J. (2012). Highlighting: A mechanism relevant for word learning. Frontiers in Psychology, 3(262), 1–12. https://doi.org/10.3389/fpsyg.2012.00262
Yu, C., & Ballard, D. H. (2007). A unified model of early word learning: Integrating statistical and social cues. Neurocomputing, 70(13), 2149–2165. https://doi.org/10.1016/j.neucom.2006.01.034
Yu, C., & Smith, L. B. (2007). Rapid word learning under uncertainty via cross-situational statistics. Psychological Science, 18(5), 414–420. https://doi.org/10.1111/j.1467-9280.2007.01915.x
Yu, C., & Smith, L. B. (2012). Modeling cross-situational word–referent learning: Prior questions. Psychological Review, 119, 21–39. https://doi.org/10.1037/a0026182
Yu, C., Zhong, Y., & Fricker, D. (2012). Selective attention in cross-situational statistical learning: Evidence from eye tracking. Frontiers in Psychology, 3, 148. https://doi.org/10.3389/fpsyg.2012.00148
Yurovsky, D., Yu, C., & Smith, L. B. (2013). Competitive processes in cross-situational word learning. Cognitive Science, 37(5), 891–921. https://doi.org/10.1111/cogs.12035
Zuur, A. F., Ieno, E. N., & Elphick, C. S. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1(1), 3–14. https://doi.org/10.1111/j.2041-210X.2009.00001.x
Figure 1. Studies 1 and 2: Training trials, (a) early pointing cue condition, (b) late pointing cue condition.

Figure 2. Studies 1 and 2 testing trial example: participants see all 16 referents for a given condition and are asked to click on the corresponding object for each novel word.

Table 1. Study 1: Best-fitting general linear model results predicting trial accuracy by pointing cue condition

Figure 3. Study 1: The effect of pointing cue timing on behavioural response – mean accuracy across testing trials with standard error bars of all participants, grouped by pointing cue condition and trial type.

Table 2. Study 2, Analysis 1: The effect of pointing cue timing on behavioural response – best-fitting general linear model results predicting trial accuracy by pointing cue condition

Figure 4. Study 2, Analysis 1: The effect of pointing cue timing on behavioural response – average accuracy across testing trials with standard error bars of all participants, grouped by pointing cue condition and trial type.

Figure 5. Study 2, Analysis 2: Target fixation proportion during training using GCA – mean fixation proportion (aggregated across all participants and trials) during training in 250ms time bins by pointing cue condition. Grey shaded areas indicate 95% confidence intervals. Phase 1 = after pointing cue in early condition and before word occurrence in both conditions; Phase 2 = after word onset; and Phase 3 = after pointing cue in late condition. Note that as this figure shows aggregated mean fixation proportion across participants and trials per condition, looks to the cue in the late pointing condition prior to word occurrence likely stem from participants expecting the cue to appear from previous within-condition trials.

Figure 6. Study 2, Analysis 2: Target fixation proportion during training using GCA – GCA showing mean fixation proportion to target in 250ms time bins, by pointing cue condition. Data points indicate mean and standard error bars for target fixation proportion, aggregated across all participants and trials. Lines indicate model fit.

Table 3. Study 2, Analysis 2: Target fixation proportion during training using GCA – results of GCA of mean target fixation proportion – estimates of time terms between pointing cue conditions and model comparison of best-fitting model

Table 4. Study 2, Analysis 2: Target fixation proportion during training using GCA – post hoc t-tests comparing mean target fixation proportion at 250ms time bins by pointing cue condition

Table 5. Study 2, Analysis 3: When does target fixation during training predict word learning accuracy? Best-fitting general linear model results predicting trial accuracy with fixed effects of target fixation proportion during training

Table 6. Study 2, Analysis 4a: Does word–referent exposure influence fixation to target? Best-fitting linear model results predicting target fixation proportion by pointing cue condition and word–referent exposure

Figure 7. Study 2, Analysis 4a: Does word–referent exposure influence fixation to target? Mean target fixation proportion (aggregated across all participants, all words, and all trials) and standard error bars during label utterance (Phase 2 [after verbal label in both conditions]; Figure 5) by word–referent exposure and pointing cue condition.

Table 7. Study 2, Analysis 4b: Does average target fixation proportion by word–referent exposure during training affect accuracy? General linear model results showing interaction between average target fixation proportion during Phase 2 and word–referent exposure on accuracy at test