EXAMINING RATER PERCEPTION OF HOLDS AS A VISUAL CUE OF LISTENER NONUNDERSTANDING

Kim McDonough; Rachael Lindberg; Pavel Trofimovich

doi:10.1017/S0272263122000018

Published online by Cambridge University Press: 14 February 2022

Kim McDonough*, Concordia University, Montreal, Canada
Rachael Lindberg, Concordia University, Montreal, Canada
Pavel Trofimovich, Concordia University, Montreal, Canada
*Corresponding author. Email: kim.mcdonough@concordia.ca

Abstract

This study examined whether university students perceive holds (i.e., a listener’s temporary cessation of dynamic movement) as a visual cue of nonunderstanding. Conversations between English second language (L2) university students were sampled to extract episodes of other-initiated repair through open clarification requests (e.g., what?, sorry?). Brief, silent video clips were presented to 60 raters across two experiments who assessed the listener’s comprehension, which was their perception about how well the listener had understood the speaker. Experiment 1 tested whether raters can differentiate between the onset and release of listener holds while Experiment 2 examined whether they are sensitive to the sequential organization of holds. Results indicated that raters clearly differentiated between hold onsets and releases and were sensitive to the temporal position of holds in the entire repair sequence. Taken together, these findings suggest that holds are a reliable signal of nonunderstanding with potential implications for L2 teaching and assessment.

Type: Research Article
Information: Studies in Second Language Acquisition, Volume 44, Issue 5, December 2022, pp. 1240-1259

DOI: https://doi.org/10.1017/S0272263122000018
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: © The Author(s), 2022. Published by Cambridge University Press

Introduction

The goal of interaction is to communicate successfully, which entails delivering messages that can be understood by an interlocutor as well as correctly perceiving an interlocutor’s intended meaning. Remarkably, the vast majority of interaction occurs without any disruptions to the communication of meaning. However, sometimes a listener fails to understand a speaker’s utterance and chooses to seek clarification, which is a type of communication breakdown called nonunderstanding. Having a repertoire of methods for seeking clarification is an important component of interactional competence, which refers to a speaker’s ability to access, deploy, and adapt resources for the achievement of mutual understanding in a given interactional context (Roever & Kasper, 2018). In addition to verbal means of expression, interactionally competent interlocutors also deploy a wide range of nonverbal behaviors, such as eye contact, gestures, facial expressions, and posture. The importance of nonverbal behaviors for interactional competence has been recognized in second language (L2) assessment research. For example, prior research has shown that test takers who were rated as linguistically weak but used nonverbal behaviors associated with active listening (e.g., head nodding and backchannel cues) were viewed as interactionally competent (Jenkins & Parra, 2003). In addition, rater perception studies have shown that nonverbal features of communication, such as eye gaze, facial expressions, gestures, and body language, contribute to authentic interaction and rater evaluations (Ducasse & Brown, 2009; May, 2011). In light of the importance of nonverbal behaviors within interactional competence, the current study examines the visual component of nonunderstanding episodes with clarification requests.

When nonunderstanding occurs, its resolution is locally accomplished through the collaboration and coconstruction of meaning by interlocutors (Firth, 1996; Wagner, 1996). Nonunderstanding has been studied through the focus on repair, which includes practices for interrupting ongoing conversation to deal with problems in speaking, hearing, or understanding (e.g., Schegloff, 1997, 2007; Schegloff et al., 1977), to examine how interlocutors use both verbal messages and nonverbal behaviors to remediate problems. An example of nonunderstanding in which the listener initiates repair through a clarification request is provided in Example 1.

Example 1. Nonunderstanding episode

P61: I’m assuming you’re a little older?

P62: Sorry?

P61: How old are you?

P62: I’m twenty-nine.

When the listener (P62) failed to understand the speaker’s (P61) initial utterance, she initiated repair through a general or open clarification request. Within the repair sequence, the listener’s verbal repair cue (sorry?) serves as the first part of an adjacency pair that initiates action, while the speaker’s response (how old are you?) is the second part that carries out the repair. The resolution of nonunderstanding is demonstrated when the listener provides her age in the final turn. The fact that P61 reformulates her initial question in the third turn indicates that P62’s request for repair was understood as such, which exemplifies the next-turn proof procedure for providing evidence of participant understanding of repair practices (Edwards, 2004; Sidnell, 2014).

As defined by Schegloff and Sacks (1973), adjacency pairs, such as the request for clarification and response illustrated in Example 1, consist of two turns produced by different speakers that are adjacent and ordered so that the first part necessarily precedes the second part. The two parts are related such that certain types of responses are expected, such as an invitation followed by either acceptance or refusal. In Example 1, the first adjacency pair began with P61’s query about P62’s age in Turn 1. However, the listener was not able to complete that pair with an answer about her age because she failed to understand the question. This nonunderstanding triggered the insertion of an expansion adjacency pair in the form of repair, which is sequentially ordered with the request for clarification (Turn 2) followed by a reformulation of the speaker’s question (Turn 3). The second part of the original adjacency pair (i.e., answering the question) is only given in Turn 4 after the inserted repair sequence was complete. As described by Stivers (2013), analyzing the sequential organization of conversation, such as adjacency pairs, is a key tenet of conversation analysis distinguishing it from other approaches to interaction that examine utterances in isolation. When identifying conversational practices, such as repair practices, a key goal is to identify features that have distinctive characteristics, appear in specific locations in a turn or sequence, and serve meaningful actions (Heritage, 2011).

In studying the practices of repair, conversation analysis researchers have pointed out that repair sequences like the one illustrated in Example 1 often occur with nonverbal behaviors that also follow a sequential organization. For example, Seo and Koshik (2010) analyzed tutoring sessions between university students for whom English was either their first language (L1) or their L2, reporting that listeners used two types of head movements when initiating repair: (a) a sharp head tilt or turn to the side with eye gaze and (b) a head poke (i.e., extending the head forward) accompanied by a forward lean. The listener initiated the movements after the speaker completed the utterance with the problematic feature and held the position until the problem had been resolved. Although these held movements most often co-occurred with verbal repair initiators (e.g., huh? sorry?), there were episodes in which the visual cue initiated repair in isolation. Also focusing on English interaction, Kendrick (2015) similarly described two visual-only repair sequences in which either a lateral head tilt or a frown was sufficient to initiate repair, although those visual cues more typically co-occurred with a verbal repair initiator.

Turning to nonverbal components of repair in other languages, Floyd et al. (2016) found that listeners in Northern Italian, Cha’palaa, and Argentine sign language who initiate verbal other-repair often temporarily hold a dynamic movement static, which the researchers refer to as holds. For the two spoken languages specifically, the behavior held static was most often eye gaze, followed by head direction (left/right), upper body lean, eyebrow position, and head position (up/down). Their analysis of the sequential organization of the holds found that listeners initiated a hold (i.e., hold onset) and maintained it through the end of their clarification request, and they disengaged the hold (i.e., hold release) during or shortly after the speaker performed the repair. Forward leans have also been found in Mandarin other-initiated repair in the form of intervening questions (i.e., repair initiated during rather than after the speaker’s problematic utterance), with listeners leaning forward and holding their lean until a response is provided (Li, 2014). These cross-linguistic findings are similar to the role of head movements and forward leans in English repair sequences identified previously (Kendrick, 2015; Seo & Koshik, 2010) and provide further evidence that the onset and release of the held movements signal the beginning and resolution of nonunderstanding, respectively.

Additional studies have provided evidence for the nonverbal component of repair across languages. For example, in Swiss German sign language, turn-final holds are released when the listener has understood the speaker or the speaker has acknowledged the listener’s request (Groeber & Pochon-Berger, 2014), which provides additional evidence that releasing a temporarily static movement is a signal of resumed understanding. The cessation of movement during repair initiation has also been found in Argentinian sign language (Manrique, 2016) and Yélî Dnye (Levinson, 2015) in the form of a freeze look. In these nonunderstanding episodes, the listener initiates repair by staring at the speaker without moving and maintains the freeze until the problem has been resolved or the listener pursues repair verbally. In sum, although repair initiation typically occurs through both verbal and nonverbal components, researchers acknowledge that some repair initiation utilizes primarily nonverbal resources (Dingemanse, 2015; Levinson, 2015; Manrique, 2016; Seo & Koshik, 2010).

The extensive conversation analysis research has provided valuable insight into the nonverbal behaviors associated with repair practices, specifically the types of movements that are held static during listener holds and their sequential organization as onsets and releases. By identifying the nonverbal signals of repair practiced by multiple speakers of different languages in diverse conversational settings, these researchers have demonstrated generality in repair practices, which can be understood as the extent to which practices are organized in the same way across contexts (see Chenail, 2010, for discussion of generalizability and related constructs in qualitative research). Inspired by this line of research, we were also interested in generality and carried out a series of studies that examined whether nonverbal aspects of repair practices, specifically clarification requests (McDonough et al., 2019, 2021) and recasts (McDonough et al., 2015, 2020a, 2020b), were organized similarly in conversations between university students.

Besides providing evidence of generality, however, we were also interested in exploring whether these nonverbal behaviors are distinctive characteristics of repair practices. If specific visual cues (such as a head poke or forward lean) are uniquely associated with nonunderstanding, then they should not occur when a listener has understood the speaker. In such understanding episodes, a listener might ask a follow-up question rather than initiate repair, as illustrated in Turn 2 of Example 2.

Example 2. Understanding episode

P230: Yeah it’s good for me now but yeah.

P229: Did you like French?

P230: It’s really hard. It is harder than English.

P229: Yes.

Unlike the first part of an adjacency sequence in Example 1, the listener’s follow-up question in Example 2 (did you like French?) does not initiate a repair sequence. The speaker’s response in Turn 3 completes the adjacency pair by providing an answer to the question, which indicates that the follow-up question was understood as a request for additional information as opposed to clarification. Because there was no breakdown in the communication of meaning, the listener is unlikely to deploy a hold during the follow-up question if holds are uniquely associated with nonunderstanding. By comparing the listener visual cues that occur during both understanding and nonunderstanding episodes, we aimed to identify whether holds and other visual cues previously identified in repair sequences are distinctive (Heritage, 2011), in the sense that they are uniquely and reliably associated with nonunderstanding.

Conversation analysis researchers would likely adopt a micro-analytic approach to address the question of distinctiveness, such as by comparing the nonverbal behaviors that occur during repair sequences and follow-up questions. However, as primarily quantitative researchers, we adopted an alternative approach that elicited the perceptions of naïve observers to determine whether they can differentiate between understanding and nonunderstanding episodes. Clearly, visual cues of nonunderstanding are “real” because interlocutors respond to them by reformulating their prior utterances, and conversation analysts have used next-turn proof procedures to document their occurrence. Our question was whether these cues are sufficiently distinctive that they can be perceived by external observers from the same speech community as the interlocutors (henceforth, raters), which in this case was university students. As pointed out by Toerin (2014), quantitative research that applies the findings of conversation analysis typically explores the association between specific interactional practices and other aspects of the social world. Reflecting this orientation, our work explores whether the nonverbal behaviors of nonunderstanding have implications for L2 teaching and assessment by first demonstrating that these behaviors can be perceived. If members of a speech community can neither detect a nonverbal behavior nor associate it with a distinct interactional meaning, then this would raise doubts about its potential relevance or application to broader issues in L2 learning.

Adopting this methodological orientation, McDonough et al. (2019) compared listener visual cues and rater perceptions of understanding and nonunderstanding episodes from conversations between L2 English speakers (N = 21) and a French–English bilingual listener who had been instructed to provide feedback as appropriate. Analysis of video-recorded conversations revealed that nonunderstanding episodes contained more listener holds and head nods than the understanding episodes, which provided evidence of the generality of nonverbal repair practices. Next, students (N = 66) from the same university were randomly assigned to rating conditions that manipulated access to the speaker’s voice (clear or distorted) and the listener’s face (clear or blurred) to rate speaker comprehensibility (Hard for me to understand and Easy for me to understand) and listener understanding (He understood 0% and He understood 100%) using a 100-millimeter scale. Although decontextualizing and manipulating the videos poses challenges to the ecological validity of the interactions, the experimental control allows for the identification of the unique contribution of nonverbal behaviors (i.e., what the visual component adds to the verbal repair signal). The ratings showed that raters with access to the listener’s face rated listener comprehension lower during nonunderstanding episodes than raters who only heard the speaker’s voice. Put simply, seeing the listener’s face provided the raters with visual information that helped them determine when the listener had trouble understanding the speaker. However, as an exploratory study, the findings were based on the behavior of a single listener who had been asked to provide feedback, which limited the generalizability of the study’s findings.

To confirm the association between holds and nonunderstanding and explore the salience of visual cues to raters, McDonough et al. (2021) carried out a replication study drawing on a corpus of conversations between L2 English university students. Analyzing the transcripts, they identified 79 nonunderstanding episodes of the same type tested in the initial study. They then analyzed the video-recordings to determine whether those episodes contained holds and other visual cues identified in the initial study (e.g., head nods, blinks). They selected a subset of those episodes (n = 35) for rating and paired them with an understanding episode (n = 35) from the same interlocutors. Students at the same university (N = 90) rated the 70 episodes in terms of speaker comprehensibility and listener comprehension using the same sliding scales, with raters randomly assigned to conditions that manipulated access to the speaker’s voice and face as in the initial study. Both the analysis of the 79 episodes and the ratings of the 35 matched episodes confirmed the association between holds and nonunderstanding reported in conversation analysis studies and the initial exploratory study. New analysis to classify holds based on the type of held movements, where some holds involved a single movement while others had multiple movements, revealed that 67% of the holds included a head movement (e.g., tilts, pokes, turns) while 40% had an open mouth, and 37% had a forward lean. Although raters clearly recognized that listeners had comprehension difficulties for nonunderstanding episodes, they could differentiate between understanding and nonunderstanding episodes equally well through access to the speaker’s voice or the listener’s face or both, which raises questions about any additive benefits for visual cues when assessing listener comprehension.

Taken together, the findings of the two rating studies with university students (McDonough et al., 2019, 2021) confirm that L2 English university students clearly recognize their peers’ holds and associate them with listener nonunderstanding, which confirms the observations of conversation analysts. However, the extent to which those holds provide additional meaningful information beyond the listener’s verbal repair initiators remains unclear due to the conflicting findings for rating conditions. Although holds were uniquely associated with nonunderstanding, their occurrence did not consistently aid observers in detecting challenges with listener comprehension. Previous studies demonstrated that some repair initiation occurs visually only (Dingemanse, 2015; Levinson, 2015; Manrique, 2016; Seo & Koshik, 2010), which suggests that the nonverbal cues of nonunderstanding in isolation are meaningful enough to elicit repair between interlocutors. It is unknown, however, if such signals are sufficiently useful for identifying nonunderstanding to warrant pedagogical interventions to raise L2 speakers’ awareness of nonverbal components of repair practices.

In summary, previous conversation analysis research has provided rich information about the occurrence and sequential organization of holds and other nonverbal features of repair initiation, and subsequent quantitative studies have confirmed that L2 English university students uniquely associate holds with problems in nonunderstanding. Although visual only repair initiation (i.e., holds and freeze looks) has been shown to occur during conversation, previous research has not specifically examined if external observers can recognize them as signals of nonunderstanding when presented without any speech. If holds communicate meaning visually, then the nonunderstanding that they convey should be detectable even in the absence of the speaker’s initial utterance or the listener’s clarification request. To test this possibility, the current study presents silent videos showing holds during other-initiated repair with clarification requests (e.g., sorry, huh) and elicits rater perceptions about the listener’s comprehension (i.e., to what extent the listener appeared to understand the speaker) in two experiments. Reflecting the sequential organization of holds, Experiment 1 tests raters’ ability to differentiate between hold onsets that signal a problem versus hold releases that indicate a return to understanding. To further test the association between holds and nonunderstanding and the importance of their sequential organization, Experiment 2 tests raters’ ability to differentiate among understanding episodes and to distinguish holds presented in their naturally occurring four-turn sequence and those presented in reversed order. If the meaningfulness of holds is linked to the sequential order of onsets and releases, then raters should be more successful at identifying problems with listener comprehension when the holds appear in their naturally occurring sequence. 
Based on prior research that elicited rater perceptions (McDonough et al., 2019, 2021), we predicted that perceived listener comprehension would be lower for hold onsets (as compared to hold releases) and lower in naturally occurring hold episodes (as opposed to understanding episodes or reversed hold episodes).

Experiment 1

Conversation Corpus Overview

The videos rated in the current study were drawn from the Corpus of English as a Lingua Franca Interaction (CELFI), which consists of 224 paired conversations between L2 English students at Montreal-area universities (McDonough & Trofimovich, 2019) with most of them studying at Concordia University (67%). As students, they had met a minimum English proficiency requirement to be admitted to their universities, which was a TOEFL iBT score of 75 or equivalent plus university EAP language courses. When asked to report their latest standardized proficiency test results, 62% of the students reported scores from the TOEFL iBT (Mdn = 110, IQR = 21) or IELTS (Mdn = 7, IQR = 1) tests. Based on the minimum requirement and the range of reported proficiency test scores, the students in the CELFI corpus range from the B2 to C2 levels in the Common European Framework of Reference. Students were randomly assigned to carry out three communicative tasks (posing solutions to problems encountered when moving to a new city, a close-call narrative, and an academic discussion task) with someone from a different L1 background. The self-reported gender composition of the pairs was controlled so that there were approximately the same number of male–male, female–female, and female–male dyads. The students’ interaction while carrying out the three tasks was audio- and video-recorded, their eye gaze was tracked, and their skin conductance was monitored. They also completed a battery of questionnaires (anxiety, motivation, social networks, and acculturation), a working memory task, rating scales after each task (motivation, anxiety, flow, comprehensibility, collaboration), a stimulated recall session about the final task, and a debriefing interview eliciting explanations for their task ratings.
These data were collected as part of CELFI, but only transcripts of the audio-recordings and video extracts from their conversations were used for the two experiments reported here.

Sampling Nonunderstanding Episodes from CELFI

All 224 CELFI transcripts were analyzed for nonunderstanding episodes that consisted of the four-turn sequence: (a) the speaker’s initial utterance (Turn 1), (b) the listener’s nonspecific, open clarification request, such as sorry, pardon, what, or huh (Turn 2), (c) the speaker’s repair (Turn 3), and (d) the listener’s response showing understanding (Turn 4). Example 3 illustrates a nonunderstanding episode in which the listener requests clarification of the speaker’s initial question in Turn 2 (sorry?) after which the speaker rephrases the question in Turn 3 and the listener answers the question in Turn 4.

Example 3. Nonunderstanding episode

P294: Yeah … do you need to take course in your master?

P293: Sorry?

P294: Do you need to take courses or you only do research?

P293: No I–mine is course based masters.

This analysis identified 139 listeners in the corpus (139/448 or 31%) who produced at least one clarification request of this type.
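The four-turn pattern described above can be approximated as a first-pass search over a transcript. The sketch below is only an illustration and is not the procedure the authors report using: the turn representation, the function names, and the rule that Turn 2 must consist solely of one request word are assumptions. It flags candidate sequences and cannot judge whether Turn 4 actually shows understanding, which would still require human coding.

```python
import re
from typing import List, Tuple

# Open clarification requests named in the text; the matching rule is an assumption.
OPEN_REQUESTS = {"sorry", "pardon", "what", "huh"}

Turn = Tuple[str, str]  # (speaker id, utterance); a hypothetical transcript format


def is_open_request(utterance: str) -> bool:
    """True if the whole turn is a single open clarification request such as 'Sorry?'."""
    word = re.sub(r"[^a-z]", "", utterance.lower())
    return word in OPEN_REQUESTS


def find_candidate_episodes(turns: List[Turn]) -> List[int]:
    """Return start indices of four-turn sequences: utterance, request, repair, response."""
    starts = []
    for i in range(len(turns) - 3):
        (s1, _), (s2, u2), (s3, _), (s4, _) = turns[i : i + 4]
        # Speaker and listener must alternate, and Turn 2 must be an open request.
        if s1 == s3 and s2 == s4 and s1 != s2 and is_open_request(u2):
            starts.append(i)
    return starts


# Example 3 from the text, encoded as (speaker, utterance) pairs.
example_3 = [
    ("P294", "Yeah … do you need to take course in your master?"),
    ("P293", "Sorry?"),
    ("P294", "Do you need to take courses or you only do research?"),
    ("P293", "No I–mine is course based masters."),
]
candidates = find_candidate_episodes(example_3)

# Share of listeners with at least one such request, from the counts in the text.
share = round(100 * 139 / 448)
```

Applied to Example 3, the search flags the sequence that begins at the first turn, and the reported counts (139 of 448 listeners) round to the 31% given in the text.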

To ensure comparability across the listeners’ nonunderstanding episodes to be used in this experiment, the following inclusion criteria were applied: (a) the speaker’s initial utterance contained at least three words; (b) there was minimal speaker–listener overlap between turns; (c) the hold onset occurred in Turn 2; (d) the hold release occurred in Turn 4; and (e) the hold movement (e.g., head tilt or forward lean) was controlled. Application of the inclusion criteria led to the selection of 25 nonunderstanding episodes with four types of holds: forward lean only (5), forward lean with raised eyebrows or smile (7), head poke only (6), and head poke with raised eyebrows or smile (7). In terms of their background information, the listeners for these hold videos (14 women and 11 men) were students in undergraduate (56%) and graduate (44%) degree programs and spoke 13 different L1s with the most frequent being Mandarin (28%), Tamil (16%), Farsi (12%), and Bengali (8%). They ranged in age from 18 to 29 with a mean age of 22.6 years (SD = 2.8). They had been living in Canada for a mean of 2.9 years (SD = 4.6) and had studied English for a mean of 14.2 years (SD = 4.8). Their reported proficiency test scores were similar to the median values for the students in the larger corpus.

Rating Stimuli

To create the hold onset and release videos, the four-turn nonunderstanding episodes were extracted using video editing software (VideoPad) into two clips. The 25 hold onset videos showed the listener from the last second of Turn 1, the hold onset in Turn 2, and the first second of Turn 3 when the hold was maintained. The 25 hold release videos showed the listener from the last second of Turn 3 with the hold, the hold release, and the remainder of Turn 4. As the fourth turn varied in length across episodes, it was cut at a natural speaking point to be the same length for all release videos (~2 seconds). On average, the hold onset videos were 3.76 seconds long (SD = 0.91), and the hold release videos were 3.08 seconds long (SD = 0.74). As the video clips were short, a 3-second countdown was added to the beginning of each one to allow raters time to prepare for the start of the video. The videos were presented without sound so that no verbal contributions from the speaker or listener could influence raters’ judgments of listener comprehension. If raters heard the listeners request clarification (e.g., sorry? what? pardon me?), then it would clearly indicate that the listener had not understood, and the raters could give low comprehension scores without considering the listeners’ visual cues. Without sound, however, the raters could only orient to the listeners’ nonverbal behavior when assessing their degree of comprehension.
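The clip boundaries described above can be expressed as simple arithmetic on turn timestamps. The following sketch assumes that each episode has been annotated with turn boundaries in seconds; the Episode fields, the function names, and the example timestamps are hypothetical, and the clips themselves were cut in VideoPad rather than in code.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Turn boundaries in seconds from the start of the recording (hypothetical values)."""
    turn1_end: float
    turn3_start: float
    turn3_end: float
    turn4_start: float


def onset_clip(ep: Episode, pad: float = 1.0) -> tuple:
    """Last second of Turn 1, all of Turn 2, and the first second of Turn 3."""
    return (ep.turn1_end - pad, ep.turn3_start + pad)


def release_clip(ep: Episode, pad: float = 1.0, turn4_len: float = 2.0) -> tuple:
    """Last second of Turn 3 followed by roughly two seconds of Turn 4."""
    return (ep.turn3_end - pad, ep.turn4_start + turn4_len)


# Invented timestamps for illustration only.
ep = Episode(turn1_end=10.0, turn3_start=11.5, turn3_end=13.5, turn4_start=13.5)
onset_start, onset_end = onset_clip(ep)
release_start, release_end = release_clip(ep)
```

With these invented timestamps the onset clip runs for 3.5 seconds and the release clip for 3.0 seconds, which is in the range of the mean durations reported above.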

Raters

Raters included 30 students (21 females, 9 males) recruited from the same Montreal universities as the listeners in the videos on the assumption that they would represent the same student population (i.e., potential peers of the students in the videos). They were undergraduate (67%) and graduate (33%) students between the ages of 19 and 41 (M = 25.03, SD = 5.74). Their L1s included Canadian or World Englishes (11), Portuguese (4), Arabic (3), French (3), Mandarin (2), Spanish (2), Tamil (2), Manipuri, Hebrew, and Danish. The L2 English raters had been studying English on average for 17.44 years (SD = 6.20) and the non-Canadian born raters reported a mean length of residence of 3.8 years (SD = 6.1). As compared to the listeners in the hold videos, the English L2 raters had a similar length of residence in Canada, similar amount of prior English study, and equally diverse L1 backgrounds. The greater proportion of raters from English L2 backgrounds (63%) reflected the distribution of English L2 (52%) and English L1 (48%) students at Concordia University where the majority of the listeners and raters were studying, and the linguistic diversity of Montreal where only 16% of the population reports English as their L1 and only 23% report using English as their predominant home language (Statistics Canada, 2017).

Rating Materials and Procedure

The entire procedure was conducted using the LimeSurvey online interface (https://www.limesurvey.org), where raters first completed the consent form (2 minutes), and then were given instructions for the rating procedure and explanations of the rating criteria (2 minutes). After the raters practiced by rating two video clips from listeners whose data were not included in the study (2 minutes), the 50 target video clips were presented to each rater in a unique random order. Each video appeared on a separate survey page and played automatically, so the rater could view it only once. Below the videos were 100-millimeter slider scales which raters used to evaluate the listener’s comprehension (i.e., how much they thought the student in the video understood the speaker), which was the key variable of interest for this experiment. The endpoints for the comprehension sliding scale were this student understood 0% (negative endpoint on the left side) and this student understood 100% (positive endpoint on the right side). The initial slider position was set in the middle of the scale. An additional sliding scale was used to elicit the raters’ perceptions about how easily the listener seemed to understand the speaker (extremely difficult and extremely easy), which was intended to capture the listeners’ processing effort. A third sliding scale was used to check whether the edited videos looked natural (extremely unnatural and extremely natural). After the video rating task (20 minutes), the raters filled out a background questionnaire (5 minutes), a personality test (3 minutes), and a facial expression recognition test (15 minutes). The analysis focuses on the listener comprehension ratings only because they provide the most direct assessment of whether raters associated listener holds with a lack of understanding.¹ Participants were remunerated $30 for their time.
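The presentation logic described above (a separate random order of the 50 clips for each rater, and a slider read as a 0 to 100 score) can be sketched as follows. This is not LimeSurvey's own mechanism or API; the function names and the choice to seed the shuffle with a rater number are assumptions made only so that the illustration is reproducible.

```python
import random

N_CLIPS = 50  # 25 hold onset videos plus 25 hold release videos


def presentation_order(rater_id: int, n_clips: int = N_CLIPS) -> list:
    """Return a rater-specific shuffled order of clip indices (seeded for reproducibility)."""
    order = list(range(n_clips))
    random.Random(rater_id).shuffle(order)
    return order


def slider_to_percent(position_mm: float, scale_mm: float = 100.0) -> float:
    """Map a slider position on the 100-millimeter scale to a 0-100 comprehension score."""
    if not 0.0 <= position_mm <= scale_mm:
        raise ValueError("slider position outside the scale")
    return 100.0 * position_mm / scale_mm


# One order per rater for the 30 raters in Experiment 1.
orders = {rater: presentation_order(rater) for rater in range(1, 31)}
initial_score = slider_to_percent(50.0)  # slider starts at the midpoint
```

Seeding by rater number does not strictly guarantee distinct orders, but with 50 clips a repeated order is vanishingly unlikely.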

Results

Prior to addressing the research question, we first examined whether the raters were consistent in their evaluation of listener comprehension by calculating the two-way mixed average-measures intraclass correlation coefficients for the hold and release videos. The coefficient was .93 for both video types, which indicates a high level of agreement across raters. To obtain one listener comprehension score for hold videos and one score for release videos for each rater, we computed a mean score by summing the ratings and dividing by the total number of videos for the hold and release videos separately. To determine whether raters can recognize the hold onset as a signal of nonunderstanding and the hold release as a signal of resumed understanding, their listener comprehension ratings were compared. Raters assessed listener comprehension lower in the onset videos (M = 34.93, SD = 9.40) than the release videos (M = 68.70, SD = 14.21). A paired-samples t test indicated that the difference was statistically significant, t(29) = 13.06, p < .001, d = 2.80, with a large effect size (d ≥ 1.40) based on benchmarks for applied linguistics research (Plonsky & Oswald, 2014).
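As a rough illustration of the comparison above, the following sketch computes a paired-samples t statistic from per-rater mean scores and an effect size from summary statistics. The article does not state which formula was used for d; the version below (mean difference divided by the root mean square of the two SDs) is an assumption that happens to reproduce the reported value from the reported means and SDs. The function names and the toy ratings are hypothetical and are not the study data.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Paired-samples t statistic and degrees of freedom for (second - first)."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def cohens_d_avg_sd(m1, sd1, m2, sd2):
    """Cohen's d as mean difference over the root mean square of the two SDs.

    Assumption: this is one common variant of d; the article does not
    specify which variant the authors computed.
    """
    pooled = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m2 - m1) / pooled


# Summary statistics reported in the text (onset vs. release videos).
d_reported_stats = cohens_d_avg_sd(34.93, 9.40, 68.70, 14.21)
print(round(d_reported_stats, 2))  # 2.8

# Hypothetical per-rater mean scores (0-100 scale) for five imaginary raters.
toy_onset = [30.0, 42.0, 25.0, 38.0, 35.0]
toy_release = [65.0, 80.0, 55.0, 70.0, 75.0]
t_toy, df_toy = paired_t(toy_onset, toy_release)
```

With the actual data, each list would hold 30 values (one mean per rater), giving the 29 degrees of freedom reported above.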

To explore whether their perceptions about listener comprehension varied by hold type, we compared the raters’ hold onset ratings for episodes with leans, head pokes, leans with facial expressions, and head pokes with facial expressions. As shown in Table 1, the raters provided the lowest listener comprehension ratings for lean holds, followed by head poke holds, lean holds with facial expression, and head poke holds with facial expression.

TABLE 1. Listener comprehension ratings by held behavior (out of 100)

A repeated-measures ANOVA (sphericity assumed) indicated that there was a statistically significant difference in perceived listener comprehension ratings, F(3, 87) = 29.55, p < .0001, partial η² = .51. Post hoc comparisons with a Bonferroni adjustment indicated that there were significant differences (p ≤ .015, d ≥ 0.88) for all paired comparisons except for head poke versus lean with facial expression (p = 1.00, d = .12) (see Table 2).

TABLE 2. Post hoc tests for comprehension ratings by hold type
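The post hoc procedure can be sketched as follows. With four hold types there are six pairwise comparisons, and one common form of the Bonferroni adjustment multiplies each unadjusted p value by the number of comparisons and caps the result at 1, which would account for an adjusted value of p = 1.00. Whether the authors' software applied the adjustment in exactly this way is an assumption, and the unadjusted p values below are hypothetical placeholders rather than values from the study.

```python
from itertools import combinations

# The four hold types compared in Experiment 1.
hold_types = [
    "lean",
    "head poke",
    "lean + facial expression",
    "head poke + facial expression",
]

# All unordered pairs of hold types: 4 choose 2 = 6 comparisons.
pairs = list(combinations(hold_types, 2))


def bonferroni(p_values):
    """Multiply each p value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical unadjusted p values, one per pair (NOT the study's values).
raw_p = [0.0004, 0.0001, 0.00002, 0.62, 0.0025, 0.001]
adjusted_p = bonferroni(raw_p)

for (a, b), p in zip(pairs, adjusted_p):
    print(f"{a} vs. {b}: adjusted p = {p:.3f}")
```

Under this form of the adjustment, any unadjusted p value above 1/6 is reported as 1.00, so an adjusted p of 1.00 indicates only that the unadjusted value was not small.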

Discussion

To summarize the findings of Experiment 1, these raters clearly perceived hold onsets as a signal of a listener’s difficulty comprehending the speaker and hold releases as signaling a return to understanding. Thus, it appears that English L1 and L2 raters from the same university community can interpret L2 English speakers’ holds accurately as providing visual signals of both the initiation and resolution of nonunderstanding. The findings support the results of prior rating studies that reported an association between holds and nonunderstanding (McDonough et al., 2019, 2021) and provide evidence that raters associate hold releases with a return to understanding, which has not been tested previously. Furthermore, analysis of the specific hold types revealed that the raters attributed the lowest comprehension ratings to body lean holds and head poke holds. When these held movements also occurred with held facial expressions, comprehension ratings were higher. It is possible that the facial expressions, such as smiling, might dilute the nonunderstanding signal of the lean or head poke because smiling is often interpreted as a sign of understanding (McDonough et al., 2021). Indeed, when asked to explain what visual cues they based their ratings on, 16 raters mentioned smiling as a signal of understanding as compared to only three raters who stated that it was a sign of nonunderstanding. In addition, smiling might occur with a variety of nuanced expressions that are temporally linked to the verbal utterances in ways that communicate unique meanings, which could be confirmed through micro-analytic techniques.

Raters’ ability to differentiate between hold onsets and releases provides a possible explanation for the null findings for rating condition (speaker’s voice vs. listener’s face) reported in McDonough et al. (2021). By providing raters with a visual image of the listener’s face during the speaker’s initial utterance only (i.e., before the hold onset), their rating stimuli failed to present the hold onset in isolation or in its naturally occurring sequence, thereby likely diminishing its impact. It is also possible that holds are more meaningful as a sign of nonunderstanding when raters have access to their entire sequential organization with both the onset and release. Although Experiment 1 demonstrated that raters can differentiate between hold onsets and releases when they are presented in isolation, it is not known whether raters are sensitive to their sequential organization across an entire four-turn sequence. Based on the key premise of conversation analysis that talk is sequentially organized, it seems plausible that raters could differentiate between four-turn visual sequences with a hold (i.e., nonunderstanding episodes) and without a hold (i.e., understanding episodes). Furthermore, as raters clearly differentiated between the visual cues associated with a hold onset and release, their ability to interpret hold episodes should be greater when those two signals are presented in their naturally occurring sequence (i.e., in Turn 2 and Turn 4, respectively) as opposed to the opposite order.

To test these possibilities, Experiment 2 compared rater perceptions about listener comprehension for four-turn sequences that showed listeners’ visual cues during either understanding episodes (i.e., without a hold) or nonunderstanding episodes (i.e., with a hold). We expected that raters would assign lower comprehension ratings to the nonunderstanding episodes as these included listener holds. To test raters’ sensitivity to the sequential organization of holds, we also elicited their perceptions about listener comprehension in nonunderstanding episodes when the hold onset and release were reversed. Because the meaning of a hold is indicated by both its onset and release in that order, we predicted that perceived listener comprehension would be lower when the hold was presented in its naturally occurring sequence as opposed to the reversed order.

Experiment 2

Sampling Episodes from CELFI

As in Experiment 1, episodes were sampled from the CELFI corpus of L2 English speakers. Because of the narrower focus on the sequential organization of holds, the initial episode pool consisted of 42 transcripts identified during Experiment 1 as having a nonunderstanding episode with a hold onset in Turn 2 and a release in Turn 4. We returned to those 42 transcripts to locate all listeners who engaged in (a) a second nonunderstanding episode and (b) an understanding episode. Example episodes from the same listener (P62) are provided in Table 3. Whereas the listener requested clarification of the speaker’s initial utterance in Turn 2 of both nonunderstanding episodes (sorry?), she asked a follow-up question in Turn 2 of the understanding episode (how they published that?).

TABLE 3. Sample nonunderstanding and understanding episodes

Finally, the videos of the new episodes were analyzed to ensure that (a) the nonunderstanding episodes depicted a hold onset in Turn 2 and hold release in Turn 4 and (b) the understanding episodes did not include holds. This process identified 12 listeners who each contributed one understanding and two nonunderstanding episodes that met the criteria for a total of 36 episodes. Six of these 12 listeners had contributed one nonunderstanding episode to Experiment 1 rating stimuli. In terms of their background information, the listeners (50% women) were students in undergraduate (50%) and graduate (50%) degree programs and spoke six different L1s with the most frequent being French (33%), Mandarin (17%), Tamil (17%), and Farsi (17%). They ranged in age from 18 to 29 with a mean age of 23.1 years (SD = 3.6). They had been living in Canada for a mean of 4.0 years (SD = 4.6) and had studied English for a mean of 13.6 years (SD = 4.0). Their reported proficiency test scores were similar to the median values for the entire corpus.