
Leading voices: dialogue semantics, cognitive science and the polyphonic structure of multimodal interaction

Published online by Cambridge University Press: 05 December 2022

Andy Lücking*
Affiliation:
Laboratoire de Linguistique Formelle (LLF), Université Paris Cité, CNRS – UMR 7110, Paris, France Text Technology Lab, Goethe University Frankfurt, Frankfurt am Main, Germany
Jonathan Ginzburg
Affiliation:
Laboratoire de Linguistique Formelle (LLF), Université Paris Cité, CNRS – UMR 7110, Paris, France
*Corresponding author. Email: andy.luecking@u-paris.fr

Abstract

The neurocognition of multimodal interaction – the embedded, embodied, predictive processing of vocal and non-vocal communicative behaviour – has developed into an important subfield of cognitive science. It leaves a glaring lacuna, however, namely the dearth of a precise investigation of the meanings of the verbal and non-verbal communication signals that constitute multimodal interaction. Cognitively construable dialogue semantics provides a detailed and context-aware notion of meaning, and thereby contributes the content-based identity conditions needed for distinguishing multimodal constituents that are defined syntactically or by form. We exemplify this by means of two novel empirical examples: dissociated uses of negative polarity utterances and head shaking, and attentional clarification requests addressing speaker/hearer roles. On this view, interlocutors are described as co-active agents, thereby motivating a replacement of sequential turn organisation as a basic organising principle with notions of leading and accompanying voices. The Multimodal Serialisation Hypothesis is formulated: multimodal natural language processing is driven in part by a notion of vertical relevance – relevance of utterances occurring simultaneously – which we suggest supervenes on sequential (‘horizontal’) relevance – relevance of utterances succeeding each other temporally.

© The Author(s), 2022. Published by Cambridge University Press

1. Introduction

Let us face it: it is all about meaning. A phoneme is the smallest meaning-distinguishing sound, a morpheme a meaning-carrying form. Most distinctions even in syntax – long regarded as the core of linguistics – are based on semantic considerations. Now, investigating meanings poses a perplexing problem: we cannot directly encounter them or point at them or count them, and talking about meaning itself requires meaning. There are different ways to proceed in this situation. In psycholinguistics, for instance, experimental studies are used, where meaning is observed indirectly via observable features of language users’ processing of stimulus sentences. A quite different approach has been developed in philosophy and formal semantics: here, the act of interpretation is objectified in terms of mathematical models, that is, ‘small worlds’ within which semantic representations of natural language expressions are evaluated. Both approaches exemplify research programmes that target distinct levels of meaning: this has recently been discussed in terms of Marrian (Marr, 1982) implementation versus computation (resp. neural activity vs. behaviour; Krakauer et al., 2017), and in terms of cognitive architectures complementing algorithmic representational models (Cooper & Peebles, 2015), among others. With regard to language, there has been a long-standing collaboration: answers to What? questions are provided by formal grammar and theoretical linguistics, while How? questions are addressed in psycholinguistics. Yet, this cooperation cooled down for a while (Ferreira, 2005). There are several reasons for the disenchantment. With regard to meaning proper – that is, semantics – we think that theoretical linguistics ‘underaccomplishes’ the obligation to provide cognitively potent models of meaning, given mainstream formal semantics’ sentence-oriented approach. The reason is this: consider a toy world that consists of three individuals, $a$ (Aydın), $n$ (Nuria), and $x$ (Xinying). A mainstream model-theoretic approach to semantics maps natural language expressions onto terms of a formal language (mostly predicate logic), which in turn are interpreted in terms of the individuals of a world (denotation or reference). The meaning of a one-place predicate like sleep, for instance, is the set assigned to the formal translation $\mathit{sleep}'$ of the verb, and in our toy model (let us assume) is $\{a, x\}$ (i.e., Aydın and Xinying sleep). The meaning of the sentence Aydın sleeps is compositionally derived as $\mathit{sleep}'(a)$ and is true iff (abbreviating if and only if) $a \in \{a, x\}$. However, the formulae used in traditional formal semantics (e.g., $\mathit{sleep}'(a)$) are dispensable: they eventually get reduced to the basic notions of truth and reference ($\mathit{sleep}'(a)$, e.g., is true in our toy model) and therefore have no cognitive bearing. Hence, while being formally precise, it is unclear whether an approach of this kind succeeds in ‘formulat[ing] the computational problem under consideration’, as Cooper and Peebles (2015, p. 245) put it.
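To make the toy model tangible, here is a minimal sketch in Python of the model-theoretic evaluation just described; the data structures are our own illustrative choices, not part of any formal-semantics toolkit.

```python
# A minimal sketch of the toy model described above (illustrative only).
# The domain contains three individuals; the denotation of 'sleep' is a set.
domain = {"a", "n", "x"}              # a = Aydın, n = Nuria, x = Xinying
denotation = {"sleep": {"a", "x"}}    # (let us assume) Aydın and Xinying sleep

def evaluate(predicate: str, individual: str) -> bool:
    """Truth of predicate(individual): true iff the individual belongs to
    the set the model assigns to the predicate."""
    return individual in denotation[predicate]

print(evaluate("sleep", "a"))  # True: sleep'(a) holds in the toy model
print(evaluate("sleep", "n"))  # False: Nuria is not in the sleep set
```

Note that the evaluation bottoms out in set membership alone – precisely the reduction to truth and reference whose lack of cognitive bearing the text criticises.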

Nonetheless, over the past 30 years, theoretical linguistics has developed a different sort of formal model of meaning, namely dynamic update semantics – most notably Discourse Representation Theory (Kamp & Reyle, 1993) – where the construction of semantic representations is constitutive of meaning (Kamp, 1979, p. 409) and has cognitive (Hamm et al., 2006) and neuroscientific (Brogaard, 2019) interpretations (see also Garnham, 2010). The sentence Aydın sleeps is processed within a dynamic update semantics in such a way that a file (Footnote 1) for an object $x$ (due to the proper name Aydın) is opened (if new) or continued (if known). We emphasise this detail since it reveals a dynamic shift in the notion of meaning: the meaning of an utterance updates a previous context and returns an updated context. Hence, reducing the meaning of an assertive sentence to truth and reference is replaced (or at least complemented) by its context change potential. The (new or continued) file is then populated with conditions that $x$ is named Aydın (if not already known) and that $x$ is sleeping (Footnote 2). A dynamic update semantics rooted in spoken language – also known as dialogue semantics – is KoS (Ginzburg, 1994, 2012). KoS [not an acronym] is formulated by means of types from a Type Theory with Records (TTR; Cooper, 2013; Cooper & Ginzburg, 2015) instead of terms and expressions from an interpreted language like predicate logic. There is a straightforward model-theoretic, denotational construal of types much in the spirit of classical formal semantics (Cooper, n.d.), but one can also think of types as symbolic but embodied structures which are rooted in perception (as Cooper, n.d., points out), which label instances of linguistic processing (Connell, 2019; Frankland & Greene, 2020), and which are associated with motor and perception activation (Bickhard, 2008; Hummel, 2011; Meteyard et al., 2012). Indeed, types can also be construed neurally (Cooper, 2019).
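The contrast with the static toy model can be made concrete with a small sketch of file-based context change; the representation of ‘files’ as a dictionary and the condition strings are our own illustrative assumptions.

```python
# An illustrative sketch of a dynamic, file-based update: a context is a
# store of 'files' (discourse referents with conditions); the meaning of an
# utterance is the mapping from input context to output context.
def update(ctx: dict, referent: str, conditions: set) -> dict:
    """Return the updated context: open a file for the referent if new,
    otherwise continue the existing file; then add the conditions."""
    new_ctx = {ref: conds.copy() for ref, conds in ctx.items()}
    new_ctx.setdefault(referent, set()).update(conditions)
    return new_ctx

context = {}  # empty input context
# Processing "Aydın sleeps": open file x1, add naming and sleeping conditions.
context = update(context, "x1", {"named(x1, Aydın)", "sleep(x1)"})
print(context)  # {'x1': {'named(x1, Aydın)', 'sleep(x1)'}} (order may vary)
```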

Why promote dialogue semantics, and to a cognitive science audience? Cognitive science has come to acknowledge that multimodal interaction is the ‘central ecological niche’ of sentence processing (Holler & Levinson, 2019, p. 639). The dominant view on interaction and coordination in cognitive science is a systemic view: interlocutors are observed and construed as a complex system – there is work on systemic coupling on neural, behavioural and attentional, goal-predicting levels (Fusaroli et al., 2014; Hasson et al., 2012; Pickering & Garrod, 2004, 2013; Sebanz & Knoblich, 2009).

However, while a systemic view certainly provides important insights into the neural and cognitive underpinnings of alignment and communication within its ecological niche, we argue that significant lacunae remain unless this is complemented by analyses of the verbal and nonverbal signals (and their interactions) that constitute multimodal communication: the multimodal, interactive turn in cognitive science induces a renewed need for a precise formulation of the computational What? problem. Simplifying to a necessary degree, Fig. 1 summarises the semantic position within the multimodal discourse landscape. We focus on contents (cont) here and demonstrate throughout how contents depend to a very large extent on a fine-grained structured context (ctxt).

Fig. 1. A dialogue-semantics perspective for completing the systemic understanding of multimodal discourse.

In particular, we argue that a dialogue-semantics perspective makes at least three crucial contributions:

  • Dialogue semantics provides a formal notion of content that is needed in order to define different kinds of cross-modal signals. From gesture studies, we have the notion of multimodal ensembles (Kendon, 2004) – utterances including speech–gesture composites – and from psycholinguistics, multimodal gestalts (Holler & Levinson, 2019) – recurrent, statistically significant multimodal actions, signals or features which are interlinked by a (common) communicative intent or meaning (Footnote 3). However, recurrent ensembles or gestalts often occur with a simplification in form (Lücking et al., 2008). This raises the issue of how formally different gestalt or ensemble tokens are assigned to a common type instead of to different ones. Moreover, how does one account for communicative signals, features or utterances which do not belong to a unique ensemble or gestalt? We argue, mainly on the basis of data from head shaking (Sections 2.1 and 4.2), that an explicit semantic analysis is needed to provide the required identity conditions and, among other things, to tell apart multimodal behaviour that, with respect to its perceptual forms, deceptively looks like a unified composite utterance.

  • In line with research on attentional mechanisms (Mundy & Newell, 2007), we discuss (non-)attending to interlocutors as new attentional data and argue that it can be used to explain – as far as we know – hitherto unstudied occurrences of specific types of other-repair in discourse targeting the speaker and hearer roles.

  • Timing and coherence within multimodal interaction are a subject sui generis for both cognitive science and dialogue semantics: dialogue agents are co-active during more or less the whole time of interaction – see also the analysis of Mondada (2016) (Footnote 4). Accordingly, the notion of turn should be replaced by the notion of leading voice. Moreover, this applies even to spoken contributions, where despite the entrenched assumption of one speaker per turn, assumed in Conversation Analysis to be one of the essential and universal structuring notions of conversation (Levinson & Torreira, 2015), overlap is in certain situations, and with inter-subject variation, an acceptable option (Bennett, 1978; Falk, 1980; Hilton, 2018; Tannen, 1984; Yuan et al., 2006). In this respect, multimodal interaction is akin to a polyphonic musical piece (Footnote 5). Just as polyphonic music is organised by harmonic or contrapuntal composition techniques, polyphonic interaction is driven partly by dialogical relevance or coherence (Footnote 6). Note that the terms ‘leading’ and ‘accompanying voices’ also give rise to a subjective interpretation: a speaker or gesturer may have the impression of holding the leading voice in a conversation regardless of observational evidence.

Section 2 illustrates the above-mentioned challenges posed by multimodal phenomena. Section 3 then sketches a formal theory of multimodal interaction that involves: (i) semantic representations which can (and should) be construed as cognitive information state representations, (ii) partiturs (multimodal input representations) and (iii) lexical entries and conversational update rules that capture dialogical relevance, enabling incremental and predictive processing. The machinery is applied to analyse the sample observations in Section 4. The formal theory may appear complex to those exposed to it for the first time or who do not endorse formal approaches, but its expressive granularity has been developed in light of many diverse dialogical phenomena, as explained in Section 3.2. In particular, it facilitates formulating our ultimate upshot in Section 5 in an explicit way: our claim, the multimodal serialisation hypothesis, is that vertical relevance – relevance of utterances occurring simultaneously – supervenes (Footnote 7) on horizontal relevance – relevance of utterances succeeding each other temporally. Hence, multimodality compresses interaction temporally, but is not richer in terms of semantic expressivity. In other words, and with certain caveats we will spell out, simultaneous interaction, though more efficient and perhaps more emotionally engaging and aesthetically pleasing, can always be serialised without loss of semantic information. This is a rather strong claim, and it needs to be refined right away. On the one hand, there are multimodal signals which simply cannot be separated – for instance, you cannot separate spoken utterances from their intonation: they are coarticulated. This is, however, not just due to a common channel: speech–laughter is transmitted via the acoustic channel but can be separated into speech and laughter. On the other hand, serialising multimodal input gives rise to different possible orderings. We do not claim that every ordering of the elements of multimodal input, when put into a sequence, is equivalent – quite the contrary: we provide evidence for the opposite below. But in accordance with the claim, one of the possible orderings is semantically equivalent to the original multimodal input. Simultaneity and sequentiality in multimodal interaction can become manifest in two ways: (i) across interlocutors and (ii) within one interlocutor. The multimodal serialisation hypothesis intentionally generalises over both manifestations (in fact, the empirical phenomena discussed in the following involve both kinds). Given these qualifications, the expressivity claim is a hypothesis that has to be explored in multimodal communication studies by cognitive science, theoretical linguistics, gesture studies and related disciplines.

2. Observations

2.1. Head shake

Eight uses of the head shake are documented by Kendon (2002). The most well-known (Kendon’s use I) is a non-verbal expression of the answer particle ‘No’. Thus, a head shake can be used in order to answer a polar question:

Depending on whether A produced a negative or a positive propositional kernel in the question, B’s head shake is either a denial of the positive proposition or a confirmation of the negative one (the latter is not discussed by Kendon, 2002). In uses such as those documented in (1), the head shake conveys a proposition. However, the proposition expressed by the head shake is in part determined by the context in which it occurs – in (1), it can be one of two contradictory dialogue moves: a denial or a confirmation. Hence, what is needed for instances such as (1) is a notion of contextually aware content. We provide such a content in Section 4.2.

Once a context-aware semantic representation of denial is at our disposal, it makes predictions for head shakes in other contexts as well. Consider (2):

While (2a) is fully coherent, (2b) (at least without additional context – examples of which we provide in Section 4.2) has a contradictory flavour: the head’s denial is not matched in speech. Hence, in order to discuss apparently simple uses of head shakes, one already has to draw on a precisely formulated, contextually aware notion of contents.

2.2. Co-activity and communicative breakdown

A well-known pattern of co-activity in spoken discourse is the interplay of monitoring and feedback signals. For instance, backchannelling signals such as nodding or vocalisations such as ‘mhm’ influence the development of discourse (Bavelas & Gerwing, 2011). The absence of monitoring or feedback signals leads to communicative breakdown since it raises the question of whether one is still engaged in the joint interaction. Suppose $A$ and $B$ are sitting on a window seat in a café. $A$ is telling $B$ about a near-accident she witnessed on Main Street the day before. While $A$ has been talking, $B$ has been continuously staring out of the window. Thus, $A$ lacks attentional gaze signals, which in turn raises doubts about $B$’s conversational involvement. Accordingly, $A$ will try to clarify $B$’s addressee role:

$A$’s clarification request or (other-initiated) repair (Footnote 8) is a natural response in the sketched situation since it is triggered by a neurocognitive social attention mechanism (Nummenmaa & Calder, 2009) in response to a violation of a behavioural norm. However, seen from a turn-based view, (3) is not easy to explain: $A$ is speaking and $B$ is listening, so the all-important roles of hearer and speaker are clearly filled – and now it is $B$’s turn. Crucially, (3) could equally be produced by $B$ if s/he gets the impression that $A$ is rambling incoherently.

2.3. Summary

The upshot of the few phenomena we have discussed above is that multimodal interaction is:

  • driven by a richly structured and fine-grained context,

  • which is distinct but aligned across the participants,

  • where the participants typically monitor each other’s co-activity.

In the following, we introduce a theoretical dialogue framework which can capture these observations.

3. Polyphonic interaction: cognitive–formal tools

3.1. Partiturs

A prerequisite for an analysis of multimodal interaction is a systematic means of telling apart the manifold verbal and non-verbal signals. We employ tiers in this respect, where a tier is built following the model of phonetics and phonology. Phonetics comprises the triple of articulatory phonetics, acoustic phonetics and auditory phonetics. Signalling by other communication means can be construed in an analogous way. For instance, the triple facial muscles – facial display – vision defines the tier for facial expression. Tiers give rise to a uniform approach to linguistic analysis which ultimately rests on perceptual classification (cf. Cooper, n.d.), which we formulate in terms of TTR (a Type Theory with Records; Cooper, n.d.; Cooper & Ginzburg, 2015). Classification in TTR is expressed as a judgement: in general, object $o$ is of type $T$. With regard to spoken utterances, a record $r$ (situation) providing a sound event (construed as an Individual) – $r = \left[\begin{array}{ll}\mathrm{s}_{\mathrm{event}} & = \mathrm{a}\\ \mathrm{c} & = \mathrm{s1}\end{array}\right]$ – is correctly classified by a record type $T = \left[\begin{array}{ll}\mathrm{s}_{\mathrm{event}} & : \mathit{Ind}\\ \mathrm{c} & : \mathit{Sign}(\mathrm{s}_{\mathrm{event}})\end{array}\right]$ (i.e., $r : T$) iff the object labelled ‘s$_{\mathrm{event}}$’ (in this case, a soundwave) belongs to the phonological repertoire of the language in question (Footnote 9).
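The shape of such a judgement can be mimicked in a few lines of Python; this is only a simulation of the $r : T$ check under our own assumptions (the classifiers are stubs, not a TTR implementation).

```python
# A schematic rendering of the judgement r : T for the spoken-utterance
# case. Field names follow the text; the type system is only simulated.
record = {"s_event": "soundwave-a", "c": "s1"}

def is_individual(obj) -> bool:
    """Stub for the perceptual classification of the sound event as Ind."""
    return obj is not None

def is_sign(witness, s_event: str) -> bool:
    """Stub for c : Sign(s_event): the sound event belongs to the
    phonological repertoire of the language (here: trivially checked)."""
    return s_event.startswith("soundwave")

def judge(r: dict) -> bool:
    """r : T iff each field of r witnesses the corresponding field type."""
    return is_individual(r["s_event"]) and is_sign(r["c"], r["s_event"])

print(judge(record))  # True: the record is of the record type T
```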

Tiers can be likened to different instruments on a musical score: a partitur (Footnote 10). Building on Cooper (2015), we represent partiturs as strings of multimodal communication events $e$, which are temporally ordered sequences of types. One can think of strings in terms of a flip-book: a dynamic event is cut into slices, and each slice is modelled as a record type. Such string types (Cooper, n.d.; Fernando, 2007) are notated in round brackets and typed in an obvious manner, where RecType is the general type of a record type:

The progressive unfolding of subevents on the various tiers in time gives rise to incremental production and perception. Formally, this is indicated by the Kleene plus (‘$^{+}$’). (4) shows the type of multi-tier signalling, and it remains silent concerning potential inherent rhythms of the individual tiers. In fact, it has been argued that different kinds of gestures exhibit a specific ‘rhythmic pulse’ (Tuite, 1993), as does speech, which leads to tier-specific temporal production cycles that may jointly peak in synchronised intervals (Loehr, 2007). The temporal relationship between signals on different tiers is therefore specified in a relative way, following the example set by the Behaviour Markup Language (Vilhjálmsson et al., 2007). It should be noted that the subevents on partiturs can be made as detailed as needed – from phonetic features to complete sentences or turns. A reasonably fine-grained temporal resolution for partiturs seems to be the level of syllables. Arguably, syllables constitute coherent events, as do tones in a melodic phrase and movement elements in locomotion, to which attentional processes are rhythmically attuned in the sense of Jones and Boltz (1989). See Lücking and Ginzburg (2020) for more details on parsing on partiturs. We will make crucial use of record-type representations along these lines in the following.
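The flip-book picture can be illustrated with a toy partitur; the tier names and events below are invented for illustration and do not reproduce the string types of (4).

```python
# An illustrative partitur: temporally ordered slices ('flip-book' frames),
# each slice recording what happens on each tier at that step.
partitur = [
    {"speech": "do",      "gaze": "at-addressee", "head": None},
    {"speech": "you",     "gaze": "at-addressee", "head": "shake-onset"},
    {"speech": "believe", "gaze": "away",         "head": "shake"},
]

def unfold(partitur):
    """Incrementally consume the string of slices, as a parser would;
    the Kleene plus corresponds to 'one or more slices'."""
    for t, slice_ in enumerate(partitur):
        active = {tier: ev for tier, ev in slice_.items() if ev}
        yield t, active

for t, active in unfold(partitur):
    print(t, active)  # per-step view of co-occurring tier events
```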

3.2. Cognitive states in dialogue semantics

We model cognitive states by means of dialogue agent-specific Total Cognitive States (TCS) of KoS (Ginzburg, 1994, 2012; Larsson, 2002; Purver, 2006). A TCS has two partitions, namely a private and a public one. A TCS is formally represented in (5). In a dialogue between A and B, there are both A.TCS and B.TCS (Footnote 11).

(The symbol ‘$:=$’ indicates a definition relation.)

Now, trivially, communication events take place in some context. The simplest model of context, going back to Montague (1974), is one which specifies the existence of a speaker addressing an addressee at a particular time. This can be captured in terms of the type in (6), which classifies situations (records) that involve the obvious entities and actions.
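The display (6) did not survive extraction in this version; a plausible reconstruction, based on the standard KoS formulation (Ginzburg, 2012) and therefore an assumption on our part as to the exact field names, is:

```latex
% Reconstruction of the basic context type (6); field names assumed.
\[
\left[\begin{array}{ll}
  \mathrm{spkr}            & : \mathit{Ind}\\
  \mathrm{addr}            & : \mathit{Ind}\\
  \mathrm{utt\text{-}time} & : \mathit{Time}\\
  \mathrm{c_{utt}}         & : \mathrm{addressing}(\mathrm{spkr}, \mathrm{addr}, \mathrm{utt\text{-}time})
\end{array}\right]
\]
```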

However, over the last four decades it has become clearer how much more pervasive reference to context in interaction is. Indeed, arguably, this traditional formulation gets things backwards in that it seems to imply that ‘context’ is some distinct component one refers to. In fact, as will become clear, following Barwise and Perry (1983), we take utterances – multimodal events – to be the basic units interlocutors assign contents to, given their current cognitive states, and from these generalise to obtain utterance types, the meanings/characters semanticists postulate.

The visual situation is a key component in interaction from birth (see Tomasello, 1999, Ch. 3) (Footnote 12). Expectations arise due to illocutionary acts – one act (querying, assertion and greeting) giving rise to anticipation of an appropriate response (answer, acceptance and counter-greeting), also known as adjacency pairs (Schegloff, 2007). Extended interaction gives rise to shared assumptions or presuppositions (Stalnaker, 1978), whereas uncertainties about mutual understanding that remain to be resolved across participants – questions under discussion – are a key notion in explaining coherence and various anaphoric processes (Ginzburg, 1994, 2012; Roberts, 1996). These considerations, among several additional significant ones, lead to positing a significantly richer structure to represent each participant’s view of publicised context, the dialogue gameboard (DGB), whose basic make-up is given in (7):
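The display (7) is likewise missing from this version; the following reconstruction collects the fields described in the next paragraph, with the exact field types being our assumption:

```latex
% Reconstruction of the DGB type (7) from the fields described in the
% text; details of the field types are assumed.
\[
\mathit{DGBType} :=
\left[\begin{array}{ll}
  \mathrm{spkr}            & : \mathit{Ind}\\
  \mathrm{addr}            & : \mathit{Ind}\\
  \mathrm{utt\text{-}time} & : \mathit{Time}\\
  \mathrm{facts}           & : \mathit{Set}(\mathit{Prop})\\
  \mathrm{vis\text{-}sit}  & : \left[\,\mathrm{foa} : \mathit{Ind} \vee \mathit{Rec}\,\right]\\
  \mathrm{pending}         & : \mathit{list}(\mathit{LocProp})\\
  \mathrm{moves}           & : \mathit{list}(\mathit{IllocProp})\\
  \mathrm{qud}             & : \mathit{poset}(\mathit{Question})\\
  \mathrm{mood}            & : \mathit{Appraisal}
\end{array}\right]
\]
```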

It should be emphasised (again) that there is not a single DGB covering a dialogical episode, but a DGB for each participant. Participants’ DGBs are usually coupled, that is, they develop in parallel. Participant-specific DGBs, however, allow one to incorporate misunderstandings, negotiation, coordination and the like in a straightforward manner in KoS. In any case, facts represents the shared assumptions of the interlocutors – identified with a set of propositions. In line with TTR’s general conception of (linguistic) classification as type assignment – record types regiment records – propositions are construed as typing relationships between records (situations) and record types (situation types), that is, as Austinian propositions (Austin, 1950; Barwise & Etchemendy, 1987). More formally, propositions are records of type $\left[\begin{array}{ll}\mathrm{sit} & : \mathit{Rec}\\ \mathrm{sit\text{-}type} & : \mathit{RecType}\end{array}\right]$ (Footnote 13). The ontology of dialogue (Ginzburg, 2012) distinguishes two special sorts of Austinian proposition: grammar types classifying phonetic events (Loc(utionary)Prop(ositions)) and speech acts classifying utterances (Illoc(utionary)Prop(ositions)). Both types are part and parcel of locutionary and illocutionary interaction: dialogue moves that are in the process of being grounded or under clarification are the elements of the pending list; already grounded moves (roughly, moves that are not contested, or agreed-upon moves) are moved to the moves list. Within moves, the first element has a special status given its use to capture adjacency pair coherence, and it is referred to as LatestMove. The current question under discussion is tracked in the qud field, whose data type is a partially ordered set (poset). Vis-sit represents the visual situation of an agent, including his or her visual focus of attention (foa), which, if any (attention may be directed towards something non-visual, even non-perceptual (Footnote 14)), can be an object (Ind), or a situation or event (which in TTR are modelled as records, i.e., entities of type Rec). Mood tracks a participant’s public displays of emotion (i.e., externally observable appraisal indicators such as intonation or facial expressions, which often do but need not coincide with the participant’s internal emotional state), crucial for, inter alia, laughter, smiling and sighing (Ginzburg et al., 2020b), and, as we shall see, head shaking as well. The DGB structure in (7) might seem like an overly rich notion for interlocutors to keep track of. Ginzburg and Lücking (2020) show how the DGB type can be recast as a Baddeley-style (Baddeley, 2012) multicomponent working memory model interfacing with long-term memory.

Given that our signs (lexical entries/phrasal rules) are construed as types for interaction, they refer directly to the DGB via the field dgb-params. For instance, the linguistic meaning of the head shake from (1) in Section 2.1 patterns with the lexical entry for ‘No’ when used as an answer particle to a polar question (a.k.a. a ‘yes–no’ question) and, following Tian and Ginzburg (2016), is given in (8).

When used in the context of a polar question with content $p$ (the current question under discussion – MaxQUD – is $p$?), saying ‘No’ and/or shaking the head asserts a ‘No semantics’ applied to $p$. NoSem($p$) in turn is sensitive to the polarity of the proposition to which it applies (cf. the discussion of the head shake in Section 2.1). To this end, positive (PosProp) and negative (NegProp) propositions have to be distinguished. If a negative particle (not, no, n’t, never, nothing) is part of the constituents of a proposition $\neg p$, then $\neg p$ is of type NegProp ($\neg p$ : NegProp). The corresponding positive proposition, the one with the negative particle removed, is $p$ ($p$ : PosProp). With this distinction at hand, NoSem works as follows:

(Note that the result of ‘NoSem($p$)’ is always of type NegProp: $p$ : NegProp means that $p = \neg q$, which NoSem leaves unchanged according to the second condition in (9).) (8) and (9) provide a precise characterisation of answer-particle uses of negation and head shake and therefore make testable predictions concerning meaning in context.
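The polarity sensitivity of NoSem can be captured in a few lines; the sketch below simulates propositions as strings with a leading ‘¬’ marking NegProp, which is our own simplification of the type-theoretic formulation in (9).

```python
# A sketch of NoSem as described in the text: sensitive to the polarity
# of the proposition it applies to. Propositions are simulated as strings;
# a leading '¬' marks a NegProp.
def is_negative(p: str) -> bool:
    return p.startswith("¬")

def no_sem(p: str) -> str:
    """NoSem(p): negate a positive proposition, leave a negative one
    unchanged. The result is always of type NegProp."""
    if is_negative(p):
        return p            # confirmation of the negative proposition
    return "¬" + p          # denial of the positive proposition

print(no_sem("believe(A,B)"))   # ¬believe(A,B): denial
print(no_sem("¬believe(A,B)"))  # ¬believe(A,B): confirmation
```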

The evolution of context in interaction is described in terms of conversational rules, mappings between two cognitive states, the precond(ition)s and the effects. Two rules are given in (10): a DGB that satisfies preconds can be updated by effects.
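The rules in (10) are not reproduced in this version of the text. As a schematic rendering of the precondition/effect format, the sketch below encodes simplified stand-ins for two KoS rules named later in the paper (Assert QUD-incrementation and QSPEC); their exact formulation is in Ginzburg (2012), and the dictionary-based DGB is our own assumption.

```python
# Conversational rules as precondition/effect mappings over (simplified)
# DGBs. Both rule bodies are illustrative stand-ins, not KoS definitions.
def assert_qud_incrementation(dgb: dict) -> dict:
    """Preconds: LatestMove is an assertion of p.
    Effects: p? becomes the maximal question under discussion."""
    move = dgb["latest_move"]
    assert move["type"] == "assert"            # precondition check
    new = dict(dgb)
    new["qud"] = [f"{move['content']}?"] + dgb["qud"]
    return new

def qspec(dgb: dict, utterance: str) -> dict:
    """Preconds: QUD is non-empty. Effects: the next move is specific
    to MaxQUD (e.g., a (partial) answer or sub-question)."""
    assert dgb["qud"], "no question under discussion"
    new = dict(dgb)
    new["latest_move"] = {"type": "reply", "content": utterance,
                          "addresses": dgb["qud"][0]}
    return new

dgb = {"latest_move": {"type": "assert", "content": "p"}, "qud": []}
dgb = assert_qud_incrementation(dgb)   # now MaxQUD = p?
dgb = qspec(dgb, "no")                 # a p?-specific response
print(dgb["qud"], dgb["latest_move"])
```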

Within the dialogue update model of KoS, following Ginzburg et al. (2020a), QUD gets modified incrementally, that is, at a word-by-word latency (or even faster) (Footnote 15). Technically, this can be implemented by adopting the predictive principle of incremental interpretation in (11) on top of partitur parsing (see Section 3.1). This says that if one projects that the currently pending utterance (the preconditions in (11)) will continue in a certain way (pending.sit-type.proj in (11)), then one can actually use this prediction to update one’s DGB, concretely to update LatestMove with the projected move; this will, in turn, by application of the existing conversational rules, trigger an update of QUD (Footnote 16):

We will make use of utterance projection in analysing head shakes synchronous with speech in Section 4.2 and in Section 5 when explicating vertical relevance. Such projective rules implement predictive processing in interactions and therefore provide a computational underpinning of a central cognitive mechanism (Litwin & Miłkowski, 2020).

4. Polyphonic interaction: cognitive–formal analyses

The formal tools from Section 3 are used to provide precise analyses of the observations from Section 2: attention and communicative breakdown (Section 4.1) and the semantics of head shake (Section 4.2).

4.1. Conversational engagement

In two-person conversation, the values of spkr and addr of a DGB are rarely in question, apart from initially (Who is speaking? Are you addressing me?), but the need to verify that the addressing condition holds is what we take to drive attention monitoring. We conceive of the two states of being engaged or disengaged in conversation as two hypotheses in a probabilistic Bayesian framework. Relevant data for the (dis-)engagement hypotheses can be found in gaze, which is an excellent predictor of conversational attention (Nummenmaa & Calder, 2009; Vertegaal et al., 2001). The quoted sources, as well as the discussion in the following, concern unobstructed face-to-face dialogue, that is, dialogue where participants stand or sit opposite each other and can talk freely. The findings and the assumptions derived below do not carry over to ‘obstructed’ discourse situations simpliciter, for instance, when interlocutors are talking while carrying a piano.

Within cognitive DGB modelling, the Vis-Sit field already provides an appropriate data structure for gaze. Mutual gaze can be formulated as a perspectival default condition on partiturs (Footnote 17). Of course, there is no claim that mutual gazing occurs continuously. Indeed, continuous gaze is often viewed as rude or encroaching. In fact, mutual gaze tends to be short, often less than a second (Kendon, 1967).

Gaze is not the only attentional signalling system, however. Dialogue agents regularly provide verbal and non-verbal feedback signals (Bavelas & Gerwing, 2011). Among the verbal reactive tokens (Clancy et al., 1996), the majority are backchannels. As with gaze, a lack of backchannelling will result in communicative breakdown. In sum, there is ample evidence that gazing and backchannelling provide important datapoints for tracking (mutual) attention. We combine both into a probabilistic framework along the following lines:

We assume that gazing provides slightly more attentional evidence than backchannelling, by a proportion of 0.6 to 0.4. We derive the prior probabilities for gaze under $H_1$ from Argyle (1988, p. 159), and the priors for gaze under $H_2$ are stipulated, as are the backchannel probabilities. Furthermore, we assume that engagement is the probabilistic default case of interaction, with a plausibility of 0.8 to 0.2:

If one of the kinds of gaze from $\mathcal{D}$ is observed, the posterior probability can be calculated from the probability tree in (13) by means of a Bayesian update according to Bayes’ theorem ($P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$). Let us illustrate an update triggered by an observation of individual gaze, $D_1$. Compared to the prior probabilities of the engagement and disengagement hypotheses, $D_1$ leads to an increase of the probability of $H_1$ at the expense of $H_2$. The corresponding numerical values are collected in Table 1.

Table 1. Bayesian update table

The change of the posteriors in comparison to the priors shows that the already more probable engagement hypothesis gains further plausibility (increasing from 0.8 to 0.9). Hence, observing individual gaze, $D_1$, supports (the public display of) mutual attention. Bayesian updates apply iteratively: in this way, only a mixture of data observations of different kinds leads to an oscillation of $H_1$ within a certain probability interval. This leads us to a testable hypothesis, namely that the extrema of the oscillation interval constitute thresholds of mutual attention. If the engagement posterior takes a value below the minimum of the interval, it triggers attention clarification: Are you with me? Values that exceed the maximum lead to irritation: Why are you staring at me?
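The arithmetic of this update can be checked with a few lines of Python. The priors follow the text ($P(H_1) = 0.8$, $P(H_2) = 0.2$); the likelihoods of observing individual gaze $D_1$ are illustrative stand-ins, since the probability tree (13) is not reproduced above – they are chosen so that the posterior matches the reported 0.8 → 0.9 shift.

```python
# A worked Bayesian update for the engagement hypothesis. Priors follow
# the text; the D1 likelihoods are assumed values chosen to reproduce the
# reported 0.8 -> 0.9 posterior shift (the paper derives them from (13)).
prior = {"H1_engaged": 0.8, "H2_disengaged": 0.2}
likelihood_D1 = {"H1_engaged": 0.45, "H2_disengaged": 0.2}  # assumed

def bayes_update(prior: dict, likelihood: dict) -> dict:
    """P(H|D) = P(D|H) P(H) / P(D), with P(D) by total probability."""
    p_d = sum(likelihood[h] * prior[h] for h in prior)
    return {h: likelihood[h] * prior[h] / p_d for h in prior}

posterior = bayes_update(prior, likelihood_D1)
print(posterior)  # H1 ≈ 0.9, H2 ≈ 0.1
# Updates apply iteratively: the posterior becomes the next prior.
```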

4.2. Head shake and noetics

In Section 2, the exchange repeated in (14) was introduced as an obstacle for the No-semantics of head shakes introduced in Section 3.

If we make the (rather consensual) assumption that the outcomes of utterances are predicted as soon as possible (see Section 3, in particular example (11)), then an explanation of (14) is straightforward: A’s utterance in (14a) provides a negative proposition, $\neg\mathrm{believe}(A,B)$, which, by NoSem, the head shake affirms. On the other hand, (14b) provides a positive proposition, $\mathrm{believe}(A,B)$, which by the same lexical entry the head shake negates; hence a contradiction ensues.

The contradiction in (14b) can be ameliorated, however:

In this case, one can understand A as verbally expressing his belief in B’s protestation of innocence, whereas the head shake affirms the negative proposition B makes, $\neg\mathrm{stole}(B, 500€)$ (when related to the second sentence uttered by B), or expresses that A is upset about what ‘they’ did (when related to B’s initially uttered sentence). In either case, this requires us to assume that the head shake can be dissociated from speech that is simultaneous with it, an assumption argued for in some detail with respect to speech–laughter by Mazzocconi et al. (2020/22). Such observations are of great importance for a multimodal theory. This is because it has been claimed that multi-tier interpretation is guided by the heuristic ‘if multiple signs occur simultaneously, take them as one’ (Enfield, 2009, p. 9). Such heuristics have to be refined in consideration of the above evidence (Footnote 18).

Examples like the head shake in (15b) – which can be glossed ‘I disapprove of $p$’ – are therefore subsumed under a ‘negative appraisal use’ of negation (Tian & Ginzburg, 2016) by Lücking and Ginzburg (2021), and analysed as a noetic act expressing a speaker’s attitude towards the content of his or her speech via the DGB’s Mood field (Footnote 19). Note, finally, that A’s response in (15) can be serialised as head shake followed by speech (i.e., head shake + ‘I believe you’). However, the sequence ‘I believe you’ + head shake seems a bit odd, illustrating a remark concerning the multimodal serialisation hypothesis we made in Section 1, namely that sequential orderings need not be equivalent. Such temporal effects need to be explored further in future studies.

5. Upshot: from ‘horizontal’ to ‘vertical’ relevance in multimodal dialogue

In uni-modal interaction (best exemplified perhaps by chat conducted sequentially between users across a network), conversation is constrained by relevance or coherence between successive participant moves (and ultimately across longer stretches). For reasons related to our metaphor from musical notation (cf. partiturs), we call this notion horizontal relevance.

Some examples of relevant responses (indicated by ‘✓’) to a query and to an assertion are given in (16a,b), and irrelevant ones (indicated by ‘#’) to both in (16c).

For conversation, the query/response relation is the one studied in greatest detail (Berninger & Garvey, 1981; Ginzburg et al., 2022; Stivers & Enfield, 2010). The basic characterisation of this relationship given in Ginzburg et al. (2022) is that the class of responses to a question $q_1$ can be partitioned into three classes.

A formal account of horizontal relevance in terms of conversational rules is given in Ginzburg (2012, Sects. 4.4.5 and 6.7.1). The basic idea is that an utterance $u$ is relevant in the current context iff $u$ can be integrated as the (situational component of the) LatestMove via some conversational rule.

But how does the sequential notion of horizontal relevance relate to simultaneous interaction on partiturs, that is, to vertical relevance (to stick to the basic metaphor)? We believe that vertical relevance is supervenient on horizontal relevance. To the best of our knowledge, a careful study, either experimental or corpus-based, of vertical dialogical relevance has yet to be undertaken, apart from one subclass of cases involving speech, known as overlaps and interruptions, to which we return in our discussion below. We offer an initial, partial and impressionistic characterisation of the notion of vertical relevance in Table 2.

Table 2. Vertical relevance: possible content relationships between overlapping utterances across two speakers

Table 2 offers a selection of signals/contents that a non-leading voice $B$ can express simultaneously relative to a leading voice $A$ (speaking in terms of turn-replacements, not in terms of subjectively assumed importance; cf. Section 1). Note that two cases can be distinguished. The first case involves a single speaker, for whom certain signals from the multimodal utterance may take the leading voice over others. A natural leading voice is speech (de Ruiter, 2004). Co-leading or accompanying roles of non-verbal signals can be assigned in relation to speech. In this respect, at-issue ($\approx$ co-leading) and non-at-issue ($\approx$ accompanying) uses of co-verbal manual gestures have been distinguished (Ebert, 2014).

The second case concerns the distribution of voices among several interlocutors. Inhabiting a leading or an accompanying role is rooted in processes of utterance projection (11) and incremental QUD construction, as we discuss in more formal detail below. We assume that the interlocutor who is responsible for publicly constructing the initial QUD – a process which (by the first case above) can be multimodal or even nonverbal itself – has/is the leading voice. We think that the classic notion of turn holder dissolves into the notion of leading voice. Accompanying voices are characterised by monitoring the incremental QUD construction and commenting on it – in ways exemplified in Table 2. In the most trivial case, this consists in providing backchannelling, but it may also involve the joint production of an utterance (in which case, it could be argued that the accompanying voice becomes a co-leading voice).

The final class we mention is one that has been, in certain respects, much studied, namely simultaneous speech. This is a somewhat controversial area because, whereas the ‘normativity’ of one speaker using speech and another producing a non-verbal signal is not in question, the normativity of the corresponding case where both participants use speech very much is. This is so given the notion of turn and the rule-based system which interlocutors are postulated to follow in the highly influential account of Sacks et al. (1974). This system is based on the assumption that, normatively, at any given time there should be a single speaker; deviations are ‘performance errors’, either unintentional overlaps or one interlocutor interrupting, attempting to gain the floor. The set-up we have provided does not predict any sharp contrast between non-speech/speech overlap and speech/speech overlap, although this could in principle be enforced by introducing conversational rules privileging the speech tier. Nonetheless, we do not think such a strategy is promising. Rather, there are other explanatory factors which conspire to suppress pervasive overlap. In a study of the multilingual CallHome corpus, Yuan et al. (2007) note that overlapping varies across languages, with significantly more (non-backchannel) overlaps in Japanese than in the other languages they study (Arabic, English, German, Mandarin and Spanish); they also find that both males and females make more overlaps when talking to females than to males, and similarly find more overlaps in talk among familiars than among strangers. Tannen (1984) argues for the existence of distinct conversational styles, including a high-involvement style that favours a fast delivery pace, cooperative overlaps and minimal gaps, contrasting with an opposed high-considerateness style. Hilton (2018) conducted a study which found statistically significant correlations between a subject’s conversational style preference and their assessment of the acceptability of overlaps. All this argues against viewing avoidance of overlap as a fundamental, systematic organising principle.

Can we say anything systematic, based on subject matter, about cases where overlap seems to be acceptable? There is no dearth of evidence for such cases, going back to Bennett (1978), Falk (1980), Goodwin and Goodwin (1992) and indeed Schegloff (2000), who, while defending the basic intuition underlying Sacks et al. (1974), lists various cases of acceptable overlaps. We mention several subclasses: the first involves what we dub, following Goodwin and Goodwin (1992), shared situation assessments. Examples of this are given in (18a,b); in all these cases, a single situation is being described. A second class, noted by Schegloff (2000), is symmetric moves like greetings, partings and congratulations (“we won!” “Yay!” etc.). A third class is exemplified by the attested (18c) – cases where the same question is being addressed; additional instances of this, noted by Schegloff (2000), are utterances involving self-addressed questions (Tian et al., 2017) and ‘split utterances’ – utterances started by A and completed by B (Goodwin, 1979; Gregoromichelaki et al., 2011; Lerner, 1988; Poesio & Rieser, 2010).

Our assumption throughout has been that vertical relevance supervenes on horizontal relevance – what we labelled earlier the multimodal serialisation hypothesis. We adopt this assumption since, at least on the basis of Table 2, all polyphonic utterances seem to have sequential manifestations which give rise to equivalent contents; such cases, nonetheless, do lead to distinct DGBs since the partiturs in the two cases are distinct. On the other hand, we believe that there exist sequential adjacency pairs that do not have polyphonic manifestations which give rise to equivalent contents: turn-assigning moves, such as those arising by using the assignee’s name or via gaze, do not have a polyphonic equivalent.

Assuming supervenience to hold, we derive vertical relevance from conversational rules by applying incrementalisation – in other words, given two conversational rules CR$_1$ and CR$_2$ that can apply in sequence, where $A$ holds the turn as a consequence of CR$_1$ and this is exchanged in CR$_2$, if by means of incremental interpretation B finds herself in a DGB to which CR$_2$ is applicable before the move effected by CR$_1$ is complete, an overlap arises. To make this concrete, A asserting $p$ and B discussing whether $p$ is the case can be explicated in terms of the sequence of Assert QUD-incrementation and QSPEC (see (10)). Incrementalising this involves B using Assert QUD-incrementation before A has completed their utterance, which then satisfies the preconditions of QSPEC. In such a case, as discussed above, A is the ‘leading voice’ and B is an ‘accompanying voice’.
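The derivation of overlap via incrementalisation can be sketched as follows; the rule bodies are simplified stand-ins (cf. the sketch in Section 3.2), and the example utterance is invented.

```python
# A sketch of how incrementalising conversational rules yields overlap:
# B projects the completion of A's in-progress assertion, increments her
# QUD early, and can respond before A's move is complete.
def project_completion(heard: str, predicted: str) -> str:
    """Utterance projection (cf. (11)): the predicted content of the
    pending utterance, usable for an early DGB update."""
    return heard + predicted

def assert_qud_incrementation(qud: list, p: str) -> list:
    """CR1 (stand-in): an asserted p pushes p? onto QUD."""
    return [p + "?"] + qud

def qspec_applicable(qud: list) -> bool:
    """CR2 (QSPEC) preconditions: some question is under discussion."""
    return len(qud) > 0

qud: list = []
p = project_completion("the keys are ", "on the table")  # mid-utterance
qud = assert_qud_incrementation(qud, p)   # B updates before A finishes
if qspec_applicable(qud):
    print("B may respond now -> overlap with A's ongoing utterance")
```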

All this means that, to the extent that the conversational rules underlying horizontal relevance ensure the coherence of dialogue, the same applies to dialogue with polyphonic utterances. Given this, incrementalising conversational rules provides a detailed model for coherence-driven, predictive processing in natural language interaction. In particular, it makes the testable prediction that accompanying behaviour commenting on a leading voice (examples of which are collected in Table 2) is expected to occur before the leading voice has finished its contribution on its own.

6. Conclusions

We have outlined a unified framework for describing multimodal dialogical interaction. We have shown how an existing dialogue framework, KoS – which provides richly structured cognitive states and conversational rules – needs only minor adjustments, along with (i) partiturs, that is, representations of multimodal events, and (ii) an incremental semantic framework, to analyse multimodal phenomena.

  • We demonstrate the existence of noetic head shakes whose contents are dissociated from simultaneous speech. Such dissociation has been demonstrated in previous work for laughter.

  • We offer a testable, quantitative account of mutual gaze repair and backchannelling driven by monitoring of participant roles – not enough leading to clarification requests, too much leading to complaints.

  • We have argued that ‘no overlap’ is not a defensible norm in multimodal interaction, including in cases where the two tiers involve speech. The intrinsically sequential notion of turn should be replaced by a notion such as leading/accompanying voice, which is driven by vertical coherence.

On the more basic level of theory design, the observations we discussed all exemplify the need for analytic semantic tools within the systemic landscape of cognitive science. We argued that a dynamic dialogue semantics incarnates a cognitively potent, formally precise linguistic framework for fertilising cross-talk between the disciplines.

As is frequently pointed out but cannot be overemphasized, an important goal of formalization in linguistics is to enable subsequent researchers to see the defects of an analysis as clearly as its merits; only then can progress be made efficiently. (Dowty, 1979, p. 322)

The issues of timing and coherence as captured in terms such as leading voice and vertical relevance have been identified as specific topics within multimodal dialogue semantics.

Acknowledgements

We wish to thank Judith Holler, two anonymous reviewers for Language and Cognition, Robin Cooper, Mark Liberman, Chiara Mazzocconi, and Hannes Rieser, for comments on earlier versions of this paper. Portions of this paper have been presented at the 2021 Dialogue, Memory, and Emotion workshop in Paris, at seminars in Bochum, Saarbrücken, at the Padova Summer School on Innovative Tools in the Study of Language, and at the 2022 ESSLLI summer school in Galway. We also wish to thank audiences there for their comments.

Funding statement

This work is supported by a public grant overseen by the French National Research Agency (ANR) as part of the programme ‘Investissements d’Avenir’ (reference: ANR-10-LABX-0083). It contributes to the IdEx Université Paris Cité – ANR-18-IDEX-0001.

Footnotes

1 The metaphor of files and file changing is due to Heim (1982); in cognitive science, the closely related notion of mental files is used (Perner et al., 2015) – see also their re-emergence in the philosophy of mind (Recanati, 2012).

2 This is the minimal information that is received from the sentence. One can also add that Aydın is very likely human, since it is a common first or family name, and, in recent memory-oriented approaches, that the semantic value of the proper name is to be found in long-term memory (Cooper, n.d.; Ginzburg & Lücking, 2020).

3 In fact, there is information-theoretic evidence for such gestalts at least on the level of manual co-speech gestures (Mehler & Lücking, 2012). The notion of ‘local gestalts’ used by Mondada (2014) seems to be a generalisation of the notion of ensembles, but to lack the statistical import gained from recurrence.

4 A more conservative view seems to be embraced by Streeck and Hartge (1992), who analyse mid-turn gestures as ‘contextualis[ing] “next speech units”’, including a preparation of potential transition places (p. 137). This view is reinforced in Streeck (2009, Ch. 8).

5 Thinking of conversational interaction in musical terms has been proposed by Thompson (1993), whereas Clark (1996, Ch. 2, p. 50) mentions string quartets as a ‘mostly nonlinguistic joint activity’. In fact, string quartets were originally inspired by the eighteenth-century French salon tradition (Hanning, 1989). Duranti’s (1997) paper documents what he calls ‘polyphony’ (‘normative overlap’) in Samoan ceremonial greetings. Based on a convergent effect of joint musical improvisation on the alignment of body movements and periodicity across speech turns, it has recently been argued that both music and linguistic interaction belong to a common human communicative facility (Daltrozzo & Schön, 2009; Robledo et al., 2021). However, despite the fact that we use the term leading voice in the very title, we use it here solely as a metaphor for depicting the structure of multimodal communication. In particular, we do not derive strong implications for the organisation of dialogue (or music) from it; in fact, other comparisons, such as contrapuntal structure, serve similar purposes, as we discuss below.

6 These two terms are frequently used interchangeably; we use the former for consistency with earlier work in the framework utilised in this paper, KoS. Coherence has been emphasised as a fundamental principle of the alignment of manual co-speech gesture and speech by Lascarides and Stone (2009).

7 Supervenience is a non-reductionist but asymmetric mode of dependence (see, e.g., Kim, 1984), which, with respect to the multimodal serialisation hypothesis, can be paraphrased as follows: any difference in the set of properties of vertical coherence is accounted for by some difference in the set of properties of horizontal coherence, but not the other way round. In this sense, vertical relevance depends on horizontal relevance but is not ontologically reduced to it.

8 We assume these two latter terms are synonymous; the former is often used in the dialogue community, the latter among Conversation Analysis researchers.

9 Sign is modelled in terms of the phonology–syntax–semantics structures developed in Head-Driven Phrase Structure Grammar (Pollard & Sag, 1994). We abstract over a speaker’s knowledge of a language and the language system where it does not do any harm, as an anonymous reviewer of Language and Cognition observed. A speaker who is not aware of a certain word form (sound) will, however, not be able to provide a witness for a sign type containing that form as the value of the phon feature. This, in turn, can trigger clarification interaction.

10 We use the word partitur (and its English plural variant) since in semantics the term score is already taken, due to the work of Lewis (1979).

11 We restrict attention here to two-person dialogue; for discussion of the differences between two-person and multi-party dialogue and how to extend an account of the former to the latter, see Ginzburg (2012, Sect. 8.1).

12 The importance of vision in the establishment of joint attention is affirmed by studies on the development of joint attention in congenitally blind infants (Bigelow, 2003). Blind children must rely on non-visual attention-getting strategies such as hearing or touching. As a consequence, they not only develop joint attention at later stages than sighted children, but also depend on their interlocutors to establish a common focus of attention – at least until the symbolic competence of speech is developed to a sufficient degree (Bigelow, 2003). Furthermore, it has been found in event-related potential studies that congenitally blind subjects (but not sighted ones) recruit posterior cortical areas for the processing of information relevant to an auditory attention task, and in a temporally ordered manner (Liotti et al., 1998). The authors of the study speculate that the observed topographical changes might be due to a ‘reorganisation in primary visual cortex’ (p. 1011). With respect to the Vis-Sit in KoS, this can be seen as evidence that at least some of the visual information is replaced by information from other tiers of the partitur. Hence, a corresponding formal model can in principle be devised, accounting for interactions with congenitally blind interlocutors, an issue brought up by an anonymous reviewer of Language and Cognition.

13 On this view, a proposition $p = \left[\begin{array}{ll}\mathrm{sit} & = \mathrm{s}\\ \mathrm{sit\text{-}type} & = \mathrm{T}\end{array}\right]$ is true iff $s : T$ – the situation $s$ is of the type $T$. Note that an incongruous situation type (inquired about by an anonymous reviewer) lacks any witnessing situations and therefore, in model-theoretic terms, has an ‘empty’ extension.

14 As is arguably the case in remembering and imagination (Irish, Reference Irish and Abraham2020; Werning, Reference Werning2020).

15 Ginzburg et al. (Reference Ginzburg, Cooper, Hough, Schlangen, Abeillé and Bonami2020a) are motivated by data showing that unfinished utterances can trigger updates driving, for example, elliptical phenomena like sluicing: He could bring the ball down, but opts to cushion a header towards … well, who exactly? Nobody there. (From a live match blog)

16 Since there are more and less likely hypotheses concerning the continuation of an ongoing utterance, utterance projection should ultimately be formulated in a probabilistic manner using a probabilistic version of TTR (Cooper et al., Reference Cooper, Dobnik, Larsson and Lappin2015). Instead of a single effect, a range of probabilistically ranked predictions is acknowledged, as is common in statistical natural language parsing (e.g., Demberg et al., Reference Demberg, Keller and Koller2013). Incremental and predictive processing underlies grammatical frameworks such as Dynamic Syntax from the outset (Gregoromichelaki et al., Reference Gregoromichelaki, Cann, Kempson and Goldstein2013; Kempson et al., Reference Kempson, Meyer-Viol and Gabbay2001).
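The following minimal Python sketch (entirely ours: the continuation labels and probability values are invented for illustration and do not implement probabilistic TTR) shows what such a ranked set of predictions might look like for the unfinished utterance cited in note 15:

```python
from dataclasses import dataclass

@dataclass
class ContinuationHypothesis:
    continuation_type: str  # label of the predicted continuation's sign type
    probability: float      # stand-in for p(continuation type | utterance so far)

def rank_predictions(hypotheses):
    """Sort hypotheses by decreasing probability and renormalise them
    so that the estimates form a proper probability distribution."""
    total = sum(h.probability for h in hypotheses)
    ranked = sorted(hypotheses, key=lambda h: h.probability, reverse=True)
    return [ContinuationHypothesis(h.continuation_type, h.probability / total)
            for h in ranked]

# Candidate continuations for '... cushions a header towards ...':
predictions = rank_predictions([
    ContinuationHypothesis("NP[teammate]", 0.60),
    ContinuationHypothesis("NP[goalkeeper]", 0.25),
    ContinuationHypothesis("self-addressed question ('who exactly?')", 0.15),
])
for p in predictions:
    print(f"{p.probability:.2f}  {p.continuation_type}")
```

The point of the sketch is merely structural: projection yields a distribution over continuation types rather than a single predicted effect.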

17 Conditions or rules are perspectival if they are applicable only to particular dialogue participants; see Ginzburg et al. (Reference Ginzburg, Mazzocconi and Tian2020b, Sect. 4.1.2) for a first use of ‘participant sensitive’ conversational rules.

18 As pointed out by an anonymous reviewer, Enfield’s heuristics can be understood more loosely along the lines of ‘if multiple signs occur simultaneously, interpret them in relation to one another’. Since Enfield does not provide a semantics, there remains some leeway for interpretation. The semantic and pragmatic synchrony rules stated by McNeill (Reference McNeill1992) are more explicit in this respect (‘[…] speech and gesture, present the same meanings at the same time’, p. 27; ‘[…] if gestures and speech co-occur they perform the same pragmatic functions’, p. 29).
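Computationally, the loose reading of the heuristic amounts to pairing signals by temporal overlap. The following self-contained Python sketch is our own illustration (tier names and interval endpoints are invented), not an implementation of Enfield’s or McNeill’s proposals:

```python
from dataclasses import dataclass

@dataclass
class Token:
    tier: str      # e.g. 'speech' or 'gesture'
    content: str
    start: float   # onset in seconds
    end: float     # offset in seconds

def overlaps(a: Token, b: Token) -> bool:
    """Two tokens co-occur if their time intervals intersect."""
    return a.start < b.end and b.start < a.end

def co_occurring(speech, gestures):
    """Pair every speech token with every gesture token it overlaps with."""
    return [(s, g) for s in speech for g in gestures if overlaps(s, g)]

speech = [Token("speech", "no", 0.2, 0.5)]
gestures = [Token("gesture", "head shake", 0.1, 0.9)]
for s, g in co_occurring(speech, gestures):
    print(f"interpret '{s.content}' in relation to '{g.content}'")
```

Such overlap-based pairing recovers the synchrony rules only in the weak sense of determining which signals are to be interpreted together; it says nothing about how their contents are combined.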

19 The term ‘noetic’ is inspired by William James (James, Reference James1981, Ch. XXV), who emphasised, for instance, that ‘[i]nstinctive reactions and emotional expressions thus shade imperceptibly into each other’ (p. 1058). In this sense, noetics describes how feelings, sentiments, sensations, memories, emotions and unconscious acts bear on and are transmitted through a feedback loop of thinking and knowledge (Krader, Reference Krader2010). We believe that emphasising the inherent integration of appraisal and content, among other things, is a useful way of conceiving of attitudes in conversations.

20 ‘I would like to think of discourse as not so much an exchange but as a shared world that is built up through various modes of mutual response over the course of time in particular interaction.’ (Bennett, Reference Bennett1978, p. 574).

References

Argyle, M. (1988). Bodily communication (2nd ed.). Routledge.
Austin, J. L. (1950). Truth. Proceedings of the Aristotelian Society, Supplementary Volume XXIV, 111–128. Reprinted in J. L. Austin, Philosophical papers (2nd ed.). Clarendon Press.
Baddeley, A. (2012). Working memory: Theories, models, and controversies. Annual Review of Psychology, 63, 1–29. https://doi.org/10.1146/annurev-psych-120710-100422
Barwise, J., & Etchemendy, J. (1987). The Liar. Oxford University Press.
Barwise, J., & Perry, J. (1983). Situations and attitudes. MIT Press.
Bavelas, J. B., & Gerwing, J. (2011). The listener as addressee in face-to-face dialogue. International Journal of Listening, 25(3), 178–198. https://doi.org/10.1080/10904018.2010.508675
Benitez-Quiroz, C. F., Wilbur, R. B., & Martinez, A. M. (2016). The not face: A grammaticalization of facial expressions of emotion. Cognition, 150, 77–84. https://doi.org/10.1016/j.cognition.2016.02.004
Bennett, A. (1978). Interruptions and the interpretation of conversation. Annual Meeting of the Berkeley Linguistics Society, 4, 557–575.
Berninger, G., & Garvey, C. (1981). Relevant replies to questions: Answers versus evasions. Journal of Psycholinguistic Research, 10(4), 403–420.
Bickhard, M. H. (2008). Is embodiment necessary? In Calvo, P. & Gomila, T. (Eds.), Handbook of cognitive science: An embodied approach (Perspectives on Cognitive Science, chapter 2, pp. 29–40). Elsevier.
Bigelow, A. E. (2003). The development of joint attention in blind infants. Development and Psychopathology, 15(2), 259–275. https://doi.org/10.1017/s0954579403000142
Brogaard, B. (2019). What can neuroscience tell us about reference? In Abbott, B. & Gundel, J. (Eds.), The Oxford handbook of reference (pp. 365–383). Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199687305.013.17
Clancy, P. M., Thompson, S. A., Suzuki, R., & Tao, H. (1996). The conversational use of reactive tokens in English, Japanese, and Mandarin. Journal of Pragmatics, 26(3), 355–387. https://doi.org/10.1016/0378-2166(95)00036-4
Clark, H. (1996). Using language. Cambridge University Press.
Connell, L. (2019). What have labels ever done for us? The linguistic shortcut in conceptual processing. Language, Cognition and Neuroscience, 34(10), 1308–1318. https://doi.org/10.1080/23273798.2018.1471512
Cooper, R. (2013). From perception to communication: An analysis of meaning and action using a theory of types with records (TTR). Oxford University Press (in press).
Cooper, R. (2015). Type theory, interaction and the perception of linguistic and musical events. In Orwin, M., Howes, C., & Kempson, R. (Eds.), Language, music and interaction (pp. 67–90). College Publications.
Cooper, R. (2019). Representing types as neural events. Journal of Logic, Language and Information, 28(2), 131–155.
Cooper, R., Dobnik, S., Larsson, S., & Lappin, S. (2015). Probabilistic type theory and natural language semantics. Linguistic Issues in Language Technology, 10(4), 1–43. https://doi.org/10.33011/lilt.v10i.1357
Cooper, R., & Ginzburg, J. (2015). Type theory with records for natural language semantics. In Lappin, S. & Fox, C. (Eds.), The handbook of contemporary semantic theory (2nd ed., chapter 12, pp. 375–407). Wiley-Blackwell.
Cooper, R. P., & Peebles, D. (2015). Beyond single-level accounts: The role of cognitive architectures in cognitive scientific explanation. Topics in Cognitive Science, 7(2), 243–258. https://doi.org/10.1111/tops.12132
Daltrozzo, J., & Schön, D. (2009). Conceptual processing in music as revealed by N400 effects on words and musical targets. Journal of Cognitive Neuroscience, 21(10), 1882–1892. https://doi.org/10.1162/jocn.2009.21113
de Ruiter, J. P. (2004). On the primacy of language in multimodal communication. In Proceedings of the workshop on multimodal corpora (pp. 38–41). European Language Resources Association (CD-ROM).
Debras, C. (2017). The shrug: Forms and meanings of a compound enactment. Gesture, 16(1), 1–34. https://doi.org/10.1075/gest.16.1.01deb
Demberg, V., Keller, F., & Koller, A. (2013). Incremental, predictive parsing with psycholinguistically motivated tree-adjoining grammar. Computational Linguistics, 39(4), 1025–1066.
Dowty, D. R. (1979). Word meaning and Montague grammar. Reidel.
Duranti, A. (1997). Polyphonic discourse: Overlapping in Samoan ceremonial greetings. Text – Interdisciplinary Journal for the Study of Discourse, 17(3), 349–382.
Ebert, C. (2014). The non-at-issue contributions of gestures. In Workshop on demonstration and demonstratives. University of Stuttgart.
Enfield, N. J. (2009). The anatomy of meaning: Speech, gesture, and composite utterances. Language, Culture and Cognition, Vol. 13. Cambridge University Press.
Falk, J. (1980). The conversational duet. In Caron, B. R., Hoffman, M. A. B., Silva, M., Van Oosten, J., Alford, D. K., Hunold, K. A., Macauly, M., & Manley-Buser, J. (Eds.), Annual meeting of the Berkeley Linguistics Society (Vol. 6, pp. 507–514). Berkeley Linguistics Society.
Fernando, T. (2007). Observing events and situations in time. Linguistics and Philosophy, 30(5), 527–550. https://doi.org/10.1007/s10988-008-9026-1
Ferreira, F. (2005). Psycholinguistics, formal grammars, and cognitive science. The Linguistic Review, 22(2–4), 365–380. https://doi.org/10.1515/tlir.2005.22.2-4.365
Frankland, S. M., & Greene, J. D. (2020). Concepts and compositionality: In search of the brain’s language of thought. Annual Review of Psychology, 71(1), 273–303. https://doi.org/10.1146/annurev-psych-122216-011829
Fusaroli, R., Gangopadhyay, N., & Tylén, K. (2014). The dialogically extended mind: Language as skillful intersubjective engagement. Cognitive Systems Research, 29–30, 31–39. https://doi.org/10.1016/j.cogsys.2013.06.002
Garnham, A. (2010). Models of processing: Discourse. WIREs Cognitive Science, 1(6), 845–853. https://doi.org/10.1002/wcs.69
Ginzburg, J. (1994). An update semantics for dialogue. In Bunt, H. (Ed.), Proceedings of the 1st international workshop on computational semantics. Tilburg University.
Ginzburg, J. (2012). The interactive stance: Meaning for conversation. Oxford University Press.
Ginzburg, J., Cooper, R., Hough, J., & Schlangen, D. (2020a). Incrementality and HPSG: Why not? In Abeillé, A. & Bonami, O. (Eds.), Constraint-based syntax and semantics: Papers in honor of Danièle Godard. CSLI Publications.
Ginzburg, J., & Lücking, A. (2020). On laughter and forgetting and reconversing: A neurologically-inspired model of conversational context. In Proceedings of the 24th workshop on the semantics and pragmatics of dialogue, SemDial/WatchDial. Brandeis University.
Ginzburg, J., Mazzocconi, C., & Tian, Y. (2020b). Laughter as language. Glossa, 5(1), 104. https://doi.org/10.5334/gjgl.1152
Ginzburg, J., Yusupujiang, Z., Li, C., Ren, K., Kucharska, A., & Łupkowski, P. (2022). Characterizing the response space of questions: Data and theory. Dialogue and Discourse (forthcoming).
Goodwin, C. (1979). The interactive construction of a sentence in natural conversation. In Psathas, G. (Ed.), Everyday language: Studies in ethnomethodology (pp. 97–121). Irvington Publishers.
Goodwin, C., & Goodwin, M. H. (1992). Assessments and the construction of context. In Auer, P. & Di Luzio, A. (Eds.), Rethinking context: Language as an interactive phenomenon (Vol. 11, pp. 147–190). John Benjamins.
Gregoromichelaki, E., Cann, R., & Kempson, R. (2013). On coordination in dialogue: Sub-sentential speech and its implications. In Goldstein, L. (Ed.), Brevity (chapter 3, pp. 53–73). Oxford University Press.
Gregoromichelaki, E., Kempson, R., Purver, M., Mills, G. J., Cann, R., Meyer-Viol, W., & Healey, P. G. T. (2011). Incrementality and intention-recognition in utterance processing. Dialogue and Discourse, 2(1), 199–233. https://doi.org/10.5087/dad.2011.109
Hadar, U., Steiner, T. J., & Rose, F. C. (1985). Head movement during listening turns in conversation. Journal of Nonverbal Behavior, 9(4), 214–228.
Hamm, F., Kamp, H., & Van Lambalgen, M. (2006). There is no opposition between formal and cognitive semantics. Theoretical Linguistics, 32(1), 1–40.
Hanning, B. R. (1989). Conversation and musical style in the late eighteenth-century Parisian salon. Eighteenth-Century Studies, 22(4), 512–528.
Hasson, U., Ghazanfar, A. A., Galantucci, B., Garrod, S., & Keysers, C. (2012). Brain-to-brain coupling: A mechanism for creating and sharing a social world. Trends in Cognitive Sciences, 16(2), 114–121. https://doi.org/10.1016/j.tics.2011.12.007
Heim, I. (1982). The semantics of definite and indefinite noun phrases. PhD thesis, University of Massachusetts Amherst.
Heylen, D. (2008). Listening heads. In Modeling communication with robots and virtual humans (pp. 241–259). Springer.
Hilton, K. (2018). What does an interruption sound like? PhD thesis, Stanford University.
Holler, J., & Levinson, S. C. (2019). Multimodal language processing in human communication. Trends in Cognitive Sciences, 23(8), 639–652. https://doi.org/10.1016/j.tics.2019.05.006
Hummel, J. E. (2011). Getting symbols out of a neural architecture. Connection Science, 23(2), 109–118. https://doi.org/10.1080/09540091.2011.569880
Irish, M. (2020). On the interaction between episodic and semantic representations – Constructing a unified account of imagination. In Abraham, A. (Ed.), The Cambridge handbook of the imagination (Cambridge Handbooks in Psychology, pp. 447–465). Cambridge University Press. https://doi.org/10.1017/9781108580298.027
James, W. (1981). The principles of psychology. Harvard University Press.
Jones, M. R., & Boltz, M. (1989). Dynamic attending and responses to time. Psychological Review, 96(3), 459–491. https://doi.org/10.1037/0033-295X.96.3.459
Kamp, H. (1979). Events, instants and temporal reference. In Bäuerle, R., Egli, U., & von Stechow, A. (Eds.), Semantics from different points of view (Springer Series in Language and Communication, Vol. 6, pp. 376–417). Springer.
Kamp, H., & Reyle, U. (1993). From discourse to logic. Kluwer Academic Publishers.
Kempson, R., Meyer-Viol, W., & Gabbay, D. M. (2001). Dynamic syntax. Blackwell Publishers.
Kendon, A. (1967). Some functions of gaze-direction in social interaction. Acta Psychologica, 26(1), 22–63. https://doi.org/10.1016/0001-6918(67)90005-4
Kendon, A. (2002). Some uses of the head shake. Gesture, 2(2), 147–182.
Kendon, A. (2004). Gesture: Visible action as utterance. Cambridge University Press.
Kim, J. (1984). Concepts of supervenience. Philosophy and Phenomenological Research, 45(2), 153–176.
Krader, L. (2010). Noetics: The science of thinking and knowing. Peter Lang.
Krakauer, J. W., Ghazanfar, A. A., Gomez-Marin, A., MacIver, M. A., & Poeppel, D. (2017). Neuroscience needs behavior: Correcting a reductionist bias. Neuron, 93(3), 480–490. https://doi.org/10.1016/j.neuron.2016.12.041
Larsson, S. (2002). Issue-based dialogue management. PhD thesis, Gothenburg University.
Lascarides, A., & Stone, M. (2009). Discourse coherence and gesture interpretation. Gesture, 9(2), 147–180.
Lerner, G. H. (1988). Collaborative turn sequences: Sentence construction and social action. PhD thesis, University of California.
Levinson, S. C., & Torreira, F. (2015). Timing in turn-taking and its implications for processing models of language. Frontiers in Psychology, 6, 731.
Lewis, D. (1979). Scorekeeping in a language game. In Bäuerle, R., Egli, U., & von Stechow, A. (Eds.), Semantics from different points of view (Springer Series in Language and Communication, Vol. 6, pp. 172–187). Springer.
Liotti, M., Ryder, K., & Woldorff, M. G. (1998). Auditory attention in the congenitally blind: Where, when and what gets reorganized? NeuroReport, 9(6), 1007–1012.
Litwin, P., & Miłkowski, M. (2020). Unification by fiat: Arrested development of predictive processing. Cognitive Science, 44, e12867. https://doi.org/10.1111/cogs.12867
Loehr, D. (2007). Aspects of rhythm in gesture and speech. Gesture, 7(2), 179–214.
Lücking, A., & Ginzburg, J. (2020). Towards the score of communication. In Proceedings of the 24th workshop on the semantics and pragmatics of dialogue, SemDial/WatchDial. Brandeis University.
Lücking, A., & Ginzburg, J. (2021). Saying and shaking ‘no’. In Proceedings of the 28th international conference on Head-Driven Phrase Structure Grammar, HPSG 2021. University Library.
Lücking, A., Mehler, A., & Menke, P. (2008). Taking fingerprints of speech-and-gesture ensembles: Approaching empirical evidence of intrapersonal alignment in multimodal communication. In Proceedings of the 12th workshop on the semantics and pragmatics of dialogue, LonDial’08 (pp. 157–164). King’s College London.
Marr, D. (1982). Vision. Freeman.
Mazzocconi, C., Tian, Y., & Ginzburg, J. (2020/22). What is your laughter doing there? A taxonomy of the pragmatic functions of laughter. IEEE Transactions on Affective Computing, 13(3), 1301–1321 (published online 2020).
McNeill, D. (1992). Hand and mind – What gestures reveal about thought. University of Chicago Press.
Mehler, A., & Lücking, A. (2012). Pathways of alignment between gesture and speech: Assessing information transmission in multimodal ensembles. In Giorgolo, G. & Alahverdzhieva, K. (Eds.), Proceedings of the international workshop on formal and computational approaches to multimodal communication under the auspices of ESSLLI.
Meteyard, L., Cuadrado, S. R., Bahrami, B., & Vigliocco, G. (2012). Coming of age: A review of embodiment and the neuroscience of semantics. Cortex, 48(7), 788–804. https://doi.org/10.1016/j.cortex.2010.11.002
Mondada, L. (2014). The local constitution of multimodal resources for social interaction. Journal of Pragmatics, 65, 137–156. https://doi.org/10.1016/j.pragma.2014.04.004
Mondada, L. (2016). Challenges of multimodality: Language and the body in social interaction. Journal of Sociolinguistics, 20(3), 336–366. https://doi.org/10.1111/josl.1_12177
Montague, R. (1974). Pragmatics. In Thomason, R. (Ed.), Formal philosophy. Yale University Press.
Mundy, P., & Newell, L. (2007). Attention, joint attention, and social cognition. Current Directions in Psychological Science, 16(5), 269–274. https://doi.org/10.1111/j.1467-8721.2007.00518.x
Nummenmaa, L., & Calder, A. J. (2009). Neural mechanisms of social attention. Trends in Cognitive Sciences, 13(3), 135–143. https://doi.org/10.1016/j.tics.2008.12.006
Perner, J., Huemer, M., & Leahy, B. (2015). Mental files and belief: A cognitive theory of how children represent belief and its intensionality. Cognition, 145(Suppl. C), 77–88. https://doi.org/10.1016/j.cognition.2015.08.006
Pickering, M. J., & Garrod, S. (2004). Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27(2), 169–190.
Pickering, M. J., & Garrod, S. (2013). An integrated theory of language production and comprehension. Behavioral and Brain Sciences, 36(4), 329–347. https://doi.org/10.1017/S0140525X12001495
Poesio, M., & Rieser, H. (2010). Completions, continuations, and coordination in dialogue. Dialogue and Discourse, 1(1), 1–89.
Poggi, I. (2001). Mind markers. In The semantics and pragmatics of everyday gestures. Verlag Arno Spitz.
Pollard, C., & Sag, I. A. (1994). Head-driven phrase structure grammar. CSLI Publications.
Purver, M. (2006). CLARIE: Handling clarification requests in a dialogue system. Research on Language & Computation, 4(2), 259–288.
Recanati, F. (2012). Mental files. Oxford University Press.
Roberts, C. (1996). Information structure in discourse: Towards an integrated formal theory of pragmatics. In OSU working papers in linguistics (Vol. 49, pp. 91–136). Department of Linguistics, The Ohio State University.
Robledo, J. P., Hawkins, S., Cornejo, C., Cross, I., Party, D., & Hurtado, E. (2021). Musical improvisation enhances interpersonal coordination in subsequent conversation: Motor and speech evidence. PLoS One, 16(4), e0250166. https://doi.org/10.1371/journal.pone.0250166
Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 50(4), 696–735.
Schegloff, E. A. (2000). Overlapping talk and the organization of turn-taking for conversation. Language in Society, 29, 1–63.
Schegloff, E. A. (2007). Sequence organization in interaction. Cambridge University Press.
Sebanz, N., & Knoblich, G. (2009). Prediction in joint action: What, when, and where. Topics in Cognitive Science, 1(2), 353–367. https://doi.org/10.1111/j.1756-8765.2009.01024.x
Stalnaker, R. C. (1978). Assertion. In Cole, P. (Ed.), Syntax and semantics (Vol. 9, pp. 315–332). Academic Press.
Stivers, T., & Enfield, N. J. (2010). A coding scheme for question–response sequences in conversation. Journal of Pragmatics, 42(10), 2620–2626.
Streeck, J. (2009). Gesturecraft. Gesture Studies, Vol. 2. John Benjamins.
Streeck, J., & Hartge, U. (1992). Previews: Gestures at the transition place. In Auer, P. & Di Luzio, A. (Eds.), The contextualization of language (pp. 135–157). John Benjamins.
Tannen, D. (1984). Conversational style: Analyzing talk among friends. Oxford University Press.
Thompson, H. S. (1993). Conversation as musical interaction. Unpublished lecture, HCRC, Edinburgh.
Tian, Y., & Ginzburg, J. (2016). No I am: What are you saying “No” to? In Sinn und Bedeutung 21. The University of Edinburgh.
Tian, Y., Maruyama, T., & Ginzburg, J. (2017). Self-addressed questions and filled pauses: A cross-linguistic investigation. Journal of Psycholinguistic Research, 46(4), 905–922.
Tomasello, M. (1999). The cultural origins of human cognition. Harvard University Press.
Tuite, K. (1993). The production of gesture. Semiotica, 93(1/2), 83–105.
Vertegaal, R., Slagter, R., van der Veer, G., & Nijholt, A. (2001). Eye gaze patterns in conversations: There is more to conversational agents than meets the eyes. In Proceedings of SIGCHI 2001, CHI ’01 (pp. 301–308). Association for Computing Machinery. https://doi.org/10.1145/365024.365119
Vilhjálmsson, H., Cantelmo, N., Cassell, J., Chafai, N. E., Kipp, M., Kopp, S., Mancini, M., Marsella, S., Marshall, A. N., Pelachaud, C., Ruttkay, Z., Thórisson, K. R., van Welbergen, H., & van der Werf, R. J. (2007). The behavior markup language: Recent developments and challenges. In Pelachaud, C., Martin, J.-C., André, E., Chollet, G., Karpouzis, K., & Pelé, D. (Eds.), Intelligent virtual agents (pp. 99–111). Springer.
Werning, M. (2020). Predicting the past from minimal traces: Episodic memory and its distinction from imagination and preservation. Review of Philosophy and Psychology, 11, 301–333. https://doi.org/10.1007/s13164-020-00471-z
Yuan, J., Liberman, M., & Cieri, C. (2006). Towards an integrated understanding of speaking rate in conversation. In Proceedings of INTERSPEECH (pp. 541–544). International Speech Communication Association.
Yuan, J., Liberman, M., & Cieri, C. (2007). Towards an integrated understanding of speech overlaps in conversation. In ICPhS XVI. The International Congress of Phonetic Sciences.
Fig. 1. A dialogue-semantics perspective for completing the systemic understanding of multimodal discourse.

Table 1. Bayesian update table.

Table 2. Vertical relevance: possible content relationships between overlapping utterances across two speakers.