1. Introduction
Crossmodal correspondences involve associations between features across sensory modalities, encompassing various multisensory signals in everyday life experiences (Motoki et al., Reference Motoki, Marks and Velasco2023; Spence, Reference Spence2011). Well-known examples include the relationship between pitch and spatial elevation (Parise et al., Reference Parise, Knorre and Ernst2014; Zeljko et al., Reference Zeljko, Kritikos and Grove2019) and between tastes and shapes, for instance, sweetness and roundness (Velasco et al., Reference Velasco, Woods, Petit, Cheok and Spence2016b). They emerge in part to help with crossmodal binding, and their value resides in helping survival and enhancing the quality of life experience (Stein et al., Reference Stein, Stanford and Rowland2014). However, little is known about how these correspondences are encoded in language and how language encoding is related to the perceptual mechanisms of crossmodal correspondences. Knowing this allows for a better understanding of the relationship between perception and language and offers new insights into the functioning of crossmodal correspondences. Specifically, it could help us comprehend how associations in crossmodal phenomena are encoded in language, and whether the cognitive mechanisms of crossmodal correspondences are transferred to language coding.
1.1. How language encodes perception
In a general sense, language encodes perception. Embodied cognition research has shown that perception can shape cognition, and therefore, language (Glenberg et al., Reference Glenberg, Witt and Metcalfe2013). Neuroimaging results also support the connections between perception, motor skills, specific sensory brain regions and linguistic concepts (Kiefer et al., Reference Kiefer, Sim, Herrnberger, Grothe and Hoenig2008; Kuhnke et al., Reference Kuhnke, Kiefer and Hartwigsen2020). Regarding the language encoding of perception, there are some fundamental elements to consider. First, there is a differential language hierarchy of the senses across cultures (e.g. Western cultures systematically have a larger lexicon for hearing and vision; Reilly et al., Reference Reilly, Flurie and Peelle2020; Winter et al., Reference Winter, Perlman and Majid2018). Second, certain cultures privilege different senses in their coding (Majid et al., Reference Majid, Roberts, Cilissen, Emmorey, Nicodemus, O’Grady, Woll, LeLan, de Sousa, Cansler, Shayan, de Vos, Senft, Enfield, Razak, Fedden, Tufvesson, Dingemanse, Ozturk, Brown, Hill, Le Guen, Hirtzel, van Gijn, Sicoli and Levinson2018). Additionally, the coding of the senses in language is far from perfect. Languages cannot completely capture perceived reality, a phenomenon known as ineffability (Levinson & Majid, Reference Levinson and Majid2014). However, it is expected that sensory language coding resembles, at least partially, its perceptual counterpart (Marks, Reference Marks1996; Speed et al., Reference Speed, Vinson, Vigliocco, Dabrowska and Divjak2015). The embodied lexicon hypothesis (Winter, Reference Winter2019) specifically postulates that sensory perceptual asymmetries (how the senses are prioritized) and sensory perceptual associations are encoded in language, as a natural consequence of embodiment and perceptual simulation.
In summary, sensory perception is expected to be encoded in language, including associations between the senses (e.g. crossmodal correspondences), with some asymmetries due to the hierarchy of the senses, and with certain limitations that would arise, at least, from the ineffability of language.
1.2. The semantic coding hypothesis
More particularly, there is evidence showing that crossmodal correspondences can also be encoded in language and affect perceptual responses.
Some words have meanings in more than one sensory modality. For instance, high and low are words with meaning in both the auditory (pitch) and the visual senses (spatial location) in the English language. Simultaneously, there is a well-known perceptual crossmodal association between pitch and spatial location (Spence, Reference Spence2011). Therefore, some words with meaning in more than one sensorial modality are probably encoding perceptual crossmodal correspondences.
When words replace perceptual stimuli in perceptual interference tasks, similar results are obtained. For instance, subjects presented with sharpness words (sharp/blunt) or brightness words (bright/dull) together with an irrelevant high/low pitched tone classified the words faster when the pitch was congruent than when it was incongruent (Walker & Smith, Reference Walker and Smith1984), supporting the idea that the semantic encoding of perceptions further influences perceptual tasks.
Moreover, words do not need to be an exact description of the perceptual stimuli if they are semantically related to the stimuli. The speed of classification of a high/low pitched tone was interfered with when congruent/incongruent words like white/black were presented, but also when the original words were replaced by the words day/night (Martino & Marks, Reference Martino and Marks1999). Subjects presented with words within an irrelevant shape outline (angular/curved) classified congruent words faster than incongruent words, where the words were intended to convey hardness (granite/fur), pitch (squeak/drone) or brightness (glisten/gloom) (Walker, Reference Walker2012). These studies suggest that perceptual encoding of crossmodal correspondences does not occur solely in isolated words but extends to a network of semantically related words.
Perceptual results vary when different languages encode crossmodal correspondences differently. For instance, people using languages with different semantic encoding for pitch such as Dutch (high/low) and Farsi (thin/thick) perform differently when asked to reproduce the pitch with congruent/incongruent visual stimuli that match/mismatch their language encoding (Dolscheid et al., Reference Dolscheid, Shayan, Majid and Casasanto2013). Interestingly, after training, Dutch speakers learned to describe pitch in the terms of Farsi language, and their performance in the task started to resemble that of the native speakers of Farsi. Such evidence supports the influence of linguistic input on semantic encoding of perception.
The aforementioned evidence seems to support the semantic coding hypothesis. This hypothesis posits that perceptual experiences from various modalities, along with the language used to describe these perceptions, serve as inputs to produce an abstract (high cognitive level) semantic network that encodes and influences crossmodal correspondences (Martino & Marks, Reference Martino and Marks2001; Melara, Reference Melara1989).
Two claims derived from the semantic coding hypothesis are important for the present work. First, if there is a clear connection between crossmodal correspondences and their encoding in language, this implies that it is possible for the mechanisms of crossmodal correspondence formation to be transferred to their encoding in language. Second, the encoding of crossmodal correspondences happens in a semantic network where perceptually related concepts are linked according to their relationships and matchings across modalities.
1.3. Formation mechanisms of crossmodal correspondences
If there is a clear connection between crossmodal correspondences and their encoding in language, this implies that it is possible for the mechanisms of crossmodal correspondences formation to be transferred to their coding in language (e.g. Saluja & Stevenson, Reference Saluja and Stevenson2018). According to Spence (Reference Spence2011), at least four cognitive mechanisms play a role in crossmodal correspondences. First, structural similarities in the brain might connect the senses; for instance, intensity might be equally represented in the brain as neural firing regardless of the senses involved. Second, statistical co-occurrence in the environment leads to the formation of crossmodal correspondences. For example, pitch and location are associated because higher pitches in natural environments are more frequently produced in higher locations (Parise et al., Reference Parise, Knorre and Ernst2014). Third, emotion appears to mediate certain correspondences, such as between music and color (Spence, Reference Spence2020a). Finally, a semantic (or lexical) mechanism has been proposed (Walker, Reference Walker2016), which is encompassed by the semantic coding hypothesis.
In the present research, we studied how crossmodal correspondences emerge in language, following some of the first three mechanisms presented above. Structural similarities are expected to be innate, and therefore, to be formed before language. Although some of them might be subsequently encoded in language, given the linguistic nature of our research, it is not possible for us to clearly separate or detect them. On the other hand, we expected to find statistical crossmodal correspondences in language. These are correlations of sensory signals that generate crossmodal correspondences emerging from everyday life experiences, whether natural (Parise, Reference Parise2016) or cultural, and humans probably use language to refer to these correlated experiences. However, not all kinds of experiences influence the development and elicitation of crossmodal correspondences with the same magnitude. We anticipated that a crossmodal semantic network might reveal semantic domains related to everyday life experiences (hereafter, domains of experience) in which crossmodal correspondences are highly prevalent.
1.4. The role of affect on crossmodal correspondences
We also expected that affect would be an essential cohesive element to strengthen crossmodal relationships in crossmodal semantic networks. Affect can be conceptualized through the affective domains identified in the literature of valence, potency or arousal and activity or dominance (see also Bakker et al., Reference Bakker, Van Der Voordt, Vink and De Boon2014). These affective domains were constructed based on the semantic differential technique (see Osgood, Reference Osgood1964, for some seminal work on this) and the research presented by Russell and Mehrabian (Reference Russell and Mehrabian1977). In their work, Mehrabian and Russell (Reference Mehrabian and Russell1974) conceptualized valence as a continuum of responses spanning from positive to negative, which they measured employing polar adjectives such as joyful or unhappy, while they conceptualized arousal as a spectrum of mental states stretching from low to high excitement, captured by terms like stimulated-relaxed. Finally, they associated dominance with notions of control and individual behavioral constraints, reflected in words such as brave-discouraged.
Senses and emotion are frequently connected by language and semantics. For instance, there is evidence that valence may be an important part of the semantic representation of taste and smell (Arshamian et al., Reference Arshamian, Gerkin, Kruspe, Wnuk, Floyd, O’Meara, Rodriguez, Lundström, Mainland and Majid2022; Majid et al., Reference Majid, Roberts, Cilissen, Emmorey, Nicodemus, O’Grady, Woll, LeLan, de Sousa, Cansler, Shayan, de Vos, Senft, Enfield, Razak, Fedden, Tufvesson, Dingemanse, Ozturk, Brown, Hill, Le Guen, Hirtzel, van Gijn, Sicoli and Levinson2018; Speed & Majid, Reference Speed and Majid2020). Some semantically related lines and shapes might be easily connected to emotions (Salgado-Montejo et al., Reference Salgado-Montejo, Salgado, Alvarado and Spence2017). Moreover, there is increasing evidence that sensory receptors and neural circuits code sensory information in terms of emotional valence and that this initial system may be involved in subsequent complex cognition, such as experiencing and naming emotions (Feinberg & Mallatt, Reference Feinberg, Mallatt, Poznanski, Tuszynski and Feinberg2016; Kryklywy et al., Reference Kryklywy, Ehlers, Anderson and Todd2020).
Recent literature, building on earlier work by researchers such as Kenneth (Reference Kenneth1923), suggests that hedonics and, more broadly, affect (Spence & Deroy, Reference Spence and Deroy2013; Velasco et al., Reference Velasco, Woods, Deroy and Spence2015) play a role in the mediation of crossmodal correspondences. In this case, perceptual experiences in different contexts that are tied to the same emotion also show crossmodal associations. Two good examples are the emotional mediation between shapes and tastes (Velasco et al., Reference Velasco, Woods, Deroy and Spence2015) and the mediation between colors and music (Hauck et al., Reference Hauck, von Castell and Hecht2022; Palmer et al., Reference Palmer, Schloss, Xu and Prado-León2013; Spence, Reference Spence2020a). Our expectation was that the affective dimensions of crossmodal associations would be visible in the crossmodal semantic network.
1.5. The encoding of perceptual meaning
How can the language encoding of perception be represented in an accessible and operable way? One possibility is constructing modality norms, where people rate words according to the strength of elicitation of each sense associated with the word. An example of this is the work of Lynott and Connell (Reference Lynott and Connell2009), which is detailed in Section 2.1. Representations based on human ratings might include other aspects of meaning beyond perception. For instance, Binder et al. (Reference Binder, Conant, Humphries, Fernandino, Simons, Aguilar and Desai2016) created an encoding of meaning for 535 words using 63 cognitive features neurobiologically motivated and organized in 14 categories, including the senses separated as categories.
An alternative possibility is to base meaning encoding on distributional semantics, i.e. leveraging statistical patterns of language usage to construct a mathematical artifact that conveys the meaning of words. Indeed, distributional semantics captures semantic association proximity based on the linguistic distributional hypothesis, which states that the semantic proximity between two linguistic expressions is a function of the similarity of the linguistic contexts where the expressions appear (Harris, Reference Harris1954).
Distributional semantics does not rule out embodied cognition in meaning; in fact, it can capture and explain it (Louwerse, Reference Louwerse2011). For instance, when people had to switch perceptual modalities in a property verification task, processing speed was slower than when they stayed in the same modality (Pecher et al., Reference Pecher, Zeelenberg and Barsalou2003). Accordingly, latent semantic analysis distances (a distributional semantics technique) are shorter when words are from the same modality (Louwerse, Reference Louwerse2011). Johns and Jones (Reference Johns and Jones2012) found that words from the same modality but otherwise not related (typewriter, piano) are also closer when comparing first-order co-occurrence statistics. That evidence suggests that information on perception modalities and their differences is captured in distributional semantics approaches.
Based on latent semantic analysis, it was possible to replicate maps and distances between cities with some accuracy (Louwerse & Zwaan, Reference Louwerse and Zwaan2009). Such a result has been replicated for Word Embeddings (Konkol et al., Reference Konkol, Brychcín, Nykl, Hercig, Kondrak and Watanabe2017) and large language models (Gurnee & Tegmark, Reference Gurnee and Tegmark2023), showing that even information that is not primarily linguistic can nevertheless be encoded in distributional semantic approaches.
Louwerse and Connell (Reference Louwerse and Connell2011) showed that the modality of words based on human ratings can be predicted from first-order co-occurrences. Utsumi (Reference Utsumi2020) tried to predict the meaning representations of Binder et al. (Reference Binder, Conant, Humphries, Fernandino, Simons, Aguilar and Desai2016) using multiple distributional semantic approaches based on Word Embeddings. Although results showed an overall higher predictability for abstract knowledge, results for concrete and perceptual knowledge were good enough to suggest that perceptual knowledge is likely encoded in Word Embeddings more than expected. In summary, perceptual meaning can be inferred, at least in part, from distributional semantic approaches.
Word Embeddings are a distributional semantic approach that produces vector encodings of words. Such vectors are the result of training neural network architectures on word co-occurrences in large text corpora (Mikolov et al., Reference Mikolov, Chen, Corrado, Dean, Bengio and LeCun2013). Therefore, the closer two word vectors are, the more similar are the contexts in which the words are used and the more closely related are their meanings, revealing an association between them.
Word Embeddings have been shown to capture diverse types of semantic relationships, including topic relatedness (Das et al., Reference Das, Zaheer and Dyer2015), semantic hierarchies such as hypernymy and hyponymy (Fu et al., Reference Fu, Guo, Qin, Che, Wang, Liu, Toutanova and Wu2014), context-dependent world knowledge (Grand et al., Reference Grand, Blank, Pereira and Fedorenko2022) and even biases present in culture and language, or across languages (Bolukbasi et al., Reference Bolukbasi, Chang, Zou, Saligrama, Kalai, Guyon, Luxburg, Bengio, Wallach, Fergus, Vishwanathan and Garnett2016; Lewis & Lupyan, Reference Lewis and Lupyan2020). Importantly, Word Embeddings have also been shown to capture emotional information (Passaro et al., Reference Passaro, Bondielli and Lenci2017). As a result, Word Embeddings allow combining perceptual, emotional and topical meaning in a single encoding.
1.6. The present work
In the present research, we used Word Embeddings to study how crossmodal correspondences emerge in language. If semantic proximity is calculated between words of different senses, we expected such proximity to reflect the semantic encoding of crossmodal correspondences. We evaluated the closeness of semantic associations of words across senses to create a crossmodal semantic network as it appears in the English language. Then, we developed mechanisms to explore the network, namely, seeking domains of experience where semantic crossmodal correspondences are more likely to emerge through community detection, and assessing the role of affective functions (particularly, valence, arousal and dominance) of semantic crossmodal correspondences in these domains of experience.
The contributions of the present work are twofold. First, we evaluate a prediction of the embodied lexicon hypothesis, specifically, that associations between senses (in the context of this research, crossmodal correspondences) are encoded in the lexicon. Second, drawing upon the semantic coding hypothesis, we assess whether the examination of an abstract semantic network of crossmodal correspondences will reveal the statistical and emotional mechanisms underlying the formation of crossmodal correspondences.
2. Methods
The present research involved eight key steps designed to assess how crossmodal correspondences emerge in language. First, we selected words and their modalities based on sensory adjectives, focusing on object properties. Second, we retrieved 300-dimensional Word Embeddings for these words and calculated distances between embeddings of different sensory modalities. Third, we generated a crossmodal correspondence network by identifying closely related word pairs based on scaled cosine distances. Fourth, we detected communities within this network using Newman’s leading eigenvector method (Newman, Reference Newman2006). Fifth, we assessed the robustness of community selection by varying thresholds and community detection methods. Sixth, we identified dominant sensory modalities across communities using Cramer’s V and chi-square statistics. Seventh, we matched words with emotional valence, arousal and dominance values obtained from the NRC VAD Lexicon (Mohammad, Reference Mohammad2018) and compared these values across communities. Finally, we identified domains of experience for each community based on semantic domains from the Intercontinental Dictionary Series (IDS) (Key & Comrie, Reference Key and Comrie2021) and the Summer Institute of Linguistics (SIL) (Moe, Reference Moe2003) classifications, as well as other factors such as emotional attributes and modalities represented within the community. Below, we elaborate on the details associated with each step.
2.1. Selection of words and their modalities
We used the list of 423 object properties from Lynott and Connell (Reference Lynott and Connell2009), extensively employed in various psycholinguistic analyses (Connell & Lynott, Reference Connell and Lynott2012, Reference Connell and Lynott2014). Lynott and Connell (Reference Lynott and Connell2009) extracted a list of sensory adjectives describing object properties in the English language from diverse sources, such as dictionaries and thesauruses, with the aim of a wide coverage of the senses. They established modality exclusivity norms for these adjectives based on the average rating given for each word in each sensory modality on a scale from zero to five. For each word, the modality that, on average, received the highest rating was selected as the dominant modality. Additionally, a modality exclusivity index was obtained by dividing the range of average ratings across the senses by the sum of these ratings. For instance, harsh is a word with an auditory dominant modality and a low modality exclusivity rating (0.12) due to its potential meaning in other senses, whereas deafening is also auditory dominant but much more exclusive (modality exclusivity rating: 0.77).
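As an illustration of how the dominant modality and the modality exclusivity index described above can be derived from mean ratings, the following minimal Python sketch may help. It is not the code used to build the norms; the function name `modality_exclusivity` and the ratings in `example` are hypothetical and chosen only for demonstration.

```python
def modality_exclusivity(ratings):
    """Return (dominant modality, exclusivity index) for one word.

    ratings: dict mapping a modality name to its mean rating (0-5 scale).
    The index is the range of the ratings divided by their sum.
    """
    values = list(ratings.values())
    total = sum(values)
    if total == 0:
        raise ValueError("all ratings are zero; the index is undefined")
    dominant = max(ratings, key=ratings.get)       # highest mean rating
    exclusivity = (max(values) - min(values)) / total
    return dominant, exclusivity


# Hypothetical mean ratings, NOT taken from the published norms.
example = {"auditory": 4.0, "haptic": 1.0, "visual": 0.5,
           "gustatory": 0.0, "olfactory": 0.0}
dominant, index = modality_exclusivity(example)
print(dominant, round(index, 2))
```

An index near one indicates a word rated almost exclusively in one modality, whereas an index near zero indicates similar ratings across all modalities.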
Two critical reasons supported the selection of Lynott and Connell exclusivity norms (Reference Lynott and Connell2009) over more recent alternatives such as the affective ratings of the Lancaster sensorimotor norms (Lynott et al., Reference Lynott, Connell, Brysbaert, Brand and Carney2020). First, Lynott and Connell norms (Reference Lynott and Connell2009) focus on object properties, which is relevant given that we are specifically exploring associations between sensory features. These associations are better represented by the sensory properties of objects rather than by the extensive list of commonly used lemmas in English across all syntactic categories within the Lancaster sensorimotor norms. For instance, using the Lancaster sensorimotor norms, we might find an association between dog (higher in its visual rating) and bark (higher in its auditory rating). Such association is known as crossmodal semantic congruence (Laurienti et al., Reference Laurienti, Kraft, Maldjian, Burdette and Wallace2004), because it emerges from a shared identity or meaning, not from crossmodally corresponding features as those involved in crossmodal correspondences (Knoeferle et al., Reference Knoeferle, Knoeferle, Velasco and Spence2016). Second, the object properties by Lynott and Connell (Reference Lynott and Connell2009) were specifically chosen for their potential to capture the semantics of the senses. Consequently, Lynott’s norms exhibit greater modality exclusivity compared to norms based on nouns (Lynott & Connell, Reference Lynott and Connell2013) or those using randomly selected word sets (Winter, Reference Winter2019, Ch. 12). Greater modality exclusivity implies that this set of words is less experienced in multiple senses simultaneously, reducing the influence exerted by the multisensory nature of the words on the relationships found between different senses.
A potential limitation is that the selected norms are constrained to the Aristotelian senses (i.e. sight, hearing, touch, smell and taste). It is worth noting that our senses are not limited to these traditional five (Velasco & Obrist, Reference Velasco and Obrist2020). However, we focused on these considering the available linguistic tools to approach them in text analysis, their frequent use in everyday scenarios and their utility as a basic model for research (Winter, Reference Winter2019, Ch. 2).
The result of this first stage of the research is the list of words, their dominant modality and their modality exclusivity index.
2.2. Word Embedding retrievals and calculations of distances between embeddings of different modalities
We retrieved Word Embeddings for the English language from the Google News vectors database. These vectors are the result of training a neural network on the Google News dataset of about 100 billion words (Mikolov, Reference Mikolov2013). We then extracted 300-dimensional Word Embedding vectors (i.e. a list of 300 scores in a vector) for each of the 423 words of the Lynott and Connell norms (Reference Lynott and Connell2009), described in Section 2.1 and already classified by their dominant sensory modality.
The next step was to determine how similar or closely related the words previously classified into different sensory modalities were (e.g. the words within the taste and smell modalities). Given a pair of different dominant modality word sets (A,B), we calculated the cosine distance between each pair of Word Embedding vectors ($ \overrightarrow{W_A}\in A,\overrightarrow{W_B}\in B $) and scaled it by the average cosine distance of $ \overrightarrow{W_A} $ with all the other word vectors of modality B. Consequently, we detected word vectors of A that are closer to or further from a specific word vector of B, rather than simply closer to or further from all word vectors of B. For instance, the word pulsing has high scores in both the auditory and haptic modalities, although its dominant modality is haptic. However, due to its high auditory score, its raw cosine distance is close to the vast majority of auditory words (see Figure 1a). In this way, the raw cosine distance is only capturing the multimodality of the word pulsing, but not its relative proximity to specific auditory words or features. By dividing the cosine distance of pulsing to each auditory word by the overall average cosine distance of pulsing to all auditory words, we determined which auditory words are genuinely close to pulsing, beyond the fact that pulsing is close to all auditory words in general (see Figure 1b). Notice that the aforementioned process can generate different scaled cosine distances for pair ($ \overrightarrow{W_A},\overrightarrow{W_B} $) and ($ \overrightarrow{W_B},\overrightarrow{W_A} $), yielding 20 crossed sets between the five senses.
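The scaling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this study: the three-dimensional vectors are invented for readability (the actual vectors have 300 dimensions), the function names are hypothetical, and the average is taken over all words of modality B, following the pulsing example above.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def scaled_distances(set_a, set_b):
    """Scale each A-to-B distance by the mean distance of that A word to all of B."""
    scaled = {}
    for word_a, vec_a in set_a.items():
        raw = {word_b: cosine_distance(vec_a, vec_b)
               for word_b, vec_b in set_b.items()}
        mean_raw = sum(raw.values()) / len(raw)
        for word_b, distance in raw.items():
            scaled[(word_a, word_b)] = distance / mean_raw
    return scaled


# Toy vectors (assumption): invented values, not real embeddings.
haptic = {"pulsing": [1.0, 1.0, 0.0],
          "grainy": [0.0, 1.0, 1.0],
          "weightless": [0.0, 0.5, 1.0]}
auditory = {"rhythmic": [1.0, 0.0, 0.0],
            "soundless": [0.0, 0.0, 1.0]}

haptic_to_auditory = scaled_distances(haptic, auditory)
auditory_to_haptic = scaled_distances(auditory, haptic)
print(haptic_to_auditory[("pulsing", "rhythmic")])
print(auditory_to_haptic[("rhythmic", "pulsing")])
```

Because each word is scaled by its own average, the two printed values differ even though the raw cosine distance between the two words is the same in both directions, which is why the two orderings of a modality pair are treated as separate crossed sets.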

Figure 1. Effects of raw cosine distance vs. scaled cosine distance.
Note: White circles represent auditory words; dark grey circles represent haptic words. Dotted oval depicts words that are close to pulsing depending on the selected distance metric. In panel a, raw cosine distance was used; in panel b, scaled cosine distance.
Scaled cosine distances are positive real numbers. A value below one means a shorter distance among words than the average across all words for this pair of modalities, and a value above one means a larger distance among words than the average across all words for this pair of modalities. For instance, the words audible (auditory) and bronze (visual) have a large distance (1.20), whereas the words reverberating (auditory) and rippling (visual) have a short distance (0.48) and might be candidates for a semantic crossmodal correspondence.
Once calculated, the scaled cosine distances configured a complete bipartite graph between pairs of modalities, including all pairs of modalities. To illustrate, Figure 2a shows an example between two modalities, auditory and haptic, using only four words in each one (barking, buzzing, rhythmic and soundless for the auditory modality and feverish, pulsing, weightless and grainy for the haptic modality).

Figure 2. Example of complete bipartite graph and sparse graph.
White circles represent selected auditory words; dark grey circles represent selected haptic words. In panel (a), a complete bipartite graph is formed with scaled cosine distance among words; in panel (b), after selecting a threshold for closeness, less than half the relationships remain, leading to a sparse graph where even some words are isolated.
2.3. Generation of the final crossmodal correspondence network
In order to generate a sparse graph that represents only those relationships between words whose scaled cosine distance is particularly low for each pair of modalities, a threshold was determined by implementing the approach suggested by Tukey (Reference Tukey1977) for the selection of extreme points in a univariate distribution. For our particular case, this criterion selected pairs of words from different modalities with distances less than 1.5 times the interquartile range below the mean distance. The finalized sparse graph formed a network of crossmodal associations whose vertices are words from different sensory modalities, and the edges are only those associations between two words from different modalities that have the aforementioned short distances. For instance, Figure 2b shows a reduction in associations after applying the threshold, where the word pulsing retains almost all relationships, grainy remains connected solely with soundless, barking gets isolated, and less than half the relationships remain, leading to a sparse graph.
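The thresholding step can be illustrated with the following minimal Python sketch, which is an assumption-laden illustration rather than the code used in this study. It follows the wording above (1.5 times the interquartile range below the mean); Tukey's fences are often stated relative to the first quartile instead, so the reference point should be treated as a choice. The quartiles come from `statistics.quantiles` with its default method, which is also an assumption, and all word pairs and distances other than the two pairs quoted earlier (0.48 and 1.20) are invented placeholders.

```python
import statistics


def low_distance_threshold(distances, k=1.5):
    """Mean distance minus k times the interquartile range."""
    q1, _, q3 = statistics.quantiles(distances, n=4)
    return statistics.mean(distances) - k * (q3 - q1)


def sparse_edges(scaled, k=1.5):
    """Keep only the word pairs whose scaled distance is below the threshold."""
    threshold = low_distance_threshold(list(scaled.values()), k)
    return {pair for pair, distance in scaled.items() if distance < threshold}


# Scaled distances for one modality pair; "aud_i"/"vis_i" are placeholders.
scaled = {
    ("reverberating", "rippling"): 0.48,
    ("audible", "bronze"): 1.20,
    ("aud_1", "vis_1"): 0.98,
    ("aud_2", "vis_2"): 0.99,
    ("aud_3", "vis_3"): 1.00,
    ("aud_4", "vis_4"): 1.00,
    ("aud_5", "vis_5"): 1.01,
    ("aud_6", "vis_6"): 1.02,
    ("aud_7", "vis_7"): 1.03,
}
edges = sparse_edges(scaled)
print(edges)
```

In this toy distribution only the pair with the unusually short distance survives, so most of the complete bipartite graph is discarded.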
It is important to highlight three aspects. First, the final network is constituted of all the edges (relationships below the selected threshold) found across any pair of modalities. Each edge implies a potential crossmodal correspondence, as it can only occur between words from different modalities. Second, each pair of senses had a different threshold that depends solely on the distribution of distances for that specific pair of senses, avoiding some potential biases due to the hierarchy of senses in English. Third, some vertices (words) may be isolated from any connection with any other modality, and therefore, they do not have any candidate for a crossmodal correspondence (for instance, barking in Figure 2b). Such words were removed from the network. We assessed whether there were strong differences between removed and selected words due to dominant modality (using Cramer’s V) and modality exclusivity index (using eta squared). From this point of the analysis, scaled cosine distances were no longer used, and the only element that remained was the network of vertices and relationships that constitute the sparse graph, that is, our network of crossmodal correspondences. We also calculated centrality measures for the vertices, particularly, the eigenvector centrality, with higher values corresponding to more connected vertices, in order to detect words that are highly connected in the network.
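Removing isolated words and computing eigenvector centrality can be sketched as follows. This is a minimal, hypothetical illustration and not the code used in this study: the edge list is loosely modelled on the description of Figure 2b, the function names are invented, and the power iteration adds each vertex's own score at every step (equivalent to iterating with the adjacency matrix plus the identity), an implementation choice assumed here to help convergence on small bipartite-like graphs.

```python
def drop_isolated(vertices, edges):
    """Keep only the vertices that take part in at least one edge."""
    connected = {v for edge in edges for v in edge}
    return [v for v in vertices if v in connected]


def eigenvector_centrality(vertices, edges, iterations=200):
    """Approximate eigenvector centrality by power iteration."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    score = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # Own score plus neighbours' scores: iteration with (A + I).
        new = {v: score[v] + sum(score[n] for n in neighbours[v])
               for v in vertices}
        largest = max(new.values())
        score = {v: s / largest for v, s in new.items()}
    return score


# Hypothetical edges (assumption), loosely based on the Figure 2b description.
words = ["pulsing", "grainy", "barking", "buzzing", "rhythmic", "soundless"]
edges = [("pulsing", "buzzing"), ("pulsing", "rhythmic"),
         ("pulsing", "soundless"), ("grainy", "soundless")]

kept = drop_isolated(words, edges)
centrality = eigenvector_centrality(kept, edges)
print(max(centrality, key=centrality.get))
```

In this toy network the isolated word is dropped before centrality is computed, and the word with the most connections receives the highest score.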
2.4. Community detection in the crossmodal correspondence network
When examining a network, such as the network of sensory associations across modalities, it can be insightful to identify potential communities within it. Community detection is a well-known task performed on networks and involves extracting vertices that have a high density of interconnections among them, thereby constituting a community (Fortunato & Hric, Reference Fortunato and Hric2016). Community detection can be likened to clustering but is specifically tailored for networks. A common method of community detection revolves around maximizing modularity, a metric that compares the observed number of relationships between vertices to the expected number of relationships that would occur at random. Newman’s leading eigenvector method (Newman, Reference Newman2006) computes principal components on the modularity matrix between vertices and selects the number of components with an eigenvalue greater than one, implying a potential association of vertices higher than expected. Hence, when the eigenvalues are greater than one, it indicates that specific sets of words belong to a community within the network and are even more closely related. Newman’s eigenvector method was applied to the resulting crossmodal association network explained in Step 2.3.
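To make the modularity criterion concrete, the following sketch evaluates the standard Newman-Girvan modularity Q of a given partition of an undirected, unweighted graph. It only scores a partition; it does not implement the leading eigenvector search itself, and the function name and data layout are assumptions for illustration.

```python
def modularity(edges, community_of):
    """Modularity Q = sum over communities of (L_c / m) - (d_c / 2m)^2.

    edges: list of undirected (u, v) pairs without duplicates.
    community_of: dict mapping each vertex to its community label.
    L_c is the number of edges inside community c, d_c the sum of the
    degrees of its vertices, and m the total number of edges.
    """
    m = len(edges)
    degree_sum = {}
    internal = {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )
```

For two triangles joined by a single edge, splitting the graph into the two triangles gives Q = 5/14, whereas placing all six vertices in one community gives Q = 0, which is why the split is preferred.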
2.5. Assessment of robustness of community detection
To assess the robustness or stability of the communities identified in Step 2.4, we applied the following procedures: (a) implementing a different threshold to select the crossmodal candidate relationships, namely, three standard deviations below the mean for each pair of modalities, and (b) implementing different types of community detection. Specifically, we employed a popular alternative method for maximizing modularity, the Louvain method (Blondel et al., Reference Blondel, Guillaume, Lambiotte and Lefebvre2008), and a community detection method (Infomap) based on a different strategy, namely, minimum description length of a random walk through the network (Rosvall & Bergstrom, Reference Rosvall and Bergstrom2008). We then compared the results of the modified communities with the initial communities obtained in Step 2.4 using the purity metric. For each initial community, this metric counts the number of words shared with the best-matching modified community and divides it by the total number of words in the initial community.
In addition, it can be expected that sensory domains of experience, in general, overlap with the crossmodal domains of experience, because in contexts that are rich in sensory information people can make use of crossmodality to reduce the complexity of information. To assess this, we compared our communities with a clustering constructed over sensory representations. In particular, we compared our results with the clusters built by Winter (Reference Winter2019, Ch. 13) over the vector of modality ratings of Lynott and Connell (Reference Lynott and Connell2009) described in Section 2.1, which yielded 13 clusters, with several large clusters for visual and haptic dimensions, whereas the chemical senses were confined to smaller, less specialized clusters. In addition, a large cluster of multisensory words was identified.
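The purity metric can be written down directly from its definition. The sketch below assumes each partition is stored as a dictionary from community label to a set of words; the names are illustrative.

```python
def purity(initial, modified):
    """Purity of each initial community against a modified partition.

    initial, modified: dicts mapping a community label to a set of words.
    For every initial community, take the modified community sharing the
    most words with it and divide that overlap by the initial size.
    """
    result = {}
    for label, words in initial.items():
        best_overlap = max(
            (len(words & other) for other in modified.values()), default=0
        )
        result[label] = best_overlap / len(words)
    return result
```

A purity of 1 means the whole initial community reappears inside a single modified community, while low values indicate that its words were scattered across several communities.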
2.6. Identifying and comparing dominant modalities across communities
We analyzed communities with more than five words in order to have enough data to draw meaningful conclusions. We measured the strength of the association between the sensory modality of the words and the community to which they belong using Cramer’s V, and calculated the chi-square statistic to show the most represented modalities in each community.
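The association measures named above can be sketched for a community-by-modality contingency table. This sketch assumes Pearson residuals, (observed − expected) / √expected, as the standardized residuals; the paper does not state which residual variant was used, so that choice is an assumption, as are the function name and table layout.

```python
import math


def chi_square_summary(table):
    """Return (chi2, Cramer's V, Pearson residuals) for a table of counts.

    table: list of rows of observed counts, e.g. communities x modalities.
    Assumes every row and column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            res = (observed - expected) / math.sqrt(expected)
            res_row.append(res)
            chi2 += res ** 2  # chi-square is the sum of squared residuals
        residuals.append(res_row)

    k = min(len(table), len(table[0])) - 1
    cramers_v = math.sqrt(chi2 / (n * k))
    return chi2, cramers_v, residuals
```

Positive residuals flag modalities that are over-represented in a community relative to independence, and negative residuals flag under-represented ones.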
2.7. Identification and comparison of valence, arousal and dominance across communities
We matched each word with values of emotional valence, arousal and dominance extracted from the NRC VAD Lexicon (Mohammad, Reference Mohammad2018). Valence is the core emotional dimension of pleasure/displeasure; arousal is the emotional dimension of excitation/calm (Bliss-Moreau et al., Reference Bliss-Moreau, Williams and Santistevan2020; Russell, Reference Russell2003); and dominance is the emotional dimension of control over the context. The NRC VAD lexicon assigns values of valence, arousal and dominance ranging from zero to one, with zero being the lowest and one the highest. Table 1 shows an example of words for each dominant modality with all their corresponding information.
Table 1. Final word example information by modality

We selected the NRC VAD lexicon over the more widely known affective lexicon by Warriner et al. (Reference Warriner, Kuperman and Brysbaert2013) for two primary reasons. The first reason pertains to the coverage each lexicon provided for the 423 words from Lynott and Connell (Reference Lynott and Connell2009). In the case of the NRC VAD lexicon, 111 words (26.2%) from Lynott and Connell’s list were not initially found; in the case of Warriner et al. (Reference Warriner, Kuperman and Brysbaert2013), that number increased to 169 (39.9%). To address this issue, we extracted canonical forms of the words not found in each lexicon (linguistic stems and lemmas). Only 31 (7.3%) words remained without emotional values in the NRC VAD lexicon, whereas 94 words (22.2%) were left without emotional values using the lexicon developed by Warriner et al. (Reference Warriner, Kuperman and Brysbaert2013). The second reason pertains to a slight improvement in the results of the NRC VAD lexicon compared to that of Warriner et al. (Reference Warriner, Kuperman and Brysbaert2013). Notably, the NRC VAD lexicon does not rely on obtaining ratings for each word. Instead, it involves the simultaneous comparison of four words, from which the highest and lowest in the quality being measured (for example, valence) are selected. This approach allows for the determination of five out of the six potential relationships among the four words in a single trial (Louviere et al., Reference Louviere, Flynn and Marley2015). This led to a better split-half reliability for the NRC VAD lexicon compared to the lexicon by Warriner et al. (Reference Warriner, Kuperman and Brysbaert2013). It is worth noting that there is low correlation between the two lexicons regarding arousal and dominance scores (Mohammad, Reference Mohammad2018).
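The claim that one best-worst trial resolves five of the six pairwise relationships among four words can be checked with a short enumeration. The four words and the best/worst choices below are hypothetical and serve only to illustrate the counting argument.

```python
from itertools import combinations


def resolved_pairs(items, best, worst):
    """Ordered pairs (higher, lower) implied by a single best-worst choice."""
    known = set()
    for item in items:
        if item != best:
            known.add((best, item))   # the best item outranks every other item
        if item != best and item != worst:
            known.add((item, worst))  # every middle item outranks the worst
    return known


# Hypothetical trial: four words rated on valence.
items = ["sweet", "rancid", "warm", "dull"]
known = resolved_pairs(items, best="sweet", worst="rancid")
all_pairs = list(combinations(items, 2))
unresolved = [
    (a, b) for a, b in all_pairs
    if (a, b) not in known and (b, a) not in known
]
```

Only the relation between the two middle items (here, warm versus dull) stays undetermined, which is the single missing relationship out of six.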
We then assessed the emotional differences across communities using eta squared as a measure of the difference between communities, and calculated confidence intervals for each estimate of valence, arousal and dominance in each community.
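Eta squared is the share of total variance explained by group membership, that is, the between-group sum of squares divided by the total sum of squares. A minimal sketch, assuming the scores of each community are stored in a dictionary (an illustrative layout):

```python
def eta_squared(groups):
    """Eta squared = SS_between / SS_total for a one-way grouping.

    groups: dict mapping a community label to a list of scores
    (for example, valence values in the range [0, 1]).
    """
    values = [x for scores in groups.values() for x in scores]
    grand_mean = sum(values) / len(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_between = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    return ss_between / ss_total
```

Values near 0 indicate that communities barely differ on the emotional dimension, and values near 1 indicate that almost all of the variation lies between communities.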
2.8. Identification of domains of experience across communities
To identify the domains of experience of each community, we pinpointed the most closely related semantic domains for each of these communities. Semantic domains refer to classifications of concepts, which are reflected in groups of words whose meanings are highly related (Hills et al., Reference Hills, Todd and Jones2015; Nerlich & Clarke, Reference Nerlich and Clarke2000). This interconnectedness stems from their association with real-world phenomena, causing these words to revolve around specific subjects or thematic experiences (Brinton & Brinton, Reference Brinton and Brinton2010). While there are various types of semantic domains, we chose those whose classification is based on areas of interest common to human experience across multiple cultures, specifically, the classification from the Intercontinental Dictionary Series (IDS) (Key & Comrie, Reference Key and Comrie2021) and the Summer Institute of Linguistics (SIL) (Moe, Reference Moe2003). IDS semantic domains are derived from Buck’s (Reference Buck2008) thesaurus, and comprise 22 semantic domains with 1,310 lexical entries across many languages, including English. On the other hand, SIL developed, based on the family of Bantu languages, a list of semantic domains for any language, with nine high-level domains in a hierarchy that has up to four levels, comprising around 1,600 total semantic domains. All words from each community were searched on the SIL webpage of semantic domains (Summer Institute of Linguistics, 2021) to identify the most frequently recurring semantic domains. Additionally, for each community, the average raw cosine distance between all the words of the community and all the IDS semantic domains was calculated, and the three semantic domains with the lowest average distance were extracted for each community. Sense Perception is one of the IDS domains and, by its very nature, is closely related to all the communities, without adding any distinguishing information. For that reason, this domain was discarded from the analyses. Finally, a label was selected for each community by assessing the following elements: SIL related domains, IDS related domains, average valence, arousal and dominance, and the most represented modalities in the community. These elements helped to describe each domain of experience (i.e. a semantic domain related to everyday life experiences).
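The ranking of IDS domains by average cosine distance can be sketched as follows. The sketch assumes that each semantic domain is represented by a single embedding vector and that word vectors are available as plain lists of floats; how the domains were actually embedded is not specified here, so both the representation and the function names are assumptions.

```python
import math


def cosine_distance(u, v):
    """Raw cosine distance, 1 - cos(u, v), for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def closest_domains(community_vectors, domain_vectors, k=3,
                    excluded=("Sense Perception",)):
    """Return the k domains with the lowest mean distance to a community.

    community_vectors: list of word vectors for one community.
    domain_vectors: dict mapping a domain name to its vector (assumption).
    excluded: domains left out of the ranking, as with Sense Perception.
    """
    averages = {}
    for name, domain_vec in domain_vectors.items():
        if name in excluded:
            continue
        dists = [cosine_distance(w, domain_vec) for w in community_vectors]
        averages[name] = sum(dists) / len(dists)
    return sorted(averages, key=averages.get)[:k]
```

With real embeddings the vectors would have hundreds of dimensions, but the ranking logic is the same as in a two-dimensional toy case.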
3. Results
3.1. The crossmodal network
A complete set of 60,265 scaled cosine distances was generated between pairs of words from different senses. A subset of 1,206 pairs of words was selected as crossmodal semantic associations, corresponding to pairs of words from different modalities with particularly small scaled cosine distances between them (distances less than 1.5 times the interquartile range below the mean distance). These 1,206 pairs contained 378 distinct words from the original set of 423 words (89.9% of the words; hereafter, the selected words), and comprised 2% of all possible combinations, ranging from 1.1% to 2.6% of all possible combinations for each pair of sensory modalities. The 1,206 pairs are available as supplementary material at the link in the data availability statement.
The difference between the dominant modality of selected words and that of removed words was small, showing no apparent bias against any particular modality ($ {\Phi}_{Cramer} = 0.13 $) (Table 2).
Table 2. Selection of words by modality

Note: Number of words selected/unselected as crossmodal correspondences candidates divided by modality.
On the other hand, selected words had slightly lower modality exclusivity than removed ones, but the effect size was small ($ {\eta}^2 = .007 $), showing a small bias in selection toward multisensory words. Overall, there was little evidence of dominant modality or modality exclusivity affecting the formation of the network of crossmodal correspondences, suggesting that the methodological decisions helped to avoid potential biases.
3.2. Exploration of communities
Community detection extracted 13 communities or clusters (hereafter, domains of experience). Each domain of experience comprised highly interconnected words, where a connection indicates a low scaled cosine distance between the words and, therefore, higher semantic proximity. Five domains of experience had fewer than five associated words and were removed from the analysis. Figure 3 shows the words belonging to each of the domains of experience divided by their dominant modality, with sizes proportional to relative centrality within each group. An explanation of the assigned labels is presented in Section 3.3.

Figure 3. Words in the domains of experience.
Words belonging to each domain of experience. Colors indicate the dominant modality of each word, as calculated by Lynott and Connell (Reference Lynott and Connell2009). Word size is proportional to the centrality of the word in the network.
Figure 4 shows the two largest domains, Domain 2 and Domain 4, in more detail as networks of relationships, given their size (for a graph of the remaining domains, see the supplementary material in the data availability statement). Specific areas of the domains of experience are visible.

Figure 4. Detailed network of the two largest domains of experience. (a) Domain 2.
Domain 2 – Nature as an example of the matches uncovered in the five sensory modalities (visual, haptic, auditory, gustatory and olfactory). Links between words indicate semantic closeness. Colors indicate the dominant modality of each word, as calculated by Lynott and Connell (Reference Lynott and Connell2009). In Domain 2, from left to right, four clear zones are visible. At the extreme lower left, there are mostly haptic and visual words related to mass and size (examples: immense, heavy and large); at the inner lower left, a highly crossmodal zone connected to roughness and, in general, negative valence words (examples: painful, smelly, ugly, harsh, bitter) is visible. At the inner upper right, atmospheric patterns related to temperature, humidity and light (examples: chilly, stormy, cool, warm, slippery) stand out. Finally, at the extreme right, predominantly visual qualities connected with the central words quiet and cool are found. (b) Domain 4.
Domain 4 – People as an example of the matches uncovered in the five sensory modalities (visual, haptic, auditory, gustatory and olfactory). Links between words indicate semantic closeness. Colors indicate the dominant modality of each word, as calculated by Lynott and Connell (Reference Lynott and Connell2009). In Domain 4, we detected three specific zones. The word orangey shows a color-related zone (examples: blue, beige, pink, blonde) to the left. On the right, the words spiky and leathery lead to an upper zone mostly related to textures (examples: blotchy, mottled) and shape-related words (examples: rectangular, globular, speckled), whereas the words bristly and husky lead to a lower zone connected with animal and human body properties (examples: fatty, sweaty, hairy, snarling and scrawny).
3.3. Domains of experience associations and labelling
Domains of experience differ in the modality of the words contained in them ($ {\Phi}_{Cramer} = 0.327 $). In fact, six out of eight domains of experience had a high count of words of a single sense (see Table 3).
Table 3. Domains of experience association with modalities (standardized residuals)

Note: The numbers shown represent the standardized residuals. Numbers in bold have an absolute value greater than 1.5 standard deviations. Red cells indicate a lower word frequency than expected for this sense in this community; green cells indicate a higher word frequency than expected for this sense in this community.
 
Figure 5 shows error bars (Mean and SD) of valence, arousal and dominance for each domain. There are clear differences in average valence ($ {\eta}^2 = .162 $). Differences in average arousal ($ {\eta}^2 = .041 $) and average dominance ($ {\eta}^2 = .053 $) were smaller but important between domains.

Figure 5. Error bars of valence, arousal and dominance for each domain of experience.
Middle point depicts the average of each emotional feature in the group; emotional features range from 0 to 1. Upper and lower limits are 95% confidence limits, Bonferroni corrected.
Table 4 shows the results of the labelling process for domains of experience. Across domains of experience, it was noticeable that some IDS semantic domains are highly repetitive: Emotions and Values (4 times), The Body (4 times), Food and Drink (3 times), and The Physical World (3 times). Final labels were heuristically selected for better recall and brevity, although their correspondence might not necessarily be one-to-one and the domains might be broader than the labels can capture. This issue is further debated in the discussion section.
Table 4. Domains of experience features

Note: Domains of experience description based on SIL and IDS semantic domains. Domains of experience labels were assigned based not only on SIL and IDS semantic domains, but also on emotional scores and dominant modalities per domain.
Figure 6 shows the labeled domains of experience, binding together their dominant modality, and the three emotional elements assessed, namely, valence, arousal and dominance.

Figure 6. Characterization of domains of experience.
The X-axis depicts valence. The Y-axis depicts dominance. Point size shows arousal. Values of the variables are in the range [0, 1]. Predominant significant sense (i.e. the sense with the highest positive residual as shown in Table 3) is depicted beside the domain of experience. Icons of senses by Takao Umehara from NounProject.com.
3.4. Robustness assessment
Table 5 shows purity results of the three methods of robustness assessment. Results were robust (mean purity above 50%) for five domains (1, 3, 4, 5 and 7). This indicates that, even when using alternative clustering methods, a high proportion of words were grouped into similar communities. In the other three domains, two scenarios were observed. Domain 2 (Nature), which comprises a large number of words, tended to be further subdivided into two or three groups. Conversely, the words from Domain 6 (Toxicity) and Domain 8 (Deterioration) often remained isolated or tended to integrate into other communities. Overall, results were fairly robust, with some caveats for two of the smallest communities and a potential subdivision within Domain 2 (Nature).
Table 5. Purity of communities after robustness assessment

Note: Values of purity (number of words of original community) for each robustness assessment and for Winter (Reference Winter2019) clusters. Notice that 3σ threshold has a lower number of words per community, due to a tighter threshold for candidate crossmodal correspondence selection. Mean purity for the three robustness assessments – except Winter (Reference Winter2019) – is also depicted.
When comparing our results with Winter (Reference Winter2019), purity is clearly lower than that for variations in method/threshold, as shown in Table 5, on almost every domain. Further analysis showed that the ‘multisensory’ cluster of Winter (Reference Winter2019) is divided among all our communities, indicating that our results aim to more accurately capture this multisensory phenomenon. In fact, the high purity of Domain 7 (Toxicity) is due exactly to this ‘multisensory’ cluster. Other than that, Domain 1 (Food) mixes Winter’s ‘chemical’ and ‘taste’ clusters (plus some ‘multisensory’ words) and Domain 2 (Nature) extracts words from all the clusters with a bias toward touch and sight clusters. In summary, although some overlap is present, there are clear differences among Winter (Reference Winter2019) clusters and our communities due to their different goals (i.e. grouping the senses vs. capturing crossmodal correspondences in different senses).
4. Discussion
The present study focused on studying the emergence of crossmodal correspondences in language, offering a comprehensive analysis of how crossmodal experiences are interconnected in linguistic constructs. Overall, our data support both the semantic coding and embodied lexicon hypotheses. The discussion revolves around this particular support, which we present in four sections. In section 4.1, the evidence for perceptual crossmodal associations transferred to language is presented for three specific cases (pitch, color and tastes), supporting the embodied lexicon hypothesis. In sections 4.2 and 4.3, evidence of the importance of the statistical and emotional mechanisms of crossmodal correspondences for the semantic organization of associations between the senses is discussed. Section 4.4 presents the domains of experience and their overall connection with the semantic coding hypothesis.
4.1. Perceptual crossmodal associations in the semantic network
We focused on three particular cases: pitch, color and tastes.
4.1.1. Pitch
There are no words that solely represent high or low pitch in our dataset. However, several auditory words have pitch as a part of their meaning. Surprisingly, most words related to high pitch are concentrated in Domain 3 (Beauty): cooing, tinkling, wailing, whistling, jingling, crackling. Notice that other domains, such as Domains 1, 5, 6 and 8, do not have clearly pitched words (see Figure 3, particularly panel (c)). There is also a group of brightness-related words in the same domain: shimmering, glistening, brilliant. The relationship between brightness and high pitch is a well-known crossmodal correspondence, appearing with both perceptual and semantic stimuli (Martino & Marks, Reference Martino and Marks1999; Spence, Reference Spence2011).
Moreover, Domain 3 (Beauty) is a high valence domain that includes very specific gustatory words. Two of them, honeyed and caramelised, are semantically related to sweetness. Concurrently, there is a tested association between sweet taste and high pitch, whether stimuli are gustatory words (Crisinel & Spence, Reference Crisinel and Spence2009, Reference Crisinel and Spence2010a) or real tastants (Crisinel & Spence, Reference Crisinel and Spence2010b; Wang et al., Reference Wang, Wang and Spence2016). Importantly, such relationships seem to be emotionally mediated (Wang et al., Reference Wang, Wang and Spence2016).
Studies of the pitch–flavor relationship have found that sour is associated with high pitch too (Crisinel & Spence, Reference Crisinel and Spence2009, Reference Crisinel and Spence2010b; Knoeferle et al., Reference Knoeferle, Woods, Käppler and Spence2015). It is important to notice that the word sour has no close relations with high-pitched words and it is located in Domain 2 (Nature) close to words of negative valence (ugly, harsh, painful). In addition, four important high pitch words are not in Domain 3 (Beauty): giggling/whimpering in Domain 4 (People) (Figure 4b) and shrill/whining in Domain 7 (Irritation) (Figure 3f). All four words have negative valence. Overall, there is no clear relationship between high-pitched negative words and tastes in our evidence.
Low-pitched words are not that clear in our stimuli. They seem to be more located in Domain 2 (Nature) (resounding, raucous), and be confounded with loudness, hindering the possibility of assessing pitch–taste relationships, since loudness has been connected to the concentration of tastants, rather than its taste (Wang et al., Reference Wang, Wang and Spence2016).
Is pitch associated with high and low spatial locations in our results? The words high and low are both in Domain 2 (Nature) (Figure 4a). The only crossmodal connection of high is booming, which is arguably related to low pitch but also to loudness. In turn, booming is also connected with big and large, suggesting the well-known association between pitch and size, i.e. lower pitches are associated with bigger objects (Korzeniowska et al., Reference Korzeniowska, Simner, Root-Gutteridge and Reby2022). On the other hand, low is connected with quiet and mild, suggesting an association based upon intensity. Thus, the presence of crossmodal correspondences between pitch and vertical position in our evidence is unclear, likely due to interference from other perceptual dimensions affecting the meaning of words.
In summary, there is fair evidence of known pitch crossmodal associations in the semantic crossmodal network, although not all relationships are clearly present in our results.
4.1.2. Color
As shown in Figure 4b, most color words related to hue are located in Domain 4 (People) and share a single link with the gustatory word orangey. At first glance, crossmodal associations with color appear to be mostly absent from the evidence. A closer look reveals a different picture.
Domain 4 (People) deals with animals, human bodies and clothing. The presence of the colors in this domain is particularly relevant, given the importance of colors in visual recognition (Bramão et al., Reference Bramão, Reis, Petersson and Faísca2011). In fact, source identification is key to the relationships found between colors, odors and tastes; food color is diagnostic (Saluja & Stevenson, Reference Saluja and Stevenson2018; Spence, Reference Spence2020b). Without proper source identification, associations with colors display great variability among cultures, individuals (Goubet et al., Reference Goubet, Durand, Schaal and McCall2018; Wan et al., Reference Wan, Woods, Van Den Bosch, McKenzie, Velasco and Spence2014) and even food categories (Velasco et al., Reference Velasco, Escobar, Spence and Olier2023). Thus, they might not clearly emerge in a semantic network specifically crafted to avoid source objects. However, the visual and central word ‘earthy’ in Domain 1 (Food), referring to a wider range of colors identifiable with multiple sources, shows connections with gustatory words (chocolatey, cloying, flavoursome) and olfactory words (aromatic, fragrant, perfumed). Close crossmodal relationships of colors, odors and tastes in the food context have long been studied (see Spence, Reference Spence2020b, for a review).
Referring to the relationship between colors and music, it is worth noting that emotional mediation is a likely frequent mechanism of correspondence, and that the relationship encompasses hue, lightness and saturation (Palmer et al., Reference Palmer, Schloss, Xu and Prado-León2013). As shown in the previous section, high-valenced Domain 3 (Beauty) comprises many bright-related words (shimmering, glowing) along with auditory pleasant words (cooing, tinkling). Moreover, the word rhythmic is central in such domain. Finally, the word melodious in the high-valenced Domain 1 (Food) is directly linked to color words such as earthy and flowery.
Overall, we found evidence of color crossmodal associations mediated by emotion within the semantic crossmodal network.
4.1.3. Tastes
When discussing pitch, we have already evaluated its correspondence with tastes. In addition, the crossmodal correspondences literature shows a strong relationship between tastes and visual-haptic features, where sweetness is related to curved shapes (Velasco et al., Reference Velasco, Woods, Petit, Cheok and Spence2016b), creaminess (Carvalho et al., Reference Carvalho, Wang, Van Ee, Persoone and Spence2017) and softness (Pistolas & Wagemans, Reference Pistolas and Wagemans2023), whereas sourness/saltiness/bitterness are related to sharpness/angularity (Velasco et al., Reference Velasco, Woods, Marks, Cheok and Spence2016a). In such relationships, some studies have demonstrated emotional mediation (Chuquichambi et al., Reference Chuquichambi, Munar, Spence and Velasco2024; Salgado Montejo et al., Reference Salgado Montejo, Alvarado, Velasco, Salgado, Hasse and Spence2015; Velasco et al., Reference Velasco, Woods, Deroy and Spence2015). Sweet-related words are in Domain 3 (Beauty) (e.g. honeyed, caramelised, see Figure 3c) and in Domain 1 (Food) (e.g. sweet, biscuit, see Figure 3a). Both domains are high in valence. In Domain 3 (Beauty), there are many haptic central words (silky, buttery, creamy) and the word curved is linked to creamy. On the other hand, sour and bitter are in Domain 2 (Nature), close to sharp, bumpy, steep, rough and prickly. It is noticeable that angular is in Domain 3 (Beauty) and soft in Domain 2 (Nature). Such closeness of antonyms is well known in word embeddings, and arises from the fact that they probably share similar contexts (Ono et al., Reference Ono, Miwa, Sasaki, Mihalcea, Chai and Sarkar2015). Nevertheless, they are not central in their respective assigned domains.
In summary, we found compelling evidence for taste-related crossmodal correspondences, likely mediated by emotion, in the semantic crossmodal network.
Broadly speaking, the three cases presented earlier show that perceptual associations in the crossmodal phenomena are encoded in language structures, supporting the embodied lexicon hypothesis.
4.2. Emotion in the crossmodal semantic network
Our results show that emotion is a key dimension of the semantic organization of concepts between sensory modalities (see Figures 5 and 6). Indeed, valence, arousal and dominance emerged as salient dimensions within the domains of experience. This observation resonates with Spence and Deroy’s (Reference Spence and Deroy2013) work, emphasizing the role of affective factors in mediating crossmodal correspondences (see also Spence, Reference Spence2020a).
In our research, valence showed the largest effect size of the three emotional elements, suggesting its role as a core component of the crossmodal correspondence architecture in language. The previous section depicted valence-mediated crossmodal relationships about the perceptual features of pitch, color and tastes.
Dominance had the second effect in size across communities. Low dominance highlights domains of experience where there might be out-of-control threat elements. In Domain 7 (Irritation), several elements of powerless perception are bound together: the auditory complaining of whining and clamorous, with the loose texture of flaky and mushy and the visual lacking vigor of insipid and dull. These elements are not easily linked to a single simultaneous experience and are more easily explained due to their connection to a similar emotional state. The role of dominance in crossmodal correspondences continues to be a topic of exploration, potentially involving high-order cognitive based influences in specific crossmodal relationships.
Regarding arousal, the higher arousal intensity of Domain 8 – Deterioration reflects an increase in the urgency of acting to address potential risks. Conversely, the lower arousal intensity of Domain 1 – Food refers to potentially enjoyable experiences, where the word melodious is linked to color words such as earthy and flowery. Several studies have shown arousal to be a mediator in the correspondences of music and color (Lindborg & Friberg, Reference Lindborg and Friberg2015; Whiteford et al., Reference Whiteford, Schloss, Helwig and Palmer2018).
The centrality of the positively valenced words melodious and sonorous within the Food domain (Figure 3a) supports and adds to the long-established relationship between music and food, which seems to be mediated by emotion (Reinoso-Carvalho et al., Reference Reinoso-Carvalho, Gunn, Molina, Narumi, Spence, Suzuki, ter Horst and Wagemans2020; Spence, Reference Spence2019). In fact, pleasures such as food, music and sex appear to activate similar brain regions (Blood & Zatorre, Reference Blood and Zatorre2001).
One standing issue regarding the role of emotion in the semantic crossmodal network concerns the relative contribution of emotional mediation versus other potential explanations, such as the statistical co-occurrence of emotional experiences or the semantic use of shared emotional qualities. That is, beyond emotional mediation, certain multisensory experiences might share affect (e.g. rancid and rotten in Domain 6 – Toxicity) or we might frequently use words with shared affect to convey meaning (e.g. sweet and melodious in Domain 1 – Food). Although it is not possible to fully separate the contribution of these factors in the present research, the examples presented earlier (i.e. pitch and taste, music and color) demonstrate the emotional mediation that is present in our evidence. Therefore, emotional mediation likely contributes, to a certain degree and alongside other factors, to the role of emotion in the semantic network of association between senses in language.
4.3. Statistical associations in the crossmodal semantic network
Domains not clearly characterized by emotional features may be better characterized by statistical associations (i.e. similar arrangement of stimuli in time and space across senses). Domain 5 – Movement (Figure 3e) hints toward relationships determined by patterns in space and time. The triad flickering (visual)–pulsing (haptic)–reverberating (auditory) is a good example, suggesting that the temporal crossmodal nature of the sound-induced flash illusion (Hirst et al., Reference Hirst, McGovern, Setti, Shams and Newell2020) appears in language and might extend to other senses.
Domain 2 – Nature (Figure 4a) also shows strong connections between temperature (hot, warm, cold), humidity (slippery, dry, wet) and light-related words (sunny, foggy, hazy), all related to weather. Crossmodal literature has mostly explored the relationship of hue and pitch with temperature, but has focused less on the relationship with brightness, based upon intensity (Spence, Reference Spence2020c). Some previous linguistic analyses on weather have found both a component representing weather conditions and an evaluative component, showing that emotional and statistical correlation factors are present in the weather lexicon (Stewart, Reference Stewart2007). However, weather is usually evaluated in valence terms (i.e. positive and negative) (Stewart, Reference Stewart2020), likely affecting the non-polarized average valence found in the Nature domain of experience.
In Domain 4 – People (Figure 4b), textural and visual patterns (fluffy, leathery, waxy) connect with people descriptors (blonde, brunette) and then with colors. This shows that these patterns might become a mechanism for recognizing objects, animals and people, and that such recognition might be facilitated by statistical associations between senses. For instance, fatty (gustatory), bulky (visual), greasy (haptic) and stinky (olfactory) might arguably stem from repeated exposure to the same multisensory experiences in the environment, not necessarily related to a single source.
Intensity is a well-known physical phenomenon underlying crossmodal correspondences, frequently associated with the senses of smell and taste (Hanson-Vaux et al., Reference Hanson-Vaux, Crisinel and Spence2013; Velasco et al., Reference Velasco, Woods, Marks, Cheok and Spence2016a). Our evidence shows plentiful intensity-related crossmodal correspondences (for example, heavy/large/booming in the Nature domain, crackling/tingly/glowing in the Beauty domain, soundless/tiny/weightless in the Movement domain). Sometimes the variation in the direction of the intensity (for instance, the chilly/drab/raucous connection in the Nature domain) is concurrent with the specific context (a cold, dark, windy day).
Crossmodal statistical associations related to changes in space and time are also present in the evidence. There are two striking patterns related to transformation and change: processes of disintegration and fire-related changes. Disintegration changes relate to the physical actions of breaking and falling (burnt, melted, creaking – Figure 3h). This is the reason for naming the domain ‘Deterioration’: processes within this domain share a decrease in the organization of systems that can be translated between senses.
Fire-related changes proved highly crossmodal in nature in our evidence: the words crackling (auditory), crisp (haptic) and charred (visual) are connected to one another and initially located in the Beauty domain (Figure 3c). What is more, they are connected with several gustatory-olfactory words (e.g. the connection of crackling with earthy, lemony, oniony, smoky in the Food domain, and the connection of charred with burnt in the Deterioration domain). Furthermore, they are connected to several light-related words (e.g. the connection of crisp with vivid, brilliant and dazzling). In fact, crackling is one of the more central words in our network.
4.4. Binding all together: Domains of experience
Here, it is important to mention that the emotional and statistical accounts of crossmodal correspondences might not be mutually exclusive (Spence, Reference Spence2011). Instead, as suggested by our study, they appear to be part of a broader architecture of crossmodal associations, as captured in language. Importantly, each crossmodal domain of experience depicts a more complex picture than captured by the assigned label. Indeed, each domain of experience might include a mix of statistical correlations and emotional correspondences. The Nature domain (Figure 4a), for instance, exhibits a large subnetwork of valenced words (ugly, painful, harsh, putrid). The Beauty domain (Figure 3c) also shows many patterned words (rhythmic, shimmering, tinkling, creamy).
Overall, the combination of emotion and statistical correlations in the present evidence reveals eight key domains of experience. According to our evidence, such domains are connected to the broader and cross-cultural IDS domains of Food, The Physical World, The Body and Emotions and Values. We believe that these findings help to further clarify specific sensory domains where diverse types of crossmodal correspondences might be important. In fact, most of these domains were well-established domains of experience connected before to crossmodal correspondences (for the physical world, see Parise & Spence, Reference Parise and Spence2009; for emotion, see Spence, Reference Spence2020a; for food, see Velasco et al., Reference Velasco, Salgado-Montejo, Marmolejo-Ramos and Spence2014). It is also worth highlighting domains present in IDS that did not appear in our analysis. Although absence of evidence is not evidence of absence, our results did not show crossmodal communities strongly associated with IDS semantic domains such as The House, Warfare and Hunting, Religion and Beliefs, and Social and Political Relations, among the most striking. For instance, despite the documented connection between moral judgment and taste (Chapman et al., Reference Chapman, Kim, Susskind and Anderson2009; Eskine et al., Reference Eskine, Kacinik and Prinz2011), we did not observe this connection reflected in our work, where it could have appeared in the form of connections with Religion and Beliefs or Social and Political Relations.
We believe that the evidence presented in this research supports a clear connection between crossmodal correspondences and their encoding in language. More particularly, it supports the idea that the emotional and statistical mechanisms of crossmodal correspondence formation are encoded in language. A potential organization of crossmodal correspondences in language can be achieved through the structure of a semantic network, wherein perceptually related concepts are linked according to their relationships and matches across modalities, as predicted by the semantic coding hypothesis.
5. Conclusions, limitations, and future work
Our findings indicate that many perceptual crossmodal correspondences are embedded in language. Beyond simply reflecting these correspondences, language can capture their emotional and perceptual mechanisms (statistical correspondences) in a crossmodal semantic network. These results align with the embodied lexicon hypothesis and the semantic coding hypothesis.
The crossmodal semantic network reveals domains of experience where crossmodal correspondences are present across a wide range of everyday life contexts, including food, weather, nature, the bodies of people and animals, danger, threat and fire.
Our work underscores the connection between perception and language and the importance of emotions. It demonstrates that our feelings and the statistical regularities that we perceive are key in organizing sensory concepts across different domains (Noel et al., Reference Noel, Blanke and Serino2018).
Our results open interesting opportunities to discover, or hypothesize, specific crossmodal perceptual correspondences, contributing to a new perceptual research agenda. Furthermore, studying sensory associations using a computational model allows us to move beyond one-on-one matches and consider how crossmodal (in)consistency across contexts and objects influences experience, decision-making, and sensory integration. After all, sensory associations impact how we integrate sensory information. Understanding how language encodes crossmodal correspondences is crucial as it ultimately influences how neurons encode perceptual information. Our analysis can also inform the development of hypotheses and guidelines for fields as diverse as design (Velasco & Spence, Reference Velasco and Spence2019) or neurological rehabilitation (Tinga et al., Reference Tinga, Visser-Meily, van der Smagt, Van der Stigchel, van Ee and Nijboer2016).
Future research applying a similar method across languages can provide insights into the evolutionary drivers of linguistic constructs. If certain crossmodal associations are shared across languages, it might suggest deep-rooted evolutionary reasons for these correspondences. Given that distributional semantic encodings have been built on diverse languages, analyzing the differences and similarities in crossmodal correspondences across these languages can shed light on cultural nuances in sensory perceptions.
It is important to mention here that our methodological decisions might have influenced the results obtained. For instance, expanding the list of words used from Lynott and Connell (Reference Lynott and Connell2009) could provide a clearer picture of crossmodal correspondences in language. Additionally, our results may not generalize to other languages as we focused solely on English. Word Embeddings have limitations in providing a complete encoding of perception and in fully representing semantic memory, and they can also capture other language features. Moreover, some similarities between words could be due to relationships like hyponymy or morphology. For instance, lemony and tangy might be considered hyponyms of citrusy, leading to their grouping in Domain 1; or the suffix -ing in creaking, swinging and banging might tie these words in Domain 8. It is important to note that hyponymy and morphology are also intertwined with perceptual semantics, making it difficult to disentangle them with the available methods.
Measuring emotion-related words presents important caveats. Scoring the words in terms of valence, arousal, and dominance may present some challenges as the field of emotion research is continuously updating and revising its methods and approaches. What is understood by an emotion and the relationship between emotions and language may change as the field presses forward. Finally, the community detection method used, as shown in the robustness analysis, creates larger groups than other methods. We consider this a decision that allows a level of generalization appropriate to an initial exploration of the crossmodal language that can be further explored with other community detection methods.
Another limitation of our study is the potential for imposed labels on the identified domains of experience. Our findings suggest that the communities/domains may convey information beyond the labels we applied, focusing instead on specific features that could be more strongly associated. Future research can delve deeper into this to enhance our understanding of the domains of experience.
Given the complex nature of the crossmodal correspondences and their relation to language, it is reasonable to assume that other factors may influence the architecture of semantic crossmodal correspondences. Nevertheless, our results support the importance of emotion and perceptual statistical associations in the semantic organization of sensory concepts between sensory modalities in language, around domains of experience.
Language cannot fully represent crossmodal correspondences, because many experiences and sensorial features lack corresponding words. Additionally, the extent to which Word Embeddings capture specific perceptual information remains unclear. Cross-cultural analysis may uncover similarities and differences between crossmodality representation in different languages, potentially revealing common underlying phenomena.
Data availability statement
All code and data to reproduce results, and final data results are available at: https://osf.io/yrmhb/?view_only=9151b5859e6b4215998802222bf1c38b.
Acknowledgements
The authors thank the anonymous reviewers, and especially the action editor Bodo Winter, for their thoughtful and thorough comments on the manuscript.
Competing interest
Author 3 is owner of the company Atrianna (www.atrianna.com).
 
 










