The world we see is populated by colors, textures, edges, and countless other visual features. Yet we see more than a collection of features: We also see whole objects, and relations within and between those objects. How are these entities represented? Here, we advance the case for LoT-like representation in perception. We argue that at least two types of visual representations are compositional, and we explore their connections with the rest of the mind.
Consider the hands in Figure 1A. Although they differ in various superficial features, they appear to share something: their structure – specifically, their skeletal structure. The same parts are connected in the same ways, just in different poses. Similarly, the middle shape in Figure 1B shares its structure with the left shape but not the right shape, even though the middle and right shapes share other features. Skeletal representations describe shapes via their parts' intrinsic axes and connections, often in a hierarchical tree format, wherein certain parts “descend” or “offshoot” from others (Feldman & Singh, 2006). Copious evidence suggests that skeletal representations are psychologically real, implicated in detection (Kovács & Julesz, 1994; Wilder, Feldman, & Singh, 2016), discrimination (Lowet, Firestone, & Scholl, 2018), categorization (Wilder, Feldman, & Singh, 2011), aesthetics (Van Tonder, Lyons, & Ejima, 2002), and more (Firestone & Scholl, 2014; Psotka, 1978).
Figure 1. Demonstrations of compositionality in visual perception. (A) The three hands shown here differ in global shape, the locations of their boundaries, and other surface features; however, they appear to share something: Their structure – specifically, their skeletal structure (indicated by the inset colored lines). The same parts have taken on different poses. Skeletal shape representations describe objects in terms of the axes of their parts, including how those parts are arranged with respect to one another, in ways that instantiate several core LoT properties. (Adapted from Lowet et al., 2018.) (B) Skeletal shape representations explain why infants and adults can see that the middle shape shares something with the leftmost shape that it does not share with the rightmost shape, even though the middle and rightmost shapes share other features. (Adapted from Ayzenberg & Lourenco, 2019.) (C) The three object pairs shown here differ in a variety of visual features, and even involve different objects – but each seems to instantiate the same relation: containment. Recent evidence suggests that the mind rapidly and automatically encodes such relations, representing the relation itself separately from the objects participating in it. (Adapted from Hafri et al., 2020.) (D) These two images depict the same objects (cat and mat) and the same relation (support), but differ in their structure – a cat on a mat is a very different scene from a mat on a cat. Put differently, “argument order” matters: R(x,y) may be quite different than R(y,x), and there is evidence that visual processing is sensitive to this difference in compositional structure. (Adapted from Hafri & Firestone, 2021.)
We contend that skeletal representations exhibit several of Quilty-Dunn et al.'s LoT properties: Discrete constituents, role-filler independence, and abstract content. First, skeletal representations contain discrete constituents that represent axis structure independently of surrounding boundaries, composing with boundary representations to describe overall shape. This may explain why infants (Ayzenberg & Lourenco, 2022) and adults (Wilder et al., 2011) categorize novel shapes by skeletal structure despite differences in surface properties. Second, representations of individual parts exhibit role-filler independence, retaining identity over changes in position within the overall skeletal representation. Such transportability (Fodor, 1987) explains why we can easily determine when distinct shapes share the same parts, and why such shapes prime one another (Cacciamani, Ayars, & Peterson, 2014). Third, skeletal representations are abstract, expressing aspects of shape that appear stable despite part articulations (Fig. 1A), changes in surface properties (Fig. 1B; Green, 2019), and sense modality (Green, 2022). Moreover, visual brain areas encode skeletal structure across surface changes (Ayzenberg, Kamps, Dilks, & Lourenco, 2022; Hung, Carlson, & Connor, 2012; Lescroart & Biederman, 2013). Skeletal representations may also encode nonmetric, categorical properties – for example, straight/curved and symmetric/asymmetric (Amir, Biederman, & Hayworth, 2012; Green, 2017; Hafri, Gleitman, Landau, & Trueswell, 2023).
We suggest that these LoT properties make skeletal representations compositional: Discrete constituents encoding different geometrical elements and properties combine to form representations of global shape.
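As a purely illustrative sketch of this idea (not a model proposed in the work cited above), the hierarchical tree format can be mimicked with a small data structure in which each part is a discrete, reusable constituent and pose is stored separately from structure. The names `Part`, `SkeletonNode`, `structure`, and `hand`, the `curved` flag, and the angle values are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Part:
    """A discrete constituent: one part's axis (hypothetical encoding)."""
    name: str
    curved: bool = False  # categorical, nonmetric property: straight vs. curved


@dataclass
class SkeletonNode:
    """A part placed in the tree; child nodes offshoot from this part."""
    part: Part
    angle_deg: float = 0.0  # metric pose relative to the parent (assumed)
    children: List["SkeletonNode"] = field(default_factory=list)

    def structure(self):
        """Pose-free description: which parts descend from which."""
        subtrees = sorted((c.structure() for c in self.children), key=repr)
        return (self.part, tuple(subtrees))


def hand(finger_angles):
    """Build a toy 'hand': a palm with one offshoot per named finger."""
    palm = SkeletonNode(Part("palm"))
    for name, angle in finger_angles.items():
        palm.children.append(SkeletonNode(Part(name), angle_deg=angle))
    return palm


# Same parts connected in the same way, in different poses (cf. Fig. 1A).
open_hand = hand({"thumb": 40.0, "index": 10.0, "pinky": -20.0})
fist = hand({"thumb": 90.0, "index": 120.0, "pinky": 110.0})
# A shape with a different part structure.
two_fingers = hand({"thumb": 40.0, "index": 10.0})

same_structure = open_hand.structure() == fist.structure()
different_structure = open_hand.structure() != two_fingers.structure()
```

In this toy encoding, the two poses compare as structurally identical because `structure()` ignores `angle_deg`, and the same `Part` values recur across shapes, which loosely mirrors the role-filler independence described above.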
Compositionality in vision extends to relations between objects. Consider the object pairs in Figure 1C. They appear to share something: the relation containment. Visual processing respects this commonality – it represents relations between objects, beyond the objects themselves (Hafri & Firestone, 2021). Such representations also exhibit several LoT properties. First, visual processing represents relations abstractly and categorically: Observers are more sensitive to metric changes across relational category boundaries (e.g., from containing to merely touching) than within (e.g., from one instance of containment to another; Lovett & Franconeri, 2017), and even “confuse” instances of the same relation for one another (Hafri, Bonner, Landau, & Firestone, 2020). Furthermore, visual brain areas encode eventive relations abstractly, generalizing across event participants (Hafri, Trueswell, & Epstein, 2017; Wurm & Lingnau, 2015).
Second, such representations contain discrete constituents and exhibit role-filler independence, in ways that augment Quilty-Dunn et al.'s discussion. Consider Figure 1D. Both images involve the same objects (cat and mat) and relation (support), but cat-on-mat differs from mat-on-cat in compositional structure. Thus, “argument order” matters – the “fillers” map to different roles. Recent work shows that vision is sensitive to this difference. When observers repeatedly reported the location of a target individual (e.g., blue-shirted man) in a stream of action photographs (e.g., blue-kicking-red, red-pushing-blue), a “switching cost” emerged: Slower responses when the target individual's role (Agent/Patient) switched (e.g., pusher on trial n − 1 but kickee on trial n), suggesting that observers encoded relational structure automatically (Hafri, Trueswell, & Strickland, 2018).
These properties make representations of categorical between-object relations compositional: Discrete constituents encoding entities and relations combine to form representations of structured situations.
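The R(x,y) notation from Figure 1D can be made concrete with a minimal sketch, offered only as an illustration under stated assumptions: the `Relation` type, its field names `x` and `y`, the two helper functions, and the containment examples (egg, cup, ball, box) are all invented here and do not come from the studies cited.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """A structured situation R(x, y) built from discrete constituents."""
    predicate: str  # categorical relation, e.g. "support" or "containment"
    x: str          # filler of the first role
    y: str          # filler of the second role


def same_relation(a: Relation, b: Relation) -> bool:
    """True when two situations share the relation, whatever the objects."""
    return a.predicate == b.predicate


def same_constituents(a: Relation, b: Relation) -> bool:
    """True when two situations use the same relation and the same objects,
    regardless of which object fills which role."""
    return a.predicate == b.predicate and {a.x, a.y} == {b.x, b.y}


# Same objects and same relation, but different argument order (cf. Fig. 1D).
cat_on_mat = Relation("support", x="cat", y="mat")
mat_on_cat = Relation("support", x="mat", y="cat")

# Different objects, same abstract relation (cf. Fig. 1C); objects invented.
egg_in_cup = Relation("containment", x="egg", y="cup")
ball_in_box = Relation("containment", x="ball", y="box")

shares_parts = same_constituents(cat_on_mat, mat_on_cat)
differs_in_structure = cat_on_mat != mat_on_cat
shares_relation = same_relation(egg_in_cup, ball_in_box)
```

Here the two support scenes contain identical constituents yet compare as unequal, because the fillers map to different roles, while the two containment scenes share a predicate despite having no objects in common.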
The prospect of LoT-like, compositional visual representations impacts broader debates about perception's format. Many claim that perceptual representations are constitutively iconic, analog, or “picture-like” (Burge, 2022; Carey, 2009; Dretske, 1981; Kosslyn, Thompson, & Ganis, 2006). However, although LoT-like formats clearly suffice to encode categorical, nondegreed relations (e.g., containment), many iconic formats may not – particularly accounts requiring perceptual icons to mirror graded degrees of difference in perceptible properties (e.g., orientation or brightness; Block, 2023).
This perspective also raises exciting questions and research directions. For example, it may partially explain how information from perception is “readily consumed” by cognitive and linguistic systems (because of the similar formats of some perceptual and higher-level representations; Cavanagh, 2021; Quilty-Dunn, 2020). Recent work explores these connections explicitly: Skeletal shape representations impact aesthetic preferences and linguistic descriptions of shapes (Sun & Firestone, 2022a, 2022b), and representations of symmetry and roles may be shared across perception and language (Hafri et al., 2018, 2023; Rissman & Majid, 2019; Strickland, 2017). One could also investigate the “psychophysics” of compositional processes – the timing and ordering of how relational representations are built from their parts.
Nevertheless, LoT-like perceptual representations may not be fully language-like. Although perception plausibly predicates properties of individuals (Quilty-Dunn & Green, 2023), it may lack the full expressive freedom of first-order logic (Camp, 2018), especially logical connectives needed for truth-functional completeness (Mandelbaum et al., 2022). Perception may be able to represent that an object is red but not that it is not red. Moreover, certain perceptual formats may impose constraints on which properties are attributable to which individuals – constraints absent from higher-level cognition. Perhaps perception cannot explicitly represent relations between nonadjacent object parts, or eventive relations of long durations (e.g., a jack slowly lifting a car).
Because perception and thought confront multifarious tasks with different computational demands, we contend that they comprise a multiplicity of formats (Marr, 1982; Yousif, 2022), each optimized for different computations, and some more LoT-like than others. Thus, any theory positing a single privileged format for perception or thought should be met with suspicion. Instead, researchers should heed Quilty-Dunn et al.'s advice to “let a thousand representational formats bloom” (target article, sect. 2, para. 2).
Acknowledgments
For comments on an earlier draft, the authors acknowledge members of the JHU Perception and Mind Laboratory.
Financial support
This study was supported by NSF BCS no. 2021053 awarded to C. F.
Competing interest
None.