
Toward biologically plausible artificial vision

Published online by Cambridge University Press:  28 September 2023

Mason Westfall*
Affiliation:
Department of Philosophy, Philosophy–Neuroscience–Psychology Program, Washington University in St. Louis, St. Louis, MO, USA w.mason@wustl.edu http://www.masonwestfall.com

Abstract

Quilty-Dunn et al. argue that deep convolutional neural networks (DCNNs) optimized for image classification exemplify structural disanalogies to human vision. A different kind of artificial vision – found in reinforcement-learning agents navigating artificial three-dimensional environments – can be expected to be more human-like. Recent work suggests that language-like representations substantially improve these agents' performance, lending some indirect support to the language-of-thought hypothesis (LoTH).

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2023. Published by Cambridge University Press

Image classifiers implemented with deep convolutional neural networks (DCNNs) have been taken by many to tell against language-of-thought (LoT) architectures. Quilty-Dunn et al. argue that this is a mistake. These image classifiers exhibit deep structural disanalogies to human vision, so whether or not they implement LoT architectures tells us little about human vision. This is perhaps unsurprising, because biological vision is plausibly not optimized solely for image classification (Bowers et al., 2022, p. 10). Would training artificial vision under more ecologically realistic conditions produce a more realistic model of human vision? To make progress on this question, I describe some reinforcement-learning (RL) agents trained to navigate artificial three-dimensional environments on the basis of how things appear from their perspective, and explain why we might expect their vision to be more human-like. Interestingly, language-like representations seem to be especially helpful to these agents. They explore more effectively, learn novel tasks more quickly, and are even facilitated in downstream image classification. These models arguably provide some indirect evidence for the language-of-thought hypothesis (LoTH) about human vision, and may offer some clues as to why LoT architectures arose evolutionarily.

What is biological vision optimized for, and what would artificial vision that was similarly optimized be like? One answer to the first question is that biological vision is optimized for an agent's success in their environment. Success requires a number of competences that vision must contribute to simultaneously. Agents need to effectively explore, learn new behaviors, and act to achieve their goals, all while the environment changes in often surprising ways.

Recent work in RL arguably more closely approximates the optimization problem facing biological agents. Artificial RL agents can learn to do many complex tasks, across a variety of environments – most interestingly, in this context, exploring and pursuing goals in artificial three-dimensional environments like Habitat (Savva et al., 2019), Matterport3D (Chang et al., 2017), Gibson Env (Xia et al., 2018), Franka Kitchen (Gupta et al., 2019), VizDoom (Kempka et al., 2016), Playroom (Tam et al., 2022), and City (Tam et al., 2022). One way of accomplishing this – especially in environments where environmental reward is sparse – is by making novelty intrinsically rewarding. These "curious agents" can learn, without supervision, representations that enable them to perform navigation tasks, interact with objects, and also perform better than baseline in image recognition tasks (Du, Gan, & Isola, 2021). As the authors put it, their agents are "learning a task-agnostic representation for different downstream interactive tasks" (Du et al., 2021, p. 10409).
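The core idea of making novelty intrinsically rewarding can be illustrated with a minimal count-based sketch. This is not the method of the cited papers (Du et al. use a contrastive curiosity objective); the function name, state keys, and inverse-square-root bonus schedule are all illustrative assumptions:

```python
from collections import defaultdict

def intrinsic_reward(state_key, visit_counts, scale=1.0):
    """Count-based novelty bonus: states visited rarely earn a larger
    intrinsic reward, so the agent is driven to explore even when
    environmental (extrinsic) reward is sparse or absent."""
    visit_counts[state_key] += 1
    # Bonus decays as a state becomes familiar.
    return scale / visit_counts[state_key] ** 0.5

counts = defaultdict(int)
r1 = intrinsic_reward("kitchen", counts)  # first visit: bonus 1.0
r2 = intrinsic_reward("kitchen", counts)  # revisit: bonus shrinks to ~0.707
```

In a full agent this bonus would be added to the environmental reward at each step; the decay schedule ensures that once a region of the state space is familiar, the agent is pushed toward states it has not yet represented well.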

One challenge these researchers face is how to characterize novelty. Superficial differences in viewing angle or pixel distribution can easily be rated as highly novel, leading to low-level exploration that does not serve learning conducive to achieving goals. A recent innovation is to equip RL agents with "prior knowledge, in the form of abstractions derived from large vision-language models" (Tam et al., 2022, p. 2). Doing so enables the state space over which novelty is defined to be characterized by abstract, semantic categories, such that novelty is defined in task-relevant ways (Mu et al., 2022). This method has been shown to substantially improve performance across a variety of tasks and environments, compared to nonlinguistic ways of characterizing the state space (Mu et al., 2022; Schwartz et al., 2019; Tam et al., 2022). The improvements are especially pronounced for tasks involving relations between objects, for example, "Put an OBJECT on a {bed, tray}" (Tam et al., 2022, p. 2), reminiscent of work on relations reviewed in the target article (Hafri & Firestone, 2021). As the authors note, training on vision–language representations that encode "objects and relationship," rather than on ImageNet – which is optimized for classification – should be expected to be more successful (Tam et al., 2022, p. 10).
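The shift from pixel-level to semantic novelty can be sketched by interposing a describer between observations and the novelty count. Here a toy dictionary stands in for the pretrained vision-language model of the cited work; `semantic_novelty_bonus` and the caption strings are hypothetical illustrations, not the papers' actual interfaces:

```python
from collections import defaultdict

def semantic_novelty_bonus(observation, describe, visit_counts, scale=1.0):
    """Define novelty over abstract, language-derived categories rather
    than raw observations: two views that the describer captions
    identically count as visits to the same state."""
    key = describe(observation)
    visit_counts[key] += 1
    return scale / visit_counts[key] ** 0.5

# Toy stand-in for a vision-language model: two superficially different
# observations (different viewing angles) receive the same caption.
captions = {"view_a": "a mug on a tray", "view_b": "a mug on a tray"}
counts = defaultdict(int)
b1 = semantic_novelty_bonus("view_a", captions.get, counts)
b2 = semantic_novelty_bonus("view_b", captions.get, counts)
# A pixel-level counter would rate view_b as fully novel; here its bonus
# is already discounted, steering exploration toward semantic novelty.
```

The design point is that the abstraction layer, not the bonus schedule, does the work: whatever the captioner treats as action-irrelevant variation (viewing angle, lighting) is collapsed out of the state space before novelty is ever computed.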

Why would linguistic categories facilitate performance? One possibility is that language compresses the state space in ways that facilitate successful actions. The semantic categories enshrined in natural language tend to abstract from action-irrelevant variation, and respect action-relevant variation. So, visual processing optimized relative to natural language categories is de facto optimized for action-relevant distinctions. The LoT architecture characteristic of object files and visual working memory seems well-suited to serving this function (though LoT plausibly is importantly different from natural languages; Green, 2020; Mandelbaum et al., 2022). Predicating abstract properties of individual objects in a LoT is poised to guide action, because abstract semantic categories often determine the action affordances available for some individual object, independent of nuisance variation associated with, for example, viewing angle (though viewing angle is plausibly relevant for more fine-grained control tasks; Parisi et al., 2022, p. 6). Such abstract, task-agnostic representations are also able to transfer to new tasks or environments, in which familiar kinds take on novel relevance for action.

These recent innovations in RL arguably offer indirect support for the LoTH as applied to humans. Of course, similar performance can be achieved by distinct underlying competences, and we should not exaggerate how similar even artificial RL agents' performance actually is to humans' at present. Nevertheless, language-like structures appear especially helpful for artificial agents when faced with rather more biologically plausible optimization problems than the one that faces image classifiers. Perhaps an LoT served our ancestors similarly in an evolutionary context. Language-like structures enabled creatures to encode abstract properties in a task-agnostic way, which nevertheless facilitated downstream performance on a wide variety of tasks as the environment changed. It's not hard to imagine why evolution might see to it that such a system stuck around.

Financial support

This research received no specific grant from any funding agency, commercial, or not-for-profit sectors.

Competing interest

None.

References

Bowers, J. S., Malhotra, G., Dujmović, M., Montero, M. L., Tsvetkov, C., Biscione, V., … Blything, R. (2022). Deep problems with neural network models of human vision. Behavioral and Brain Sciences, 1–74.
Chang, A., Dai, A., Funkhouser, T., Halber, M., Nießner, M., Savva, M., … Zhang, Y. (2017). Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint, arXiv:1709.06158.
Du, Y., Gan, C., & Isola, P. (2021). Curious representation learning for embodied intelligence. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10408–10417.
Green, E. J. (2020). The perception–cognition border: A case for architectural division. Philosophical Review, 129(3), 323–393.
Gupta, A., Kumar, V., Lynch, C., Levine, S., & Hausman, K. (2019). Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. arXiv preprint, arXiv:1910.11956.
Hafri, A., & Firestone, C. (2021). The perception of relations. Trends in Cognitive Sciences, 25(6), 475–492.
Kempka, M., Wydmuch, M., Runc, G., Toczek, J., & Jaśkowski, W. (2016). ViZDoom: A Doom-based AI research platform for visual reinforcement learning. 2016 IEEE Conference on Computational Intelligence and Games (CIG). IEEE, pp. 1–8.
Mandelbaum, E., Dunham, Y., Feiman, R., Firestone, C., Green, E., Harris, D., … Quilty-Dunn, J. (2022). Problems and mysteries of the many languages of thought. Cognitive Science, 46(12), e13225.
Mu, J., Zhong, V., Raileanu, R., Jiang, M., Goodman, N., Rocktäschel, T., & Grefenstette, E. (2022). Improving intrinsic exploration with language abstractions. arXiv preprint, arXiv:2202.08938.
Parisi, S., Rajeswaran, A., Purushwalkam, S., & Gupta, A. (2022). The unsurprising effectiveness of pre-trained vision models for control. International Conference on Machine Learning. PMLR, pp. 17359–17371.
Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., … Batra, D. (2019). Habitat: A platform for embodied AI research. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 9339–9347.
Schwartz, E., Tennenholtz, G., Tessler, C., & Mannor, S. (2019). Language is power: Representing states using natural language in reinforcement learning. arXiv preprint, arXiv:1910.02789.
Tam, A. C., Rabinowitz, N. C., Lampinen, A. K., Roy, N. A., Chan, S. C. Y., Strouse, D., … Hill, F. (2022). Semantic exploration from language abstractions and pretrained representations. arXiv preprint, arXiv:2204.05080.
Xia, F., Zamir, A., He, Z., Sax, A., Malik, J., & Savarese, S. (2018). Gibson Env: Real-world perception for embodied agents. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9068–9079.