Genies, lawyers, and smart-asses: Extending proxy failures to intentional misunderstandings

Tomer D. Ullman; Sophie Bridgers

doi:10.1017/S0140525X23002820

Published online by Cambridge University Press: 13 May 2024

Tomer D. Ullman*: Affiliation:
Department of Psychology, Harvard University, Cambridge, MA, USA www.tomerullman.org
Sophie Bridgers: Affiliation:
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA secb@mit.edu
*: Corresponding author: Tomer D. Ullman; Email: tullman@fas.harvard.edu

Abstract

We propose that the logic of a genie – an agent that exploits an ambiguous request to intentionally misunderstand a stated goal – underlies a common and consequential phenomenon, well within what is currently called proxy failures. We argue that such intentional misunderstandings are not covered by the currently proposed framework for proxy failures, and suggest expanding it.

Type: Open Peer Commentary
Information: Behavioral and Brain Sciences, Volume 47, 2024, e86

DOI: https://doi.org/10.1017/S0140525X23002820
Copyright: Copyright © The Author(s), 2024. Published by Cambridge University Press

Making your way through busy market stalls, you chance upon an antique lamp. As you brush the dust off to inspect it, a genie springs out in a cloud of colorful smoke.

“One wish, no more, no less,” says the genie.

“Make me rich!” you reply, and immediately alarm bells go off in your head. Your mind floods with tragic tales of people who got what they asked for.

“Don't kill my parents so I inherit their money,” you hasten to add. “Actually, don't kill anyone. Also I don't want stolen mafia money. Or any kind of stolen money. Make that ‘no crimes’. And don't make it that I turn things into gold, I just want money. Real money. And don't make everyone poor so I'm rich by comparison, and…” you trail off, thinking it through.

Eventually, you place the lamp down carefully and say, “You know what? Forget I even asked.”

Compared to the examples of proxy failure in John et al., our genie example seems fanciful. But we propose that the logic of the genie – an agent that exploits ambiguous requests to intentionally misunderstand the stated goal – underlies a phenomenon that is (1) well within “proxy failures,” and (2) common and consequential, but (3) not covered by the current framework of John et al. Our point is not that John et al.'s framework is wrong; we find it both enlightening and useful. Rather, we suggest that many important situations that seem to fall under the notion of proxy failure require expanding their framework. Our argument is not a “no,” but a “yes, and.”

To start, the dynamics of intentionally misunderstanding requests follow the logic of many of the examples John et al. use when introducing the problem of interest. A tenant asked by their landlord to “do some weeding,” who then pulls out three weeds and calls it a day, is acting in line with the terms used by John et al.: A regulator (landlord) with a goal (cleared yard) conveys the goal to an agent (tenant), but uses language that doesn't match the goal directly, after which the agent engages in “hacking” or “gaming.”

Intentional misunderstandings are common and important. They show up in history (Scott, 1985), fables and art (Da Silva & Tehrani, 2016; Uther, 2004), childhood (Opie & Opie, 2001), and interpersonal conflict (Bridgers, Taliaferro, Parece, Schulz, & Ullman, 2023). Such letter-versus-spirit of the law concerns are also often discussed in the legal realm (Hannikainen et al., 2022; Isenbergh, 1982; Katz, 2010). But such concerns are not with scalar proxies standing in for a true but unknowable goal. Consider a lawyerly child watching videos on their tablet who is told, “time to put the tablet down,” and proceeds to place the tablet on a table, only to keep watching their videos. Such a child is not optimizing a scalar reward conveyed by a parent who cannot convey some complex goal and resorts to a proxy. The parent was being quite clear, and the child was being quite a smart-ass.

If we accept the above, then intentional misunderstandings pose a challenge for the framework of John et al. The framework supposes that a regulator has a difficult-to-convey goal, and instead gives an agent a different goal. But in many current models of human communication, concepts (including goals) are hidden variables, conveyed indirectly through ambiguous utterances (Goodman & Frank, 2016). This is true for any goal, including proxies. The process of recovering meaning from ambiguous utterances is usually so transparent that people don't even notice it unless it breaks down, for example, when hijacked intentionally. To see this, take the Hanoi rats (please): The original goal of killing all the rats is unobserved, but can be easily recovered from the utterance “kill all the rats.” The utterance “bring me rat tails” is not a proxy goal; it is an utterance, which can be used to derive the original goal, and is likely understood by people to mean the original goal. True, the proxy utterance can be intentionally misunderstood, but so can the original utterance – there is nothing inherently special about proxy utterances in terms of clarity from the standpoint of a theory of communication, and likewise there is nothing inherently special about a proxy goal in terms of observability.
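The idea that a goal is a hidden variable recovered from an utterance can be made concrete with a toy example. The sketch below is a single step of Bayesian inference, not the full recursive model reviewed by Goodman and Frank (2016), and every name and number in it (the two goals, the prior, the likelihoods, the costs) is a made-up assumption for illustration only. It contrasts a cooperative listener, who infers the most probable goal behind an utterance, with a genie-like listener, who picks whichever literally compatible reading is cheapest for itself.

```python
# Toy sketch: goals are hidden, utterances are observed.
# All goals, utterances, and probabilities are hypothetical assumptions.

GOALS = ["fewer_rats", "many_tails"]

# Assumed common-ground prior: a city official almost certainly wants fewer rats.
PRIOR = {"fewer_rats": 0.95, "many_tails": 0.05}

# Assumed P(utterance | goal): how likely a speaker holding each goal
# is to produce each utterance.
LIKELIHOOD = {
    "fewer_rats": {"kill all the rats": 0.7, "bring me rat tails": 0.3},
    "many_tails": {"kill all the rats": 0.05, "bring me rat tails": 0.95},
}

# Assumed effort each reading would cost the listener to satisfy.
COST = {"fewer_rats": 10, "many_tails": 1}


def cooperative_listener(utterance):
    """Posterior over goals given the utterance, by Bayes' rule."""
    unnormalized = {g: PRIOR[g] * LIKELIHOOD[g][utterance] for g in GOALS}
    total = sum(unnormalized.values())
    return {g: p / total for g, p in unnormalized.items()}


def genie_listener(utterance, cost):
    """Ignore the prior; choose the cheapest goal the utterance could possibly mean."""
    compatible = [g for g in GOALS if LIKELIHOOD[g][utterance] > 0]
    return min(compatible, key=lambda g: cost[g])


for utterance in ["kill all the rats", "bring me rat tails"]:
    print(utterance, cooperative_listener(utterance), genie_listener(utterance, COST))
```

Under these assumed numbers, the cooperative listener assigns most of the posterior to the rat-reduction goal for both utterances, while the genie-like listener lands on the self-serving reading for both. That mirrors the point in the text: the "proxy" utterance is not special, because either utterance can be understood or misunderstood.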

Our differing analysis of intentional misunderstandings is important for at least two reasons: First, it moves the focus away from illegibility and prediction, especially in communicating goals to machines. People who supposedly convey a “proxy goal” to a machine do not experience failure because they can't predict all the ways in which their proxy goal might have unintended consequences. Rather, they experience failure because they didn't even realize they were conveying a different goal in the first place (many of the examples in Krakovna, 2020, are like this). An engineer evaluating a loss function infers the goal behind it, because that is how human communication works, but most machines currently aren't built to run an inference process from a loss function to an intended goal. This brings us to a second reason the differing analysis is important: It suggests currently unexplored remedies, at least for some cases. Telling a genie (or child, or lawyer) a goal, and then tacking on a long list of caveats will not stop them if they are determined to misunderstand: Every caveat is an opportunity for another loophole (in line with the “proxy treadmill”). By contrast, highlighting common ground in order to specifically rule out loopholes is useful, for example, telling a child “can you do your homework?” and following it with “you know what I mean” to avoid the tired “yes, I can.” Highlighting common ground would not be useful for a machine that is optimizing a given loss function rather than engaging in human-like communication.
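The contrast between optimizing a stated loss and inferring the goal behind it can be illustrated with a small, hypothetical example. The sketch below is not a method proposed in this commentary or in John et al.; the three actions, their outcomes, and the belief over goals are all invented for illustration, and the belief is simply assumed rather than learned. One agent minimizes the loss exactly as written ("more tails is better"); the other treats the loss as evidence about an intended goal and minimizes expected loss under that belief.

```python
# Toy sketch: literal loss minimization versus acting on an inferred goal.
# Actions, outcomes, and beliefs are hypothetical assumptions.

# Imagined outcomes in a city that starts with 100 rats.
ACTIONS = {
    "hunt_rats": {"tails": 60, "rats_left": 40},
    "cut_tails_release": {"tails": 100, "rats_left": 100},
    "breed_rats": {"tails": 500, "rats_left": 300},
}


def stated_loss(outcome):
    """The loss as literally written: delivering more tails is better."""
    return -outcome["tails"]


def literal_optimizer(actions):
    """Pick the action that minimizes the stated loss, with no inference."""
    return min(actions, key=lambda a: stated_loss(actions[a]))


# Assumed belief about what the designer meant by a tail-counting loss.
GOAL_BELIEF = {"fewer_rats": 0.9, "more_tails": 0.1}

# One loss per candidate goal (units are arbitrary in this toy example).
GOAL_LOSS = {
    "fewer_rats": lambda outcome: outcome["rats_left"],
    "more_tails": lambda outcome: -outcome["tails"],
}


def goal_inferring_agent(actions):
    """Pick the action that minimizes expected loss under the belief over goals."""
    def expected_loss(action):
        outcome = actions[action]
        return sum(p * GOAL_LOSS[g](outcome) for g, p in GOAL_BELIEF.items())
    return min(actions, key=expected_loss)


print("literal:", literal_optimizer(ACTIONS))
print("goal-inferring:", goal_inferring_agent(ACTIONS))
```

With these made-up numbers, the literal optimizer chooses to breed rats, while the goal-inferring agent chooses to hunt them. The difference comes entirely from the assumed belief over goals, which is the ingredient the text argues most current machines lack.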

And what of our genie? They forgot you even asked, just like you wanted. And so, they offer you one wish. No more, no less.

Acknowledgment

We acknowledge GU for unintentionally providing some of the examples of intentional misunderstandings.

Financial support

T. D. U. is supported by a Jacobs Foundation Fellowship, and this commentary makes reference to work supported by an MIT Simons Center for the Social Brain Postdoctoral Fellowship (S. B.), and an NSF Science of Learning and Augmented Intelligence Grant 2118103 (T. D. U. and S. B.).

Competing interest

None.

References

Bridgers, S. E. C., Taliaferro, M., Parece, K., Schulz, L., & Ullman, T. (2023). Loopholes: A window into value alignment and the communication of meaning. PsyArXiv.

Da Silva, S. G., & Tehrani, J. J. (2016). Comparative phylogenetic analyses uncover the ancient roots of Indo-European folktales. Royal Society Open Science, 3(1), 150645.

Goodman, N. D., & Frank, M. C. (2016). Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11), 818–829.

Hannikainen, I. R., Tobia, K. P., de Almeida, G. D. F., Struchiner, N., Kneer, M., Bystranowski, P., … Żuradzki, T. (2022). Coordination and expertise foster legal textualism. Proceedings of the National Academy of Sciences of the United States of America, 119(44), e2206531119.

Isenbergh, J. (1982). Musings on form and substance in taxation. HeinOnline.

Katz, L. (2010). A theory of loopholes. The Journal of Legal Studies, 39(1), 1–31.

Krakovna, V. (2020). Specification gaming examples in AI – Master list. http://bit.ly/kravokna_examples_list (accessed: 2020-12-28).

Opie, I. A., & Opie, P. (2001). The lore and language of schoolchildren. New York Review of Books.

Scott, J. C. (1985). Weapons of the weak: Everyday forms of peasant resistance. Yale University Press.

Uther, H.-J. (2004). The types of international folktales – A classification and bibliography. Suomalainen Tiedeakatemia Academia Scientiarum Fennica Exchange Centre.