1 Introduction
Thomas Bayes was an eighteenth-century minister and mathematician who passed his life in relative obscurity. Upon his death in 1761, his friend Richard Price found among his papers a document entitled “An Essay Towards Solving a Problem in the Doctrine of Chances.” Price, recognizing the essay’s immense significance, saw to its posthumous publication (Bayes, 1763). Bayes’s insights gave birth to what is now known as Bayesian decision theory: a mathematical framework that models reasoning and decision-making under uncertain conditions. Named after Bayes due to his founding insights, the framework was first systematically articulated by Pierre-Simon Laplace (1814/1902). Despite frequent vicissitudes in development, reception, and application, the framework attracted increasingly many adherents beginning in the early twentieth century and accelerating as the century progressed (McGrayne, 2011). It currently enjoys great popularity, finding widespread use within statistics (Berger, 1985; Gelman et al., 2014), philosophy (Earman, 1992), machine learning (Murphy, 2023), robotics (Thrun, Burgard & Fox, 2005), physics (Trotta, 2008), medical science (Ashby, 2006), and myriad other disciplines.
Bayesian decision theory originated as a theory of how people should operate, not a theory of how they actually operate. Nevertheless, cognitive scientists increasingly use it to describe the actual workings of the human mind. Over the past few decades, cognitive science has produced impressive Bayesian models of mental activity. The models postulate that certain mental processes conform, or approximately conform, to Bayesian norms. Bayesian models offered within cognitive science have illuminated numerous mental phenomena, such as perception, motor control, and navigation.
This Element has a two-fold purpose. First, it provides a self-contained introduction to the foundations of Bayesian cognitive science. Second, it explores what we can learn about the mind from Bayesian models offered by cognitive scientists.
On the second front, my main concern is how Bayesian cognitive science relates to mental representation. Just as the heart serves to pump blood and the stomach serves to digest food, one of the mind’s principal functions is to represent the world. For instance, I have various beliefs about Napoleon: that he was born in Corsica, that he was an emperor, and so on. Thus, the mind somehow reaches beyond itself to represent external reality. In that sense, the mind is a representational organ. Historically, most philosophers have agreed that the mind’s representational capacity is among its key features. However, prominent scientists and philosophers throughout the past century have questioned whether representation deserves any place in the science of the mind. As a result, controversy continues to fester over the explanatory value of mental representation. Representationalists such as Burge (2010; 2022), Fodor (1975; 1987; 2008), Peacocke (1994; 1999), Pylyshyn (1984), and Shea (2018) insist that mental representation plays a vital role within the scientific explanation of various core mental phenomena. Anti-representationalists as varied as Chemero (2009), Churchland (1981), Field (2001), Quine (1960), Ramsey (2007), Stich (1983), and van Gelder (1992) reject this position.
I will argue that Bayesian cognitive science assigns mental representation a central explanatory role. Bayesian models of perception, motor control, navigation, and other core mental activities posit representational mental states. Explanations supplied by the models characterize both explananda and explanantia in thoroughly representational terms. So Bayesian cognitive science presupposes the traditional picture of the mind as a representational organ. It invests that picture with unprecedented empirical substance through well-confirmed, mathematically rigorous models.
Sections 2 and 3 present key elements of Bayesian decision theory. Section 4 surveys how cognitive scientists use the Bayesian framework to model mental activity. Section 5 articulates a realist stance towards Bayesian models of the mind: when a Bayesian model is explanatorily successful, we have good reason to believe that the model describes actual mental states and processes with at least approximate accuracy. Sections 6 and 7 argue that representational properties of mental states figure crucially in explanations provided by Bayesian cognitive science. My conclusion: Bayesian modeling supports a representationalist perspective on the mind.
My exposition contains more mathematics than most writings on philosophy of mind. The technical content reflects my conviction that fully understanding mental representation requires familiarity with the mathematical language used by scientists to study mental representation. I hope that this Element will help some readers achieve the requisite familiarity and will promote greater appreciation for the benefits that such familiarity affords. To keep the text as accessible as possible, I have confined many technical details to the Appendix.
2 The Probability Calculus
The core notion of Bayesian decision theory is credence, or subjective probability—a quantitative measure of the degree to which an agent believes a hypothesis. I may have low credence that a meteor shower occurred five days ago, higher credence that Seabiscuit will win the race tomorrow, and even higher credence that Napoleon was born in Corsica. An agent’s credence in hypothesis H is notated as P(H). Credences are psychological facets of the individual agent, not objective chances or frequencies out in the world. The agent’s credences need not track any objective probabilities that inhere in mind-independent reality. To illustrate, suppose that a biased coin has objective chance 0.3 of landing heads. I may mistakenly believe that the coin is fair and therefore assign subjective probability 0.5 to the hypothesis that it will land heads. Then my credence departs dramatically from the objective chance of heads.
What is it to attach a credence to a hypothesis? What does it mean for an agent to set P(H) = x as opposed to P(H) = y ≠ x? Beginning with Ramsey (1931) and de Finetti (1937/1980), many authors have tried to answer these questions (Eriksson & Hájek, 2007). In practice, contemporary Bayesians usually leave the questions unanswered. They take the notion of credence as primitive, without providing noncircular necessary and sufficient conditions for an agent to attach a credence to a hypothesis. This is the strategy pursued within Bayesian cognitive science, and it is the strategy I will pursue.
Bayesian decision theory was given a secure mathematical grounding by Kolmogorov (1933/1956), who articulated axioms for probability in his landmark Foundations of the Theory of Probability. The axioms are not specific to subjective probability; they apply equally to objective probability. Section 2 expounds basic aspects of Kolmogorov’s axiomatization, which is sometimes called the probability calculus. Section 3 discusses how Bayesians use the probability calculus to model uncertainty (Footnote 1).
2.1 Sets of Outcomes
Kolmogorov’s axiomatization uses set theory as a basis for probability theory. The central notion of set theory is membership:

ω ∈ A,

meaning that ω is a member of set A. We also say that ω belongs to A.
In Kolmogorov’s axiomatization, probabilities attach to sets of outcomes drawn from an outcome space Ω. To illustrate, suppose we want to model probabilities over the result of a player rolling a six-sided die. We may take the outcome space to be

Ω = {1, 2, 3, 4, 5, 6},

that is, the set containing elements 1, 2, 3, 4, 5, and 6. The hypothesis that the player rolls an even number corresponds to the set

{2, 4, 6}.
Similarly, suppose we seek to define probabilities over possible results of a horse race. We can specify an outcome by describing the order in which the horses finish. Ω contains each such outcome. The hypothesis that Seabiscuit wins the race corresponds to the set

{ω ∈ Ω : Seabiscuit finishes first in ω},

that is, the set of outcomes in which Seabiscuit finishes before every other horse.
Philosophers commonly assume that probabilities attach to propositions. In the scientific and mathematical literature, one rarely finds any appeal to propositions. Instead, researchers follow Kolmogorov in attaching probabilities to sets. Under certain assumptions, one can recapture talk about “propositions” within Kolmogorov’s setting. One can treat Ω as containing possible worlds, and one can analyze propositions as sets of possible worlds (Stalnaker, 1984). These assumptions are not mandated by Kolmogorov’s axiomatization. For example, the simple outcome space Ω = {1, 2, 3, 4, 5, 6} is allowed by Kolmogorov’s axiomatization, even though its elements are not possible worlds.
When probabilities attach to sets of outcomes, elementary set-theoretic operations mimic the propositional operations negation, conjunction, and disjunction:
Negation corresponds to complementation. The complement of set A is the set containing all elements that are in Ω but not in A. See Figure 1. The hypothesis that the player rolls 1 is the set

{1},

while the hypothesis that the player does not roll 1 is its complement

{2, 3, 4, 5, 6}.

Conjunction corresponds to intersection. The intersection of A and B is the set containing all elements that are in both A and B. The intersection is written as A ∩ B. See Figure 2. The hypothesis that the player rolls an even number and the player rolls a number greater than 3 is the intersection

{2, 4, 6} ∩ {4, 5, 6} = {4, 6}.

Disjunction corresponds to union. The union of A and B is the set containing all elements that are in A or B. The union is written as A ∪ B. See Figure 3. The hypothesis that the player rolls 1 or the player rolls 4 is the union

{1} ∪ {4} = {1, 4}.
By iteratively applying set-theoretic operations, Kolmogorov replicates the formation of logically complex sentences or propositions.
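To make the parallel concrete, here is a small illustration of my own (using Python’s built-in set type purely for exposition) that manipulates the die-roll events with exactly these three operations:

```python
# Outcome space for a roll of a six-sided die.
omega = {1, 2, 3, 4, 5, 6}

even = {2, 4, 6}             # the player rolls an even number
greater_than_3 = {4, 5, 6}   # the player rolls a number greater than 3

# Negation ~ complementation: the player does not roll 1.
not_one = omega - {1}              # {2, 3, 4, 5, 6}

# Conjunction ~ intersection: even AND greater than 3.
both = even & greater_than_3       # {4, 6}

# Disjunction ~ union: the player rolls 1 or rolls 4.
either = {1} | {4}                 # {1, 4}

print(not_one, both, either)
```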
In simple applications, such as a die roll or horse race, the outcome space Ω is finite. Many applications require Ω to be infinite. For example, consider an asteroid’s speed as it enters our solar system. There are infinitely many possible asteroid speeds. If we want to model probabilities over possible asteroid speeds, we need an infinite outcome space.
2.2 Axioms of the Probability Calculus
In probability theory, sets of outcomes are called events. The probability calculus contains three axioms that govern the assignment of probabilities to events:
Axiom 1: Probabilities are real numbers between 0 and 1,
where a real number is any number that can be expressed as a decimal. As applied to subjective probability, Axiom 1 sets a scale for degrees of belief. 1 is the maximal possible degree of belief. 0 is the minimum. When an agent assigns probability 1 to an event, we say that the agent is certain of the event.
Axiom 2: P(Ω) = 1.

Intuitively: Ω exhausts all relevant possibilities, so it must receive maximal degree of belief.

Axiom 3: Additivity.

To elucidate additivity, suppose that A and B are disjoint events. For example, let A be the hypothesis that Seabiscuit wins the race and B the hypothesis that War Admiral wins the race. Consider the union A ∪ B: the hypothesis that Seabiscuit wins the race or War Admiral wins the race. Additivity requires that:

P(A ∪ B) = P(A) + P(B).
In general, the probability that either of two disjoint events occurs is found by adding together the probabilities assigned to the individual events. See Figure 4. As discussed in Section A2, Kolmogorov ultimately uses a somewhat stronger version of additivity than I have articulated here.
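For concreteness, the following sketch (my own illustrative code, not part of the original exposition) checks that a uniform credence over the fair-die outcome space satisfies Axioms 1–3, computing the probability of a finite event by summing the probabilities of its members:

```python
from itertools import chain, combinations

omega = {1, 2, 3, 4, 5, 6}
atom_prob = {w: 1 / 6 for w in omega}   # uniform credence over outcomes

def P(event):
    """Probability of an event (a subset of omega): sum of its members' probabilities."""
    return sum(atom_prob[w] for w in event)

# Axiom 1: every probability lies between 0 and 1.
events = [set(s) for s in chain.from_iterable(combinations(omega, r) for r in range(7))]
assert all(0 <= P(A) <= 1 for A in events)

# Axiom 2: the whole outcome space receives probability 1.
assert abs(P(omega) - 1) < 1e-12

# Axiom 3 (additivity): for disjoint events, P(A ∪ B) = P(A) + P(B).
A, B = {1, 2}, {5, 6}
assert abs(P(A | B) - (P(A) + P(B))) < 1e-12
```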
Axioms 1–3 can be applied to objective probabilities or to subjective probabilities. Applied to objective probabilities, they are construed as constraints that probabilities do in fact satisfy. Applied to subjective probabilities, they are construed as constraints that probabilities should satisfy: an agent is rational to the extent that her credences satisfy the axioms (Footnote 2).
The core tenet of Bayesian decision theory is that credences should conform to the probability calculus axioms. Since Bayesians advance the probability calculus axioms as normative constraints, we may ask why these particular axioms are supposed to be rationally privileged. Why is someone who conforms to the axioms rationally superior to someone who violates them? A large literature, stretching back to Ramsey (1931) and de Finetti (1937/1980), seeks to answer this question by providing a foundational justification for the probability calculus axioms (Easwaran, 2011a; Pettigrew, 2019; Pettigrew, 2020; Weisberg, 2009). For present purposes, I simply assume that the probability calculus axioms are rational constraints on credence.
From a mathematical perspective, we regard Axioms 1–3 as constraints on a function P that maps each event H to a real number P(H). When P satisfies all three constraints, it is called a probability distribution or a probability measure (Footnote 3).
2.3 Random Variables
Probability theory assigns a central role to random variables. Intuitively, a random variable uses real numbers to model a specific aspect of a probabilistic situation. To illustrate, suppose that the outcome space Ω contains possible worlds in which an asteroid is hurtling towards Earth. Let X be a function that carries each possible world to the asteroid’s speed in that world as the asteroid enters our solar system (where speed is measured using canonical units, such as meters/sec). So

X(ω) = x

means that the asteroid has speed x in world ω as it enters our solar system. X is a function from Ω (a set of possible worlds) to ℝ (the set of real numbers). More generally, suppose we have an outcome space Ω. A random variable is a function that carries each outcome ω to a real number x:

X(ω) = x.
A rigorous definition of “random variable” is given in Section A3, but for present purposes we may operate at a more intuitive level.
We can use a random variable X to define various events of interest. Continuing the asteroid example, take the hypothesis that the asteroid’s speed falls between a and b. To codify the hypothesis more formally, our first step is to consider the interval [a, b]. See Figure 5. Our second step is to collect together all the possible worlds mapped by X into that interval. In other words, we consider the set of possible worlds ω such that a ≤ X(ω) ≤ b:

{ω : a ≤ X(ω) ≤ b}.

This set is notated as {a ≤ X ≤ b}. It contains those possible worlds where the asteroid’s speed falls between a and b, so it codifies the hypothesis that the asteroid’s speed falls between a and b. More generally, given a random variable X defined on outcome space Ω, {a ≤ X ≤ b} codifies the hypothesis that X’s value falls between a and b. See Figure 6.
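On a finite toy outcome space (a stand-in of my own devising for the possible-worlds space; the speeds are invented), the construction can be written out directly: a random variable is just a function on outcomes, and the event {a ≤ X ≤ b} is the set of outcomes the function maps into [a, b]:

```python
# Toy outcome space: four "worlds", each with a stipulated asteroid speed (meters/sec).
omega = ["w1", "w2", "w3", "w4"]
speed = {"w1": 12000.0, "w2": 18500.0, "w3": 21000.0, "w4": 30000.0}

def X(w):
    """Random variable: maps each world to the asteroid's speed in that world."""
    return speed[w]

def event_between(a, b):
    """The event {ω : a ≤ X(ω) ≤ b}: the set of worlds whose speed lies in [a, b]."""
    return {w for w in omega if a <= X(w) <= b}

print(event_between(15000, 25000))   # {'w2', 'w3'}
```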
As a second illustration, consider the asteroid’s position when it hits the earth’s surface. We can describe asteroid position using an ordered pair drawn from a canonical coordinate system (e.g. longitude and latitude). We now want a function X that maps each possible world ω to an x-coordinate and a second function Y that maps ω to a y-coordinate. The conjunction

X(ω) = x and Y(ω) = y

means that the asteroid lands at location (x, y) in possible world ω. Taken together, X and Y map Ω (a set of possible worlds) into ℝ² (the set of ordered pairs of real numbers). We may use X and Y to define various events of interest. For example, consider the rectangle depicted in Figure 7. Call this rectangle REC. We would like to codify the hypothesis that the asteroid lands within REC. To do so, we collect together all the possible worlds where the asteroid lands within REC. In other words, we consider the set of possible worlds ω such that (X(ω), Y(ω)) belongs to REC:

{ω : (X(ω), Y(ω)) ∈ REC}.

This set contains exactly those possible worlds where the asteroid lands within REC, so it codifies the hypothesis that the asteroid lands within REC. See Figure 8.
Random variables are tremendously useful in probability theory. The underlying outcome space Ω is often hard to describe or otherwise resistant to direct mathematical analysis. In particular, it is not easy to define probabilities directly over sets of possible worlds. A random variable shifts attention from Ω to a friendlier outcome space, such as ℝ or ℝ², greatly augmenting our expressive and analytic power. I will illustrate in the next section.
2.4 Probability Density
Suppose we take ℝ as the outcome space, so that probabilities attach to sets of real numbers. ℝ is a natural choice when we are modeling a variable that takes real numbers as values. For example, if X is a random variable that models asteroid speed, then the probability assigned to the interval [a, b] is the probability that the asteroid’s speed falls between a and b.

It is often possible to specify a probability distribution over sets of real numbers using a probability density function. A probability density function (pdf) is a nonnegative function over ℝ such that the total area under the curve is 1. Figure 9 illustrates with a sample pdf p(x). When you see an image like Figure 9, it is vital to remember that the numbers on the vertical axis are not probabilities. They are probability densities. Probabilities are determined by probability densities as follows: the probability assigned to an interval [a, b] is the area under p(x) stretching from a to b. In this manner, the pdf (a function from real numbers to probability densities) determines a probability distribution (a function from sets of real numbers to probabilities).
The most famous example of probability density is the class of Gaussian distributions, also known as Normal distributions. The pdf for a Gaussian distribution has the familiar shape of a “bell curve.” A Gaussian pdf is completely described by two parameters: its mean and its variance (a measure of how “spread out” the curve is from the mean). See Figures 10 and 11. Many variables encountered in nature are well-described, at least approximately, using a Gaussian pdf.
Bayesian cognitive scientists tend to be cavalier about the distinction between probability and probability density. My own previous writings have also treated the distinction quite sloppily. Nevertheless, the distinction is an important one:
Probabilities are assigned to sets whose members belong to an outcome space Ω. Probability densities are assigned to real numbers.
The probability assigned to an event is at most 1. In contrast, probability density may be much greater than 1. A pdf can attain very high values, so long as total area under the curve is 1.
As is customary in the literature, I notate probability using an upper case P and probability density using a lower case p.
To see the distinction between probability and probability density in action, consider a probability distribution P with a Gaussian pdf p(x). p(x) assigns densities to individual real numbers. P assigns probabilities to sets of real numbers: the probability assigned to interval [a, b] is the area under p(x) stretching from a to b. For every real number s, we have

p(s) > 0.

What about the probability assigned to {s}, that is, the set whose sole member is s? It is not hard to show that

P({s}) = 0.

Intuitively: the probability assigned to {s} is the area under p(x) stretching from s to s, and that area is simply 0. Thus, the probability density assigned to an individual point s differs from the probability assigned to the event {s}. Note that, even though each individual event {s} receives probability 0, we nevertheless have

P([a, b]) > 0

when a < b. This may at first seem surprising, but it does not violate the probability calculus axioms. The axioms allow each event {s} to receive probability 0 even while [a, b] receives positive probability.
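The contrast can be verified numerically. The snippet below (my own illustration, assuming SciPy is available) uses a Gaussian with small variance: its density at the mean exceeds 1, every singleton receives probability 0, and an interval receives the area under the pdf between its endpoints:

```python
from scipy.stats import norm

dist = norm(loc=0.0, scale=0.1)   # Gaussian with mean 0 and standard deviation 0.1

# Probability density at the mean: roughly 3.99, far greater than 1.
print(dist.pdf(0.0))

# Probability of the singleton {0}: the area from 0 to 0, i.e. 0.
print(dist.cdf(0.0) - dist.cdf(0.0))

# Probability of the interval [-0.1, 0.1]: the area under the pdf, about 0.68.
print(dist.cdf(0.1) - dist.cdf(-0.1))
```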
The notion of pdf generalizes to ℝ². In the two-dimensional case, a probability distribution assigns probabilities to sets containing ordered pairs (x, y). For example, suppose we are modeling the asteroid’s position when it hits the earth’s surface. The probability distribution assigns a probability to each rectangle: this is the probability that the asteroid’s position falls within that rectangle. In the two-dimensional case, a pdf is a nonnegative function over ℝ² such that the total volume under the curve is 1. The probability assigned to a region is the volume under the curve in that region. See Figures 12, 13, and 14. A famous example is the class of two-dimensional Gaussian distributions, which generalize one-dimensional Gaussians to ℝ². See Figures 15 and 16. Once again, it is crucial to distinguish between probability and probability density. Probability densities attach to ordered pairs (x, y). Probabilities attach to sets of ordered pairs.
2.5 Conditional Probability
Conditional probabilities are fundamental to probability theory. Intuitively, the conditional probability P(A | B) is the probability of A given B. For example, we can consider the probability that Seabiscuit wins the race given that he is sick. In elementary applications, conditional probability is defined through the ratio formula:

P(A | B) = P(A ∩ B) / P(B).

See Figure 17. As Figure 17 illustrates, the unconditional probability of A may differ significantly from the probability of A given B.
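A quick worked example with the fair die (my own numbers): let A be “the player rolls an even number” and B “the player rolls a number greater than 3.” Then P(A | B) = P({4, 6}) / P({4, 5, 6}) = (2/6) / (3/6) = 2/3, whereas the unconditional probability is P(A) = 1/2. The same computation in code:

```python
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Uniform probability over a fair die: |event| / |omega|."""
    return len(event) / len(omega)

A = {2, 4, 6}   # even roll
B = {4, 5, 6}   # roll greater than 3

p_A_given_B = P(A & B) / P(B)   # ratio formula
print(P(A), p_A_given_B)        # 0.5 versus 0.666...
```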
The ratio formula is only well-defined when P(B) > 0. Yet scientific practice frequently requires conditional probabilities when P(B) = 0. For example, we might want conditional probabilities regarding how long an asteroid will take to reach Earth given that the asteroid has speed s when it enters the solar system. Suppose that our probability distribution P over asteroid speed has a pdf p(x). As indicated in Section 2.4, the probability assigned to the event {s} is 0:

P({s}) = 0.

Thus, we cannot use the ratio formula to define probabilities conditional on {s}. As this example illustrates, an adequate treatment must move beyond the ratio formula, delineating conditional probabilities for cases where P(B) = 0.
When P is given by a two-dimensional pdf, a fairly straightforward notion of conditional probability is available. Consider a two-dimensional pdf p(x, y), such as in Figure 12 or Figure 15. We can use p(x, y) to define a conditional density p(y | x). Intuitively, p(y | x) is a density over y conditional on X having value x. For each possible value x of the random variable X, the conditional pdf yields a one-dimensional pdf over y alone. Basically, p(y | x) is defined by holding x fixed in p(x, y) while allowing y to vary. The only hitch is that the area under the resulting curve may not be 1, while the definition of pdf requires the area under the curve to be 1. Hence, one must also divide by a normalization constant to ensure that probabilities sum to 1. Figures 18 and 19 illustrate using the pdf from Figure 12. To compute p(y | a), we hold X fixed at value a while allowing y to vary. The result is the cross-section curve depicted in Figure 18. To convert the cross-section curve into a pdf over y, we must divide by a normalization constant to ensure that area under the curve is 1. The normalized curve is depicted in Figure 19. Figures 18 and 19 also depict the same procedure for two other possible values b and c of X. Figures 20 and 21 depict the same procedure, this time applied to the pdf from Figure 15. See Section A6 for full mathematical details.
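Numerically, the recipe is: hold x fixed, read off the cross-section of p(x, y) as y varies, and divide by its area. A minimal grid-based sketch (my own illustration; the joint density is a made-up bell-shaped function, not the pdf from the figures):

```python
import numpy as np

# Grid over y, and a fixed value a for X.
y = np.linspace(-5.0, 5.0, 1001)
dy = y[1] - y[0]
a = 1.0

def joint_density(x, y):
    """A toy two-dimensional bell-shaped joint density p(x, y) (for illustration only)."""
    return np.exp(-0.5 * (x**2 + (y - 0.5 * x)**2)) / (2 * np.pi)

cross_section = joint_density(a, y)              # hold x fixed at a, let y vary
normalization = np.sum(cross_section) * dy       # area under the cross-section curve
conditional = cross_section / normalization      # p(y | a): area under the curve is now 1

print(np.sum(conditional) * dy)                  # approximately 1.0
```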
3 Bayesian Decision Theory
Bayesian decision theory studies an idealized agent who assigns credences to hypotheses. Bayesians claim that the agent’s credences should conform to the probability calculus axioms. Thus, the axioms figure as norms. Bayesians supplement the probability calculus axioms with two additional norms: Conditionalization, which governs how credences change in response to new evidence; and expected utility maximization, which governs how credences guide decision-making. I discuss Conditionalization in Sections 3.1–3.2 and expected utility maximization in Section 3.3.
3.1 Conditionalization
Credences evolve. If I learn that Seabiscuit is sick, then I should lower my credence that he will win the race. Intuitively, this is because I have a relatively low credence that Seabiscuit will win the race given that he is sick. More generally, suppose that I begin with credence P(H) and then learn E. To conditionalize on E is to replace my former credence P(H) with P(H | E). My old conditional credence P(H | E) becomes my new unconditional credence in H. P(H) is called the prior probability and P(H | E) is called the posterior probability. We may write

P_new(H) = P_old(H | E)

to signify that my new credence in H is equal to my old conditional credence in H given E.

The intuitive idea behind the rational norm Conditionalization is that, when I receive new evidence E, I should form new credences given by

P_new(H) = P_old(H | E).
There is considerable variation in how philosophers formulate Conditionalization, depending partly upon how they gloss “new evidence.” In Rescorla (2021b), I review some options and give my own preferred formulation. For present purposes, I remain as neutral as possible among alternative formulations. However exactly we formulate Conditionalization, it is a diachronic norm: it governs the evolution of credences over time. In contrast, the probability calculus axioms are purely synchronic: they govern credences at a moment of time. Note also that we must sharply distinguish between conditionalization the operation and Conditionalization the rational norm. The former is something an agent does: revise her credences a certain way. The latter is a rational norm that requires an agent to perform the operation in certain circumstances.
As with the probability calculus axioms, there is a large literature on why agents should conform to Conditionalization (Greaves & Wallace, 2006; Lewis, 1999; Rescorla, 2022; Skyrms, 1987; Weisberg, 2009). Why is someone who conforms to Conditionalization rationally superior to someone who violates it? Obviously, the answer may depend on how exactly one formulates Conditionalization. In what follows, I will simply assume that Conditionalization, formulated in some suitable way, is a rational constraint upon credal evolution.
I have focused thus far on Conditionalization in cases where P(E) > 0, so that the ratio formula applies. When P(E) = 0, the ratio formula is not well-defined. An agent who wants to conditionalize in such cases must look beyond the ratio formula for the needed conditional probabilities. For many applications, the theory of conditional densities suffices. To illustrate, suppose the agent begins with credences given by a pdf p(x, y). If she receives evidence that random variable X has value x, then she can conditionalize using the conditional density p(y | x). Her new credences over random variable Y are then determined by p(y | x). For example, suppose the agent begins with credences given by the pdf from Figure 12 and subsequently learns X’s value. If she learns that X has value a, then conditionalization leads her to new credences over Y depicted by the blue curve from Figure 19. If she instead learns that X has value b, then her new credences over Y are given by the orange curve. If she learns that X has value c, then her credences over Y are given by the green curve. In this manner, the theory of conditional densities helps us generalize Conditionalization beyond cases where P(E) > 0.
3.2 Bayes’s Theorem
Bayesian decision theory is so-called because it assigns a central role to a theorem first proved by Bayes. The theorem states that

P(H | E) = P(E | H) P(H) / P(E).   (1)

Equation (1) expresses the posterior probability P(H | E) in terms of the prior probability P(H) and the prior likelihood P(E | H). The denominator P(E) serves mainly as a normalization constant to ensure that probabilities sum to 1, so it is common to write the theorem as

P(H | E) = k P(E | H) P(H),   (2)

where k = 1/P(E). One can also write the theorem as:

P(H | E) ∝ P(E | H) P(H),

which highlights that the posterior is proportional to the prior times the prior likelihood:

posterior ∝ prior × likelihood.
Bayes’s theorem is extraordinarily useful. In many situations, there is a natural prior probability and a natural prior likelihood. The theorem then tells us how to compute the posterior from the priors. See Section A7 for a proof of Bayes’s Theorem.
Bayes’s theorem must be sharply distinguished from Conditionalization. Bayes’s theorem is a direct consequence of the probability calculus axioms and the ratio formula. As such, it is purely synchronic: it governs the relation between an agent’s current conditional and unconditional credences. In contrast, Conditionalization is a diachronic norm. It governs how the agent’s credences at an earlier time relate to her credences at a later time. Any agent who conforms to the probability calculus axioms also conforms to Bayes’s theorem, but an agent who conforms to the probability calculus axioms at each moment may violate Conditionalization. Thus, one cannot derive Conditionalization from Bayes’s theorem or from the probability calculus axioms. One must articulate Conditionalization as an additional constraint upon credal evolution.
That being said, Conditionalization and Bayes’s theorem work together beautifully. An agent who wants to conditionalize can use Bayes’s theorem to compute the posterior P(H | E) and then adopt P(H | E) as her new credence in H. Her new credence in H will be higher to the extent that she already assigned high credence to H and to the extent that H renders her new evidence E more likely (Footnote 4).
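A toy discrete calculation (my own example) shows the two working together. An agent has prior credence 0.01 in H, assigns likelihood 0.9 to evidence E given H and 0.05 to E given not-H, and conditionalizes on E by computing the posterior via Bayes’s theorem:

```python
prior_H = 0.01            # P(H)
lik_E_given_H = 0.90      # P(E | H)
lik_E_given_notH = 0.05   # P(E | not-H)

# Denominator of Bayes's theorem: P(E) = P(E|H)P(H) + P(E|not-H)P(not-H).
P_E = lik_E_given_H * prior_H + lik_E_given_notH * (1 - prior_H)

posterior_H = lik_E_given_H * prior_H / P_E   # P(H | E), the agent's new credence in H
print(round(posterior_H, 3))                  # about 0.154
```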
When P(E) = 0, (1) is not well-defined because the denominator is 0. Sometimes, though, a generalized analogue to (1) prevails. When a two-dimensional pdf p(x, y) exists, one can prove:

p(x | y) = k p(y | x) p(x),   (3)

where k is again a normalization constant. p(x) serves as a prior density: it codifies an agent’s initial credences over random variable X. p(y | x) is a density for random variable Y conditional on X having value x. p(x | y) is a density for X conditional on Y having value y: it serves as a posterior density. We may rewrite (3) as:

p(x | y) ∝ p(y | x) p(x).   (4)
This is the form of Bayes’s theorem most commonly used in scientific applications, including within Bayesian cognitive science.
We obtain a helpful visualization of (4) by holding y fixed and regarding p(y | x) as a function solely of x. Viewed in this way, p(y | x) is called the likelihood function or sometimes just the likelihood. Intuitively, the likelihood is an initial attempt at forming a probability density over x. The initial attempt takes into account evidence y but not the prior information encoded by p(x) (Footnote 5). Bayes’s theorem tells us how to combine the initial attempt with the prior p(x), yielding the posterior density p(x | y). Figures 22 and 23 illustrate. In both figures, the posterior is a compromise between the prior and the likelihood. In Figure 22, the likelihood is wide, so the posterior remains fairly close to the prior. In Figure 23, the likelihood is narrow, so it pulls the posterior far from the prior. For example, suppose that p(y | x) is the conditional density of measuring speed y given that the asteroid has speed x. Assuming noisy but unbiased measurement, the likelihood peaks at y. If measurements are very noisy, then the likelihood is wide (Figure 22), and the prior over asteroid speed exerts more influence on the posterior. If measurements are less noisy, then the likelihood is narrow (Figure 23), and the prior exerts less influence.
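The compromise can be reproduced with a small grid computation (my own illustration, loosely styled on the asteroid-speed example; all numbers are invented). With a wide likelihood the posterior stays near the prior; with a narrow likelihood it is pulled toward the measurement:

```python
import numpy as np

x = np.linspace(0.0, 100.0, 2001)          # candidate asteroid speeds (invented units)
dx = x[1] - x[0]

def gaussian(z, mean, sd):
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

prior = gaussian(x, mean=40.0, sd=5.0)     # prior density p(x) over speed
y = 60.0                                   # observed (noisy) speed measurement

for measurement_sd in (20.0, 2.0):         # wide likelihood, then narrow likelihood
    likelihood = gaussian(y, x, measurement_sd)       # p(y | x), viewed as a function of x
    posterior = prior * likelihood
    posterior /= np.sum(posterior) * dx               # normalize: area under the curve is 1
    posterior_mean = np.sum(x * posterior) * dx
    print(measurement_sd, round(posterior_mean, 1))   # about 41.2 (near the prior), then 57.2 (pulled toward 60)
```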
3.3 Expected Utility Maximization
The final key notion of Bayesian decision theory is utility: a numerical measure of how much an agent desires an outcome. According to Bayesians, agents should choose actions that maximize expected utility. The expected utility of action a is a weighted average of utilities assigned to possible outcomes, where the weights are probabilities contingent upon performance of a. There are protracted debates about how to formulate expected utility maximization more rigorously (Steele & Stefánsson, 2016). For present purposes, we may leave the notion at an intuitive level. Scientific applications often deal not with utility but instead with cost or loss. The goal is then not to maximize expected utility but to minimize expected cost. For most purposes, there is no substantive difference between a utility-based formulation and a cost-based formulation: one converts a utility function into a loss function by adding a minus sign, and vice versa.
In many statistical applications, the “action” is to estimate the value of a random variable. The standard procedure is to choose a utility function that favors selection of the true value and penalizes selection of other values. Often, selecting a best estimate will amount to selecting the mode of the posterior density, i.e. the value of x that maximizes the posterior density. There are also cases where the best estimate differs from the mode. If the utility function rewards estimates that are close to the true value but distinct from it, then the best estimate may be quite distant from the mode if enough probability mass lies away from the mode. Figure 24 illustrates: the mode is located in a region of relatively small probability mass; an estimator that values being close to the right answer will choose an estimate from the region of higher probability mass.
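The divergence between the mode and a loss-based estimate can be checked directly. In the sketch below (my own construction, using an invented bimodal posterior and squared-error loss), the mode sits on a narrow spike, while the estimate that minimizes expected loss lies in the broad region carrying most of the probability mass:

```python
import numpy as np

x = np.linspace(-2.0, 9.0, 2201)
dx = x[1] - x[0]

def gaussian(z, mean, sd):
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# An illustrative bimodal posterior: a tall, narrow spike plus a broad bump carrying most of the mass.
posterior = 0.2 * gaussian(x, 0.0, 0.05) + 0.8 * gaussian(x, 5.0, 1.0)
posterior /= np.sum(posterior) * dx

mode = x[np.argmax(posterior)]   # the value with the highest posterior density

# Expected squared-error loss of announcing each candidate estimate; then minimize.
expected_loss = [np.sum((est - x) ** 2 * posterior) * dx for est in x]
best_estimate = x[np.argmin(expected_loss)]

print(round(mode, 2), round(best_estimate, 2))   # mode near 0; loss-minimizing estimate near 4
```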
3.4 Implementation
Suppose we want a physical system (such as a computer or a robot) to implement Bayesian inference. Our first task is to decide how the system will encode credences. A major hurdle is that infinitely many distinct probabilities must often be encoded. For example, a pdf determines the probability assigned to each interval [a, b]. There are infinitely many such intervals. A finite physical system cannot explicitly enumerate the credence assigned to each interval. In other words, it cannot explicitly list each individual probability P([a, b]). After all, a finite physical system cannot explicitly list infinitely many distinct pieces of information. When credences cannot be explicitly enumerated, they must instead be implicitly encoded.
To illustrate implicit encoding, consider the class of Gaussian distributions. Look again at Figure 10. As noted in Section 2.4, a Gaussian distribution is completely described by two numbers: its mean and its variance. For that reason, a physical system can encode a Gaussian distribution by recording its mean m and its variance σ². This is an example of parametric encoding: the physical system encodes parameters that determine a probability distribution. The system does not explicitly enumerate the credence attaching to each interval [a, b]—that would be impossible. Instead, the system records two numbers (m and σ²) that determine the credence attached to each interval [a, b].
Parametric encoding is an option when the probability distribution is finitely parametrizable, which is often but not always the case. A more generally applicable encoding strategy features sampling. To illustrate, consider a physical system that draws samples stochastically from the outcome space Ω. There is an objective chance that the sampled outcome belongs to event A. We may summarize objective chances through a function:

C(A),

where C(A) is the objective chance that the physical system draws an outcome belonging to A. The key idea behind sampling encoding is that these objective chances can encode subjective probabilities (Icard, 2016). The subjective probability assigned to A is simply the objective chance that a sample belongs to A:

P(A) = C(A).

The system encodes subjective probabilities via the objective probabilities governing its sampling activity.
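A toy illustration of sampling encoding (my own code): the system below never stores P(A) explicitly; its credence is fixed by the chances governing the sampler and can be read off, approximately, from sampling frequencies:

```python
import random

def sample_outcome():
    """Draw a die outcome with the chances that encode the system's credences (biased toward 6 here)."""
    return random.choices([1, 2, 3, 4, 5, 6], weights=[1, 1, 1, 1, 1, 5])[0]

A = {2, 4, 6}   # the event of interest: an even roll

# The subjective probability of A is the chance that a sample lands in A;
# the frequency over many samples approximates that chance.
n = 100_000
frequency = sum(sample_outcome() in A for _ in range(n)) / n
print(frequency)   # about 0.7 (= (1 + 1 + 5) / 10)
```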
Parametric and sampling encoding are widely used in statistics (Gelman et al., 2014), machine learning (Murphy, 2023), and other fields that employ the Bayesian framework.
The next crucial task is to address computation of the posterior. In some special cases, it is easy to compute the posterior from the priors. For example, when the prior probability and the likelihood are Gaussian, the posterior is also Gaussian, and its mean and variance are easily computable from those of the prior and the likelihood. Special cases aside, computing the posterior may require resources of time and memory beyond those available to a realistic agent (Kwisthout et al., 2011). Look again at Bayes’s theorem (2). Multiplying P(E | H) and P(H) is easy. The normalization constant k is another matter. It is possible in principle to compute k from the prior probability and the prior likelihood, but the computation requires evaluating a (potentially very long) sum of numbers (Footnote 6). In practice, it may be impossible to compute k exactly. A similar point applies to (3). Although k is in principle computable from p(x) and p(y | x), the computation may be impossible in practice.
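For the Gaussian special case just mentioned, the posterior is available in closed form, so no normalization constant needs to be computed explicitly. A sketch using the standard conjugate-update formulas (the function and variable names are mine):

```python
def gaussian_posterior(prior_mean, prior_var, measurement, measurement_var):
    """Posterior mean and variance for a Gaussian prior combined with one Gaussian measurement."""
    # Precisions (inverse variances) add; the posterior mean is a precision-weighted average.
    posterior_var = 1.0 / (1.0 / prior_var + 1.0 / measurement_var)
    posterior_mean = posterior_var * (prior_mean / prior_var + measurement / measurement_var)
    return posterior_mean, posterior_var

# Example: prior N(40, 25) over asteroid speed, noisy measurement 60 with variance 4.
print(gaussian_posterior(40.0, 25.0, 60.0, 4.0))   # mean ≈ 57.2, variance ≈ 3.45
```

The posterior mean lands between the prior mean and the measurement, weighted by their respective precisions, which is exactly the prior-likelihood compromise discussed in Section 3.2.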
A computation is tractable when it can be executed by a physical system with limited time and memory at its disposal. A computation is intractable when it is not tractable. These definitions can be made mathematically precise, but the present level of precision suffices for our purposes. The previous paragraph may be summarized as follows: computation of the posterior is not always tractable (Footnote 7).
The standard solution in Bayesian statistics is to find tractable algorithms that approximately implement Bayesian inference. Even when we cannot exactly compute the posterior, we can often come quite close—close enough for practical purposes. Even when we cannot conform to the normative ideal enshrined by Bayesian decision theory, we can often tractably approximate the normative ideal.
One popular approximation strategy is called Markov chain Monte Carlo (MCMC) (Murphy, 2023, pp. 493–536). MCMC algorithms use sampling to encode a credal assignment that approximates the posterior. An MCMC algorithm for approximating the posterior proceeds in discrete time stages:

t = 1, 2, 3, …

At each stage, a single sample is drawn. Sampling behavior at each stage is governed by an objective chance distribution. Thus, we have a sequence of objective chance distributions:

C_1, C_2, C_3, …

C_1(A) is the objective chance at time 1 of sampling an outcome that belongs to A. C_2(A) is the objective chance at time 2 of sampling an outcome that belongs to A. C_3(A) is the objective chance at time 3 of sampling an outcome that belongs to A. Objective chances evolve as the algorithm proceeds, converging asymptotically to the posterior: as the algorithm proceeds, C_t(A) grows ever closer to the posterior probability assigned to A. After enough time has passed, the system’s sampling behavior approximates the posterior quite well. See Figures 25 and 26. There are general convergence results ensuring that, in a wide range of cases, objective chances fairly quickly approach posterior probabilities (Brooks et al., 2011).
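A minimal Metropolis-Hastings sampler, one standard MCMC algorithm, illustrates the scheme (my own toy code, with an invented target): each stage draws one sample, the chances governing that draw depend on the previous sample, and long-run sampling frequencies approximate a posterior that is known only up to its normalization constant k:

```python
import random, math

def unnormalized_posterior(x):
    """Prior times likelihood, without the normalization constant k (a toy bell-shaped target)."""
    return math.exp(-0.5 * ((x - 3.0) / 1.5) ** 2)

samples = []
x = 0.0                                                # arbitrary starting point
for t in range(50_000):
    proposal = x + random.gauss(0.0, 1.0)              # propose a nearby value
    accept_prob = min(1.0, unnormalized_posterior(proposal) / unnormalized_posterior(x))
    if random.random() < accept_prob:                  # accept or reject stochastically
        x = proposal
    samples.append(x)

# After enough stages, sampling frequencies approximate posterior probabilities.
burned_in = samples[5_000:]
print(sum(s > 3.0 for s in burned_in) / len(burned_in))   # about 0.5 for this symmetric target
```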
4 Bayesian Cognitive Science
Bayesian decision theory studies how agents should reason and make decisions. Over the past few decades, cognitive scientists have increasingly used the Bayesian framework to describe actual mental activity (usually human, sometimes nonhuman). The core conjecture is that the mind copes with uncertainty by allocating credence over a hypothesis space. Credences evolve in response to sensory input, and they underwrite such tasks as estimation and decision-making. Credal activity conforms, at least approximately, to Bayesian norms.
Some Bayesian models posit exact Bayesian inference. Other models posit tractable approximations to the Bayesian ideal. I will discuss models of both kinds. I emphasize three domains where the Bayesian research program strikes me as particularly noteworthy: perception (Section 4.1), motor control (Section 4.2), and navigation (Section 4.3) (Footnote 8).
4.1 Perception
How does the perceptual system estimate distal conditions based upon proximal sensory input? For example, how does it estimate the shapes, sizes, and locations of nearby objects based upon retinal stimulations? Proximal sensory stimulations underdetermine distal conditions: numerous possible distal conditions can cause the same proximal stimulations. Moreover, sensory input is corrupted by noise during both transduction and transmission to the brain. Despite ambiguous and noisy sensory input, the perceptual system typically forms highly accurate estimates of distal conditions.
Helmholtz (1867/1925) proposed that the perceptual system estimates distal conditions through an unconscious inference. Bayesian perceptual psychology develops Helmholtz’s proposal, postulating unconscious Bayesian inferences executed by the perceptual system (Knill & Richards, 1996; Vilares & Kording, 2011; Rescorla, 2015a). A typical Bayesian model estimates a specific variable (e.g. shape) based on one or more proximal sensory cues (e.g. shading). The perceptual system starts with a prior probability over the distal variable and a prior likelihood that relates the distal variable to proximal sensory input. Upon receiving sensory input, the perceptual system computes the posterior (or an approximation to the posterior) over the distal variable. On that basis, the perceptual system forms a privileged estimate of distal conditions. In most Bayesian models, the estimate is chosen through expected utility maximization. In other models, the privileged estimate is chosen not deterministically but stochastically. For example, the model from Mamassian, Landy, and Maloney (2002) implements probability matching: estimates are chosen stochastically, with objective probability matching the posterior.
A simple example of the Bayesian approach concerns perceptual estimation of shape from shading. As Figure 27 illustrates, shading is an ambiguous cue to shape. In principle, the stimulus on the left could result from a convex object lit from overhead or a concave object lit from below. Despite the ambiguity, we perceive the stimulus on the left as convex and the stimulus on the right as concave. How does the perceptual system estimate shape based upon the ambiguous evidence provided by shading? The dominant theory in perceptual psychology has long been that the perceptual system somehow “assumes” that light comes from overhead rather than below (Rittenhouse, 1786). This theory translates naturally into a Bayesian setting. On a Bayesian approach, the perceptual system estimates shape based on a prior over shapes, a prior over lighting directions, and a prior likelihood that assigns a probability to a given shading pattern conditional on the stimulus having a given shape and the light coming from a given direction (Stone, 2011). The prior over lighting directions favors overhead lighting directions. Consequently, the posterior favors the convex interpretation of the left-hand stimulus from Figure 27.
Bayesian models often posit that, when the perceptual system estimates the value of distal variable X, the prior over X has a pdf p(x). Models often also posit that the prior likelihood for sensory variable Y given X has a conditional density p(y | x). Upon receiving sensory input y, the perceptual system forms new credences determined by a density p_new(x). In some models, new credences are given by the posterior density:

p_new(x) = p(x | y).

In other models, new credences only approximate the posterior:

p_new(x) ≈ p(x | y).

Based on p_new(x), the perceptual system selects an estimate x* of X’s value. See Figure 28.
The motion estimation model given by Weiss, Simoncelli, and Adelson (2002) is a good example of Bayesian perceptual psychology’s explanatory power. The model estimates the velocity of a moving stimulus. The model posits a prior density p(v) over velocities. Crucially, the prior favors slow speeds. This reflects the environmental regularity that objects usually move fairly slowly. The model also posits a likelihood p(I | v), where I measures light intensity over the retina. Upon receiving input I, the perceptual system computes the posterior p(v | I) and on that basis forms a privileged velocity estimate v*. The model explains an array of illusions that had previously resisted unified explanation. For example, it explains why low contrast stimuli seem to move slower than high contrast stimuli: low contrast stimuli yield a wide likelihood, so the “slow speed” prior exerts more influence over the posterior. See Figure 29. As this example illustrates, Bayesian perceptual models can often explain perceptual phenomena that otherwise elude satisfying explanation.
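The qualitative effect can be reproduced with a one-dimensional caricature (my own simplification, not the Weiss, Simoncelli, and Adelson model itself): a zero-centered “slow speed” Gaussian prior combined with a Gaussian likelihood whose noise grows as contrast falls yields lower speed estimates at low contrast:

```python
def posterior_mean_speed(measured_speed, measurement_var, prior_var=4.0):
    """Posterior mean under a zero-mean 'slow speed' Gaussian prior and a Gaussian likelihood (toy model)."""
    posterior_var = 1.0 / (1.0 / prior_var + 1.0 / measurement_var)
    return posterior_var * (measured_speed / measurement_var)   # prior mean is 0, so only the likelihood term remains

true_speed = 10.0
# High contrast: low measurement noise. Low contrast: high measurement noise.
print(posterior_mean_speed(true_speed, measurement_var=1.0))   # about 8.0: close to the measured speed
print(posterior_mean_speed(true_speed, measurement_var=8.0))   # about 3.3: strongly pulled toward the slow prior
```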
Subsequent research has further illuminated the “slow speed” prior and its crucial role in motion perception (e.g. Stocker & Simoncelli, 2006). In a particularly notable contribution, Kwon, Tadin, and Knill (2015) generalized the “slow speed” prior to construct a highly successful model of object-tracking. For further discussion of the motion estimation model, see Rescorla (2015a; 2018b). For further discussion of the object-tracking model, see Rescorla (2020c).
Another successful application of Bayesian perceptual modeling is cue combination. The perceptual system typically estimates a single distal variable based on multiple cues, such as visual and haptic cues to size. Due to sensory noise, estimates based on distinct sensory cues will typically differ at least to a small degree. The perceptual system must combine distinct sensory cues into a single unified estimate of the distal variable. Ernst and Banks (2002) showed that the Bayesian framework can successfully model combination of visual and haptic cues to size. Researchers have subsequently generalized this finding to numerous other cases of cue combination within and across modalities (Trommershäuser, Kording & Landy, 2011). See Rescorla (2020b) for further discussion of cue combination in a Bayesian setting.
Bayesian perceptual inference is subpersonal and inaccessible to conscious introspection or control. These inferences are executed by the perceptual system, not by the perceiver. A typical perceiver is not aware that her perceptual system uses a “slow speed” prior. The perceptual system, not the perceiver, encodes and deploys the prior. The perceiver is not consciously aware of any inference based on the prior.
Perceptual priors are highly mutable, changing rapidly in response to altered environmental statistics. Adams, Graf, and Ernst (2004) exposed subjects to deviant visual-haptic input indicating an altered lighting direction. In response, shape perception and lightness perception rapidly changed, reflecting a change in the “light from overhead” prior. Similarly, the “slow speed” prior rapidly changes in response to fast-moving stimuli (Sotiropoulos, Seitz & Seriès, 2011). There is also evidence that prior likelihoods are mutable (Sato & Kording, 2014; Sato, Toyoizumi & Aihara, 2007; Seydell, Knill & Trommershäuser, 2010). Changing priors can themselves be modeled in Bayesian terms (Kwon & Knill, 2013).
As a final illustration of Bayesian perceptual psychology’s explanatory power, consider central tendency bias: perceptual estimates of a magnitude are biased towards the mean of the sample distribution (Hollingworth, 1910). Relatively large magnitudes tend to be underestimated, while relatively small magnitudes tend to be overestimated. Depending on the case, the sample distribution may arise naturally or may be experimentally imposed. Central tendency bias is a ubiquitous effect, arising when subjects estimate line length (Duffy et al., 2010), interval duration (Jazayeri & Shadlen, 2010), color (Olkkonen, McCarthy & Allred, 2014), and many other magnitudes. It is readily explicable from a Bayesian perspective. The key posit is that the prior adapts to match environmental statistics. For example, when the subject encounters stimuli drawn from an experimentally imposed sample distribution, the prior shifts to match that distribution. The shifted prior pulls estimates towards the prior mean. See Figure 30. Researchers have elaborated this intuitive idea into models that successfully explain central tendency bias for a number of perceptual tasks (Glasauer, 2019; Glasauer & Shi, 2022; Petzschner, Glasauer & Stephan, 2015). The models achieve a close fit with psychophysical data, including detailed patterns governing the extent to which central tendency bias occurs in different situations.
In summary, Bayesian modeling has proved remarkably successful across a range of perceptual tasks. It amply deserves its orthodox status within contemporary perceptual psychology.
4.2 Motor Control
Suppose I form an intention to perform an action, such as lifting a coffee cup without spillage. My motor system must convert my intention into motor commands that promote fulfillment of my intention. As Bernstein (1967) emphasized, the motor system has multiple degrees of freedom when converting intentions into motor commands. For example, there are infinitely many possible hand trajectories through which I can lift the coffee cup without spillage. The motor system must select among these infinitely many options.
Sensorimotor psychology studies how the motor system selects motor commands that promote the agent’s goals. Over the past few decades, Bayesian models have achieved great explanatory success within sensorimotor psychology (Haith & Krakauer, 2013; Shadmehr & Mussa-Ivaldi, 2012). Optimal feedback control (OFC) models have proved especially successful (Todorov, 2004; Todorov & Jordan, 2002). OFC models have two core elements: an estimator, which uses conditionalization to estimate current environmental conditions (including bodily state); and a controller, which uses expected cost minimization to select suitable motor commands.
When the controller issues a motor command u, it sends an efference copy of the motor command back to the estimator. The efference copy serves as input to a forward model (Wolpert & Flanagan, 2009). Intuitively, the forward model reflects how bodily state will change due to motor commands. More rigorously, it encodes conditional densities p(x_{t+1} | x_t, u), where x_t is bodily state at time t, u is a motor command, and x_{t+1} is bodily state at time t+1. Using efference copy and the forward model, the estimator forms an initial probabilistic estimate of bodily state. Since motor execution is noisy, the initial estimate requires sensory correction. For example, an initial probabilistic estimate of hand position can be revised based upon visual and proprioceptive feedback regarding hand position. The estimator sequentially updates credences over environmental conditions based upon sequentially received efference copy and sensory feedback.
Throughout performance of the motor task, the controller uses updated credences to compute expected costs of possible motor commands. A cost function reflects the cost of motor command u assuming that outcome h is the true outcome. During a reaching task, h might specify hand position, hand velocity, and the target location. Typically, the cost function has two components. The first component, which is task-dependent, rewards achievement of the task goal (e.g. reaching the target). The second component, which is task-independent, penalizes energetic expenditure. At every stage, the controller selects a motor command that minimizes expected costs. See Figure 31.
OFC models of motor control have achieved great empirical success (McNamee & Wolpert, 2019). Most notably, OFC explains patterns in repeated performance of a task. When a subject repeatedly executes a task, the movement details vary across trials. As Bernstein (1967) first showed, and as subsequent research has amply confirmed, movement details vary more along task-irrelevant dimensions than task-relevant dimensions. The discrepancy between task-relevant variation and task-irrelevant variation is one of the most robust findings in sensorimotor psychology, surfacing in a huge range of motor tasks. The discrepancy is readily explicable within the OFC framework (Todorov & Jordan, 2002). Whenever bodily trajectory is perturbed (e.g. by noise or by an external influence), the controller must choose whether to correct the perturbation or leave it uncorrected. Correcting the perturbation expends energy, so an optimal controller will only correct perturbations that are task-relevant. As a result, deviations from the average trajectory accumulate along task-irrelevant dimensions but not task-relevant dimensions.
An experiment conducted by Nashed, Crevecoeur, and Scott (2012) nicely illustrates the contrasting response to task-relevant and task-irrelevant perturbations. Subjects reached quickly to a target: either a relatively small circle or else a relatively wide rectangle. In some trials, an external force disrupted the reaching motion. When the target was the circle, the external disruption was task-relevant, so the motor system corrected for it. When the target was the rectangle, the external disruption was task-irrelevant, so the motor system did not correct for it. See Figure 32 (Footnote 9).
Priors deployed during motor control are mutable (Berniker, Voss & Kording, 2010; Fernandes et al., 2014). Consider a study conducted by Kording and Wolpert (2004). Subjects reached to a visible target in a virtual reality setup. Finger position was hidden during the reaching motion, except that subjects received visual feedback on finger position midway through the motion. Apparent finger position was shifted from actual finger position, with the shift drawn randomly from a prior distribution (a Gaussian distribution for some subjects, a bimodal distribution for other subjects). The motor system learned the experimentally imposed prior (either the Gaussian prior or the bimodal prior) and used it to adjust finger trajectories based on visual feedback.
4.3 Navigation
Animal navigation has been intensively studied for many decades across several disciplines, including psychology, ethology, and neuroscience. At present, Bayesian modeling does not figure as prominently in the study of navigation as it does in perceptual psychology and sensorimotor psychology. Nevertheless, recent studies provide strong evidence that Bayesian inference plays a crucial role in human navigation.
I focus on a navigational strategy called dead reckoning. During dead reckoning, the navigator exploits self-motion cues to maintain a running estimate of her own position. Self-motion cues include optic flow, efference copy, vestibular signals, and so on. Dead reckoning is sometimes called “path integration,” because position is the integral of velocity. Dead reckoning pervades the animal kingdom (Gallistel, 1990, pp. 57–102), from the desert ant to humans.
A key fact about human dead reckoning is that, in many experimental conditions, subjects overshoot the target destination. Traditionally, overshooting was explained through a “leaky integrator” model (Lappe et al., 2011). The basic idea is that subjects imperfectly integrate velocity to compute position: rather than computing the true integral, subjects compute a slightly smaller quantity. As the distance traveled increases, “leaks” accumulate and the discrepancy between estimated position and true position increases. Lakshminarasimhan et al. (2018) offer an alternative Bayesian explanation. They posit a “slow speed” prior over self-motion. The “slow speed” prior biases estimated velocity below the true velocity, which leads the subject to underestimate distance traveled. See Figure 33.
The “slow speed” model explains several phenomena that the “leaky integrator” model does not. For example, Lakshminarasimhan et al. (2018) studied dead reckoning in a virtual reality setup. They manipulated the optic flow cue by altering the density of plane elements: greater density entails a more reliable cue. Decreased cue reliability corresponds to a relatively wide likelihood. The “slow speed” model predicts that, when the likelihood is wide, the posterior will be more strongly affected by the “slow speed” prior, causing even more overshooting. In contrast, the “leaky integrator” model does not predict that a degraded optic flow cue causes increased overshooting. See Figure 34. The human data exhibited more overshooting in response to the degraded optic flow cue, conforming closely to the “slow speed” model’s predictions.
Another striking phenomenon explained by the model: when the target is relatively distant, overshooting gives way to undershooting. The farther the subject travels, the greater the uncertainty regarding her position, so the wider her pdf over possible positions. When the pdf becomes quite wide, its area of overlap with the target decreases. As a result, expected utility peaks before the target when the target is relatively far away. For sufficiently large distances, this bias towards undershooting swamps the bias induced by the “slow speed” prior. The Bayesian model, by analyzing how these two biases interact with each other and with optic flow reliability, achieves a good match with actual human performance.
Central tendency bias provides additional evidence for a Bayesian approach to dead reckoning. Petzschner and Glasauer (Reference Petzschner and Glasauer2011) studied a virtual reality task in which subjects traversed an experimentally imposed linear path and then tried to reproduce their displacement. Subjects performed the task in multiple trials during each session. Distances during a session were drawn from one of three distinct sample distributions: small, medium, or large. Subjects exhibited significant central tendency bias: their distance estimates during a session (as gauged by reproduced distance) were biased towards the mean of the session sample distribution. To explain the bias, Petzschner and Glasauer (Reference Petzschner and Glasauer2011) offer an iterative Bayesian model. After each trial, the Bayesian estimator updates its prior over distance traveled. The prior gravitates towards the mean of the session sample distribution, biasing distance estimates towards that mean. The model thereby explains observed central tendency bias, further confirming the hypothesis that human dead reckoning relies upon Bayesian estimation.
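The iterative updating idea can likewise be conveyed with a small sketch. The Gaussian forms, the decision to update only the prior mean, and all numbers are simplifying assumptions of mine, not the published model.

```python
# Hypothetical sketch of iterative prior updating producing central tendency bias.
# After each trial the prior over distance gravitates towards recent distances;
# estimates therefore regress towards the session mean (fitted slope < 1).
import numpy as np

rng = np.random.default_rng(1)

def session_slope(sample_mean, n_trials=80, obs_var=9.0,
                  prior_mean=10.0, prior_var=9.0):
    true_ds, estimates = [], []
    for _ in range(n_trials):
        true_d = rng.normal(sample_mean, 2.0)             # imposed distance
        obs_d = rng.normal(true_d, np.sqrt(obs_var))      # noisy self-motion estimate
        k = prior_var / (prior_var + obs_var)             # weight on the observation
        est = prior_mean + k * (obs_d - prior_mean)       # posterior mean
        true_ds.append(true_d)
        estimates.append(est)
        prior_mean = est   # iterative update (prior variance kept fixed for simplicity)
    return np.cov(estimates, true_ds)[0, 1] / np.var(true_ds, ddof=1)

for m in (6.0, 10.0, 14.0):   # "small", "medium", "large" sample distributions
    print(f"session mean {m:>4}: estimate-vs-distance slope {session_slope(m):.2f}")
# Slopes below 1 indicate central tendency bias: short distances are
# overestimated and long distances underestimated, relative to the session mean.
```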
Dead reckoning is only one navigation strategy found in the animal kingdom. Equally important is piloting, during which the creature uses landmarks to estimate its own position (Gallistel, Reference Gallistel1990, p. 41, pp. 88–93, pp. 120–123). Even relatively primitive creatures, such as rats and bats, engage in piloting. Of course, humans routinely do so. There is strong evidence that human piloting relies on Bayesian inference (Jetzschke et al., 2017), as does human combination of self-motion cues and landmark cues (Chen et al., Reference Chen, McNamara, Kelly and Wolbers2017).
Dead reckoning and piloting are key to navigation, but they are just the beginning. Piloting presupposes mapping: estimation of landmark locations. Mapping also figures prominently in robotics, where the standard solution centers upon approximate Bayesian inference (Thrun, Burgard & Fox, Reference Thrun, Burgard and Fox2005). Several researchers have conjectured that some mammals likewise implement Bayesian mapping (Gallistel, Reference Gallistel, Jeffries and Yeap2008; Rescorla, Reference Rescorla2009). The conjecture fits well with everything we know about mammalian navigation (Savelli & Knierim, 2019; Shikauchi et al., 2021). Moreover, it can explain within a single theoretical framework disparate navigational phenomena that otherwise resist unified explanation (Kessler, Frankenstein & Rothkopf, 2024). The topic merits, and will surely receive, further investigation.
4.4 Other Psychological Domains
Researchers have applied the Bayesian perspective to numerous domains, such as causal reasoning (Griffiths & Tenenbaum, Reference Griffiths and Tenenbaum2009; Oaksford & Chater, Reference Oaksford and Chater2020), social cognition (Baker & Tenenbaum, 2014), intuitive physics (Battaglia, Hamrick & Tenenbaum, Reference Battaglia, Hamrick and Tenenbaum2013; Sanborn, Mansinghka & Griffiths, Reference Sanborn, Mansinghka and Griffiths2013), language acquisition (Abend et al., Reference Abend, Kwiatkowski, Smith, Goldwater and Steedman2017), syntactic parsing (Narayanan & Jurafsky, Reference Narayanan and Jurafsky1998), concept acquisition (Goodman et al., Reference Goodman, Tenenbaum, Feldman and Griffiths2008), music cognition (Temperley, Reference Temperley2007), reading (Norris, Reference Norris2006), memory (Hemmer & Steyvers, Reference Hemmer and Steyvers2009), categorization (Sanborn, Griffiths & Navarro, Reference Sanborn, Griffiths and Navarro2010), and so on. Applications vary in their predictive and explanatory power. Few achieve the astonishing explanatory successes found in perception and motor control. Still, they are often more successful than competing non-Bayesian approaches, as readers can confirm for themselves by accessing the above-cited texts.
4.5 Anti-Bayesian Phenomena?
Like every prominent cognitive science research program, Bayesian modeling has attracted a great deal of criticism (Bowers & Davis, Reference Bowers and Davis2012; Eberhardt & Danks, Reference Eberhardt and Danks2011; Jones & Love, Reference Jones and Love2011; Mandelbaum, Reference Mandelbaum2019). Perhaps the most basic criticism is that many mental phenomena appear radically anti-Bayesian. This criticism traces back to Kahneman and Tversky, who discovered intriguing cognitive phenomena that apparently violate Bayesian norms (e.g. Kahneman & Tversky, Reference Kahneman and Tversky1979; Tversky & Kahneman, Reference Tversky and Kahneman1983). A good example is anchoring bias (Tversky & Kahneman, Reference Tversky and Kahneman1974): when asked to estimate a quantity (such as the distance between Los Angeles and San Francisco), people are biased towards a randomly selected number provided to them. In effect, the randomly selected number serves as an “anchor” that pulls judgment away from a more accurate estimate. Anchoring bias is irrational and hence suggests that people violate the norms of Bayesian decision theory. Beyond the cognitive-level irrationalities discovered by Kahneman and Tversky, researchers have documented seemingly anti-Bayesian phenomena in other domains, including perception (Gardner, Reference Gardner2019; Mandelbaum et al., Reference Mandelbaum, Won, Gross and Firestone2020; Rahnev & Denison, Reference Rahnev and Denison2018).
Proponents of Bayesian modeling reply that many apparently anti-Bayesian phenomena can in fact be modeled in Bayesian terms (Stocker, Reference Stocker2018). Consider the size‒weight illusion: when you lift two objects of equal weight but different size, the smaller object feels heavier. At first, the illusion looks anti-Bayesian because it flouts a prior expectation that larger objects are heavier. However, the illusion turns out to be explicable by a Bayesian model that estimates relative densities (Peters et al., Reference Peters, Ma and Shams2016).
Even when a phenomenon cannot be modeled in Bayesian terms, it can often be modeled in terms of approximately Bayesian inference (Chater et al., Reference Chater, Zhu, Spicer, Sundh, León-Villagrá and Sanborn2020). In this spirit, Lieder et al. (Reference Lieder, Griffiths, Huys and Goodman2018) show that anchoring bias arises naturally from a sampling approximation to idealized Bayesian inference. They assume that, when subjects are provided with a randomly selected number, this number serves as the initial sample for an MCMC algorithm. (See Section 3.4 to review MCMC algorithms.) Samples are biased towards the initial sample, which may be quite far from an optimal Bayesian estimate. As the algorithm proceeds, it draws samples closer to the optimal Bayesian estimate. The extent of anchoring bias depends upon how long the algorithm runs (i.e. how many samples it draws). When computation is costly (e.g. because computational resources are needed for another task), anchoring bias increases because the system draws fewer samples and does not get as far from the initial sample. On this approach, anchoring bias arises from “rational” use of limited computational resources: the system balances accurate estimation against the cost of computation. The sampling model explains a range of effects, such as increased anchoring bias due to cognitive load or time pressure. Many other seemingly anti-Bayesian cognitive phenomena can be similarly explained in terms of sampling approximation to idealized Bayesian inference (Chater et al., Reference Chater, Zhu, Spicer, Sundh, León-Villagrá and Sanborn2020; Dasgupta, Schulz & Gershman, Reference Dasgupta, Schulz and Gershman2017).
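A toy version of the sampling story can be written down directly. The Gaussian "posterior," the random-walk proposal, and every number below are illustrative assumptions of mine rather than Lieder et al.'s actual model.

```python
# Hypothetical sketch: anchoring as an under-sampled MCMC estimate.
# The provided anchor is the chain's initial sample; with few samples the
# running estimate stays near the anchor, with more samples it migrates
# towards the optimal Bayesian estimate.
import numpy as np

rng = np.random.default_rng(2)

def mcmc_estimate(anchor, n_samples, optimum=600.0, sd=50.0, step=20.0):
    log_post = lambda x: -0.5 * ((x - optimum) / sd) ** 2
    x, samples = anchor, []
    for _ in range(n_samples):
        proposal = x + step * rng.standard_normal()
        if np.log(rng.uniform()) < log_post(proposal) - log_post(x):
            x = proposal                        # Metropolis acceptance
        samples.append(x)
    return np.mean(samples)

anchor = 200.0                                  # randomly provided number
for n in (5, 50, 500):
    est = np.mean([mcmc_estimate(anchor, n) for _ in range(100)])
    print(f"{n:>3} samples: average estimate {est:.0f} (anchor 200, optimum 600)")
# Fewer samples (e.g. under cognitive load or time pressure) leave the estimate
# anchored near 200; more samples pull it towards 600.
```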
Obviously, there is no guarantee that all psychological processes will turn out to be Bayesian or approximately Bayesian. Sub-systems may conform to Bayesian norms to a greater or lesser degree. For example, it could be that perception conforms quite closely to Bayesian norms while high-level decision-making does not. It could be that certain perceptual processes conform closely to Bayesian norms while other perceptual processes do not, or that certain perceptual processes conform to Bayesian norms under certain circumstances but not other circumstances—e.g. that perceptual processes conform to Bayesian norms only when the perceiver is paying attention (Morales et al., Reference Morales, Solovey, Maniscalco, Rahnev, de Lange and Lau2015). These possibilities require investigation. We must build and test detailed Bayesian models of specific phenomena, evaluating afresh how well each model fits the data. That is exactly what Bayesian cognitive scientists do on a daily basis. Underlying this research program is a key methodological commitment: enough mental processes are at least approximately Bayesian that constructing and testing Bayesian models of specific mental processes is a worthwhile endeavor. So far, the methodological commitment has been amply vindicated.Footnote 10
4.6 Where do the Priors Come From?
A natural question posed by Bayesian modeling is how the prior probability and prior likelihood arise. For example, the Bayesian dead reckoning model assumes a “slow speed” prior but says nothing about the prior’s etiology. A similar point applies to other Bayesian models found in the literature. The models postulate priors that underlie Bayesian inference, without explaining how the priors arise. Given that priors are highly mutable, a good explanation will surely cite a complex mixture of evolutionary and developmental factors. So far, though, no such explanation is available.
Some critics complain that Bayesian models are unexplanatory due to their reliance on postulated priors (Hutto & Myin, Reference Hutto and Myin2017, pp. 67–74, pp. 154–155; Orlandi, Reference Orlandi2014, p. 91). The worry is that Bayesian models rest upon unexplained explainers. How much can a Bayesian model explain when it posits priors but offers no explanation for the priors?
In my opinion, this complaint has no force. We must distinguish between incomplete theories and unexplanatory theories. Every scientific theory contains unexplained explainers: postulates that serve as explanantia. For example, Newtonian physics postulates that objects have mass, but it does not explain how objects come to have the masses that they have. No one should complain on that basis that Newtonian physics is unexplanatory. In many cases, a successful scientific theory contains huge explanatory gaps. A famous example is the theory of natural selection as formulated in On The Origin of Species. Darwin postulated suitable hereditary mechanisms but had no clue what those mechanisms were. No one should complain on that basis that the theory of natural selection as formulated by Darwin was unexplanatory. A scientific theory can offer powerful explanations even though it includes unexplained explainers.
Of course, it is always good to eliminate unexplained explainers. Modern biology achieved a decisive advance when it discovered the genetic basis of heredity. Bayesian cognitive science will likewise achieve a major advance when it illuminates the etiology of priors. Until that advance, Bayesian cognitive science will be incomplete in a significant way. Even in its present incomplete state, it offers powerful explanations for many psychological phenomena.
5 Realism and Instrumentalism
Any Bayesian model posits credal states (assignments of credences to hypotheses) and credal transitions (transitions among credal states). At a bare minimum, a Bayesian model posits a prior probability, a prior likelihood, and a transition to a posterior or approximate posterior. The Bayesian model may posit a succession of credal states, as in sensorimotor psychology.
Suppose that a Bayesian model is explanatorily successful, in the sense that it supplies compelling explanations for observed phenomena. Let us distinguish two opposing viewpoints one might adopt towards the model: realism and instrumentalism. Realists hold that we have good reason to deem the model an approximately true description of mental activity (Rescorla, Reference Rescorla, Nes and Chan2020c). From a realist perspective, the mind instantiates credal states and transitions at least roughly like those posited by the model. The model describes actual mental states and processes that mediate between inputs (e.g. retinal inputs) and outputs (e.g. perceptual estimates; motor commands). Instrumentalists, on the other hand, regard the model as nothing but a useful predictive device (Block, Reference Block2018; Colombo & Seriès, Reference Colombo and Seriès2012; Orlandi, Reference Orlandi2014). The model helps us summarize the mapping (possibly stochastic) from inputs to outputs, but it does not describe actual mental states and processes with even approximate accuracy. From an instrumentalist perspective, we have no reason to believe that the mind instantiates credal states or that it executes anything resembling (approximate) Bayesian inference. We should conclude only that the mind operates as if it executes (approximate) Bayesian inference. Whereas realists attribute psychological reality to credal states and transitions postulated by explanatorily successful Bayesian models, instrumentalists do not.
To illustrate how realism and instrumentalism differ, consider Figure 28. The figure depicts how the perceptual system converts proximal sensory input y into a perceptual estimate x*. It posits two mental states (the prior probability and the prior likelihood) that along with y cause a third mental state (the approximate posterior), which in turn causes perceptual estimate x*. From a realist perspective, we should take this causal structure seriously as a guide to underlying psychological reality. There really do exist priors, they really do interact with input y to cause an approximate posterior, and this approximate posterior really does cause a perceptual estimate x*. In contrast, instrumentalists do not take Figure 28 as a guide to underlying psychological reality. All that we should take seriously about Figure 28, they say, is the induced mapping from input y to estimate x*.
I have defended realism at length in previous writings (Rescorla, Reference Rescorla and Matthen2015a; Rescorla, Reference Rescorla2015b; Rescorla, Reference Rescorla, Nes and Chan2020c). Here, I will briefly adduce a few considerations in its favor.
5.1 Scientific Realism
My realist perspective on Bayesian cognitive science is grounded in a general commitment to scientific realism. Scientific realism traces back to Putnam (Reference Putnam1975) and has been elaborated by many subsequent philosophers. The basic idea is that explanatory success is a prima facie indication of approximate truth. When a scientific theory is explanatorily successful, we have prima facie reason to believe that it is at least approximately true. For example, the explanatory success of modern physics provides reason to believe in subatomic particles.
My realist perspective on Bayesian cognitive science results from straightforward application of scientific realism to Bayesian modeling. Many Bayesian models, although not all, are explanatorily successful. From a scientific realist perspective, we have reason to regard these models as at least approximately true. We have reason to accept that there exist credal states and transitions roughly like those posited by the model. Just as the explanatory success of modern physics provides reason to believe in subatomic particles, the explanatory success of a Bayesian model provides reason to believe in credal states and transitions.
Not all philosophers accept scientific realism. Some authors favor an instrumentalist perspective on scientific theorizing (van Fraassen, Reference van Fraassen1980). According to instrumentalism, a scientific theory is just a useful tool for making predictions. When a scientific theory is explanatorily successful, we have no reason to believe that the theory is even approximately true. For example, the explanatory success of modern physics provides no reason to believe that there are subatomic particles. Philosophers who favor an instrumentalist perspective more generally will surely want to apply it specifically to Bayesian cognitive science. If you do not believe in subatomic particles, then you probably do not believe in credal states and transitions!
Typically, researchers who favor instrumentalism about Bayesian cognitive science do not evince a more general commitment to instrumentalism about scientific theorizing. They do not hold that we should be instrumentalists about scientific theories in general. Instead, they argue that we should be instrumentalists for the special case of Bayesian cognitive science. In my opinion, their arguments for that differential stance have little force. I see no compelling reason why philosophers inclined towards scientific realism in general should favor instrumentalism for the special case of Bayesian modeling. I will illustrate my viewpoint by critiquing an instrumentalist argument advanced by Block (Reference Block2023) and tailored to the special case of Bayesian cognitive science.
5.2 Simulation or Implementation?
According to Block, we seldom if ever have reason to believe that a psychological system implements approximate Bayesian inference as opposed to merely simulating approximate Bayesian inference. To support his assessment, Block cites evolutionary considerations (Reference Block2023, p. 208):
Evolution is a pro-instrumentalist mechanism. There is no doubt that behaving according to Bayesian norms is enormously valuable for an organism and we can expect strong evolutionary pressure toward behavior that fits the norms of Bayesian rationality. But Bayesian rational behavior does not have to be implemented using the conceptual apparatus that is best suited to describing Bayesian rational processes by the theorist. The problem with Rescorla’s argument is that it is not clear that the way evolution chose to produce behavior that adheres roughly to Bayesian norms involves the representation of probabilities in the perceptual system.
Block develops his position by citing an experiment on pea plants conducted by Dener, Kacelnik, and Shemesh (Reference Dener, Kacelnik and Shemesh2016). Each plant’s roots were divided between two pots. The two pots received equal mean levels of nutrients. Nutrient levels were constant in one pot and variable in the other. More roots developed in the constant pot if the mean nutrient level was high, and more roots developed in the variable pot if the mean nutrient level was low. This growth pattern comports with expected utility theory, which (assuming a suitably shaped utility function) mandates risk aversion in rich conditions and risk proneness in poor conditions. Block writes (Reference Block2023, pp. 209–210):
[T]he pea plant behaves as if it represented mean levels of nutrients and their degree of uncertainty. Since the pea plant lacks a nervous system, we can be pretty sure that there are no such representations. Somehow, natural selection has found a way for plants to behave according to some of the norms of Bayesian rationality without those representations. The challenge to Rescorla’s reasoning is that we have to allow for the possibility that the same is true of our perceptual systems.
Block concludes that, even if we favor a realist stance towards scientific theorizing in general, we should adopt an instrumentalist stance towards Bayesian perceptual psychology. Presumably he would extend the conclusion to other branches of Bayesian cognitive science.
In evaluating Block’s argument, we must carefully distinguish between subjective and objective probability. Due to the experimental protocols, there are objective probabilities that govern the nutrient level in each pot. We might gloss these either as frequencies or as chances. Either way, they are objective features of the world, lacking any subjective element. Given a pot’s objective probability distribution, we can describe the mean and the variance. The experiment shows that root growth is sensitive to both the mean and the variance. Thus, root growth is sensitive to objective probabilities (or to properties that supervene upon objective probabilities).
One might try to explain that sensitivity by attributing credal states to the pea plants. One might posit subjective probabilities, instantiated by each plant, that track the objective probabilities governing each pot. One might postulate that root growth is influenced by an expected utility computation based upon the posited subjective probabilities. I agree with Block that the proposed explanation is both implausible and unmotivated. It is implausible because plant physiology does not seem able to support expected utility computations. It is unmotivated because nothing about the pea plant study indicates that credal states or utility functions mediate the causal influence of objective probabilities upon root growth. The mere fact that a system is sensitive to the mean and variance of an objective probability distribution does not suggest that the system instantiates credal states. For example, we can construct a machine whose outputs are sensitive to the frequency with which a biased coin lands heads; there is no reason why the machine must instantiate credal states. Mere sensitivity to objective probabilities (or properties that supervene on objective probabilities) is not a prima facie indicator of Bayesian computation. This remains so even when the mapping from objective probabilities to outputs happens to mirror the dictates of expected utility theory.
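To see how cheaply such sensitivity can be had, consider a hypothetical toy machine (my own illustration) whose output covaries with a coin's bias even though it stores nothing beyond a running count:

```python
# Hypothetical machine sensitive to the frequency of heads, with no credal
# states: it merely accumulates a count and applies a threshold.
def dispense(flips, threshold=0.6):
    """flips: sequence of 0s (tails) and 1s (heads) produced by the coin."""
    return "dispense" if sum(flips) / len(flips) > threshold else "withhold"

print(dispense([1, 1, 0, 1, 1, 1, 0, 1]))   # output tracks the coin's objective bias
```

The machine's behavior is lawfully sensitive to an objective probability, yet nothing in it assigns credences to hypotheses.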
One can describe any system using Bayesian decision theory. Adapting an example of Dennett’s (Reference Dennett1987, p. 23), one can “explain” why a lectern does not move by saying that the lectern assigns high utility to occupying the optimal location in the universe and assigns high credence to the hypothesis that it currently occupies the optimal location in the universe. Clearly, though, we should not accept the purported “explanation.” It contributes no value to our theorizing. It does not improve upon a non-Bayesian explanation couched wholly in terms of physics (e.g. an explanation that cites the law of inertia). A similar diagnosis applies to the pea plant study. The two cases are not totally analogous, because we know why the lectern does not move (inertia) but do not yet know the physiological mechanisms through which objective probabilities causally influence root growth. Still, as Dener, Kacelnik, and Shemesh (Reference Dener, Kacelnik and Shemesh2016) themselves emphasize, nothing about the pea plant study suggests that the mechanisms involve credal states or utility functions. We have no reason to think that Bayesian modeling would add any explanatory force to an eventual physiological explanation couched in non-Bayesian terms.
A very different diagnosis applies to numerous Bayesian models offered in cognitive science, including but not limited to perceptual psychology. In many cases, the Bayesian model adds considerable explanatory value to our theorizing. For example, the (Lakshminarasimhan et al., Reference Lakshminarasimhan, Petsalis, Park, DeAngelis, Pitkow and Angelaki2018) dead reckoning model explains why more overshooting occurs when the optic flow cue is degraded, and it also explains why undershooting occurs for relatively distant targets. The model thereby achieves the unity characteristic of good explanation. Similarly, the (Weiss, Simoncelli & Adelson, Reference Weiss, Simoncelli and Adelson2002) motion estimation model offers a unified explanation for diverse motion illusions. In these cases, and in many others, the Bayesian model makes a substantial explanatory contribution that looks otherwise unachievable. The explanatory contribution includes qualitative predictions for disparate phenomena coupled with quantitative predictions that closely match experimental data. From a scientific realist perspective, these explanatory achievements provide reason to believe that the model is approximately true.
The contrast with the pea plant study is glaring. Dener, Kacelnik, and Shemesh (Reference Dener, Kacelnik and Shemesh2016) do not provide a Bayesian model of root growth. Indeed, they do not so much as hint which prior probability or prior likelihood such a model might include. They do not suggest, let alone argue, that a Bayesian model could offer a unified explanation for disparate phenomena or that it could yield quantitative predictions that closely match experimental data. There is not even a Bayesian model here that we can evaluate, much less a model that achieves anything approaching the explanatory success of the Bayesian dead reckoning model, the Bayesian motion estimation model, or numerous other Bayesian cognitive science models. That is why Bayesian cognitive science supports the existence of credal states but the pea plant study, for all its interest, does not.
I agree with Block that a system can simulate (approximate) Bayesian inference. For example, a system might convert inputs into outputs by consulting a look-up table. More realistically, a system might acquire an input‒output mapping through reinforcement learning. In reinforcement learning, the system receives rewards for how it responds to inputs, and it adjusts its responses to obtain optimal or near-optimal rewards. Systems trained through reinforcement learning can mimic certain kinds of approximate Bayesian inference (Weisswange et al., Reference Weisswange, Rothkopf, Rodemann and Triesch2011). In principle, then, evolution might produce a system that operates as if it executes approximate Bayesian inference even though it does not actually execute approximate Bayesian inference. Nevertheless, I think it misleading to describe evolution as a “pro-instrumentalist mechanism.” I see no reason why evolution should favor simulation of approximate Bayesian inference over implementation of approximate Bayesian inference.
When a scientific theory accurately predicts the behavior of a system, there is always a possibility that the theory is utterly false and that the system merely behaves as if the theory is true. For example, it is in principle possible that subatomic particles do not exist and that the physical universe merely behaves as if they exist. Physicists would only regard that in principle possibility as worth taking seriously if it were developed into a rival theory that matched modern physics in explanatory power. Similarly, we should only take seriously the suggestion that mental activity simulates rather than implements approximate Bayesian inference once it is developed into detailed models that rival current Bayesian models in explanatory power. So far, that has not happened. The scientific literature does not offer non-Bayesian models comparable in explanatory power to the Bayesian dead reckoning model, the Bayesian motion estimation model, or numerous other Bayesian models found in contemporary cognitive science. Perhaps impressive non-Bayesian models will eventually emerge. In their absence, the mere possibility that they might emerge should not worry realists about Bayesian cognitive science any more than the mere possibility of a successful physical theory that eschews subatomic particles should worry realists about subatomic particles.
5.3 The Argument From Altered Priors
I now rehearse an additional argument for realism regarding Bayesian cognitive science. My argument rests upon a crucial fact: input‒output mappings rapidly change in response to changing environmental conditions. We have seen several examples:
Shape and lightness perception change in response to stimuli that indicate a deviant lighting direction (Adams, Graf & Ernst, Reference Adams, Graf and Ernst2004).
Motion perception changes in response to fast-moving stimuli (Sotiropoulos, Seitz & Seriès, Reference Sotiropoulos, Seitz and Seriès2011).
The mapping from sensory inputs to motor commands changes in response to shifts in apparent finger position (Kording & Wolpert, Reference Kording and Wolpert2004).
Central tendency bias occurs in a wide range of domains, including perceptual estimation (Section 4.1) and dead reckoning (Section 4.3).
These experimental phenomena, and numerous others, conclusively demonstrate that the mapping from inputs to outputs is highly mutable.
Realists can easily explain in each case why the mapping changes as it does. They can say that the priors change so as to match changing environmental statistics. For example, suppose that a subject exhibits central tendency bias towards the mean of an experimentally imposed sample distribution, as in the (Petzschner & Glasauer, Reference Petzschner and Glasauer2011) dead reckoning experiment. Realists explain the bias as follows: the prior shifts to match the sample distribution, which causes estimates to shift towards the mean of the distribution. Instrumentalists can acknowledge that the input‒output mapping shifts, but they offer no principled explanation for why it shifts as it does. From an instrumentalist perspective, there is no principled reason why estimates should shift to match recent stimuli. The mere fact that a system simulates approximate Bayesian inference using certain priors provides no reason to expect that the system will change any particular way in response to changing environmental statistics. Hence, realism offers a major explanatory advantage over instrumentalism.
Call this the argument from altered priors. Although I have formulated the argument as applied to prior probabilities, similar argumentation applies to prior likelihoods and to posteriors (Rescorla, Reference Rescorla, Nes and Chan2020c).
Block rejects the argument from altered priors: “I find this argument unconvincing because whatever it is about the computations of a system that simulates the effect of represented priors … might also be able to simulate the effect of change of priors” (Reference Block2018, p. 8).
I agree that, in principle, a system that simulates approximate Bayesian inference given certain priors might respond to changing environmental conditions by simulating approximate Bayesian inference given another set of priors. I question whether instrumentalists can develop that possibility into compelling models. The argument from altered priors is abductive: realism provides the best explanation for why input‒output mappings change as they do. One does not undermine an abductive argument by noting that an alternative explanation may emerge. To undermine the argument from altered priors, one must provide a specific alternative explanation and show that it is at least as satisfying as the realist explanation.
In this connection, consider a system trained through reinforcement learning to simulate Bayesian inference given certain priors. By varying the rewards, we can train the system to simulate Bayesian inference given another set of priors. Accordingly, instrumentalists might hope that reinforcement learning can explain changes to the input‒output mapping. In many cases, though, subjects receive either no feedback or extremely limited feedback on their performance. To illustrate, consider the (Petzschner & Glasauer, Reference Petzschner and Glasauer2011) dead reckoning study. Participants received no feedback on their performance during each session, aside from a few initial training trials to ensure familiarity with the virtual reality setup. How, then, can reinforcement learning explain why subjects displayed central tendency bias? There was no “reward” to drive the ongoing change in learned responses. This study provides evidence that subjects iteratively update a distance prior in response to accumulated evidence.
Perhaps instrumentalist theories will eventually emerge that explain changing input‒output mappings without an appeal to changing priors. We would then need to compare those instrumentalist theories with realist Bayesian theories. Until that time, we do well to develop the realist perspective and see where it leads.Footnote 11
5.4 Neural Implementation
To gain more insight into the dialectic between realism and instrumentalism, let us consider the neural implementation of approximate Bayesian inference. How are credal states physically realized in the brain? Which neural operations implement computation of the (approximate) posterior from the priors? These questions do not arise for instrumentalists because instrumentalists do not regard credal states and transitions as psychologically real. For realists, the questions are pressing.
Computational neuroscientists have proposed several theories of how the brain might implement credal states and transitions (Fiser et al., Reference Fiser, Berkes, Orbán and Lengyel2010; Pouget et al., Reference Pouget, Beck, Ma. and Latham2013; Rescorla, Reference Rescorla, Cheng, Sato and Hohwy2024). The proposed theories are biologically plausible and fit well with what we know about the brain, although no single theory has yet emerged as well-confirmed.
The credal states considered in Bayesian cognitive science are usually given by pdfs. Recall that a pdf determines probabilities assigned to intervals [a, b]. There are infinitely many of these intervals. The brain is a finite physical system and hence, as discussed in Section 3.4, cannot explicitly list each individual probability P([a, b]). Since the brain cannot enumerate the probability assigned to each interval, probabilities must be implicitly encoded by neural activity. The two main implicit encoding schemes under active consideration were mentioned in Section 3.4:
Parametric encoding: the brain encodes parameters for the pdf. One possibility is that parameters are encoded by spike counts in a neural population (Ma et al., Reference Ma, Beck, Latham and Pouget2006). Each neuron is associated with a preferred stimulus value, and each neuron’s spike count is interpreted as the strength of its “vote” for that stimulus value. “Votes” across the neural population determine parameters of a pdf, e.g., the mean and variance of a Gaussian. See Figures 35 and 36.
Sampling encoding: the brain encodes a probability distribution via sampling propensities. For example, a neuron’s membrane potential might encode a sample (Orbán et al., Reference Orbán, Berkes, Fiser and Lengyel2016). The objective chance distribution governing membrane potentials encodes the subjective probability distribution for the variable.
Computational neuroscientists have produced detailed neural network models that enshrine these encoding schemes. The models show how, in principle, a population of neurons could implement approximate Bayesian inference. Neuroscientists want to discover which implementation scheme(s) the brain actually uses.
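To give a feel for what such a model looks like, here is a toy sketch in the spirit of the parametric scheme (cf. Ma et al., 2006). The simplifying assumptions (independent Poisson spiking, translation-invariant Gaussian tuning curves, a flat prior) and all numbers are mine.

```python
# Hypothetical sketch of a probabilistic population code.
# Poisson spike counts from neurons with Gaussian tuning curves implicitly
# encode the mean and variance of a Gaussian posterior over the stimulus.
import numpy as np

rng = np.random.default_rng(3)

preferred = np.linspace(-10, 10, 41)   # each neuron's preferred stimulus value
tc_width, gain = 2.0, 15.0             # tuning-curve width and peak firing rate

def population_response(stimulus):
    rates = gain * np.exp(-0.5 * ((stimulus - preferred) / tc_width) ** 2)
    return rng.poisson(rates)          # one volley of spike counts

def decode(spikes):
    # Under these assumptions, the encoded posterior is Gaussian and its
    # parameters are simple functions of the spike counts.
    mean = np.sum(spikes * preferred) / np.sum(spikes)
    variance = tc_width ** 2 / np.sum(spikes)   # more spikes -> narrower posterior
    return mean, variance

mu, var = decode(population_response(stimulus=3.0))
print(f"decoded posterior: mean {mu:.2f}, sd {np.sqrt(var):.2f}")
# A sampling encoding would instead represent the same posterior through the
# statistics of stochastic neural activity unfolding over time.
```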
Research into neural implementation of approximate Bayesian inference presupposes a broadly realist perspective on credal states and transitions (Ma, Reference Ma2019; Rescorla, Reference Rescorla2021a). If there are no priors, investigating how priors are realized in the brain is a waste of time. If the brain does not execute approximate Bayesian inference, investigating the neural operations that implement approximate Bayesian inference is a bad use of scientific resources. Thus, a major strand in current computational neuroscience presupposes a realist stance towards at least some Bayesian models.
Research into neural implementation also helps clarify realism’s commitments. The core realist thesis is that priors and posteriors are genuine mental states that mediate between inputs and outputs. As genuine mental states, they must be neurally realized in some way or other. Realism is neutral about how exactly they are realized. In particular, realists do not claim that credal assignments are explicitly enumerated in the brain. On the contrary, realists recognize that explicit enumeration is impossible for most cases. They instead appeal to implicit encoding. They hold that credal states posited by Bayesian models are implicitly encoded by the brain. The implicit encoding scheme might be parametric, sampling, or something else entirely (e.g. Ganguli and Simoncelli, Reference Ganguli and Simoncelli2014). The brain may also use multiple encoding schemes simultaneously. Realism does not enshrine a commitment to any particular encoding scheme or class of encoding schemes.
6 Mental Representation
The previous section advanced a realist perspective on the credal states posited by Bayesian models. I now want to probe more deeply into the nature of the posited credal states. I will explore how they relate to the mind’s representational nature.
The phrase “mental representation” is used many different ways in contemporary philosophy and psychology. My own usage reflects a tradition that traces back to Frege (Reference Frege and Beaney1892/1997) and continues through contemporary figures such as Burge (Reference Burge2010) and Fodor (Reference Fodor1975; Reference Fodor1987; Reference Fodor2008). According to this tradition, mental representation is connected with veridicality-conditions: conditions for veridically representing the world. Examples:
Beliefs are the sorts of things that can be true or false. My belief that Napoleon was born in Corsica is true if Napoleon was born in Corsica, false if he was not.
Intentions are the sorts of things that can be fulfilled or thwarted. My intention to eat lentils for lunch is fulfilled if I eat lentils for lunch, thwarted if I do not.
Perceptual states are the sorts of things that can be accurate or inaccurate. Suppose I perceive object o as being a green cube. Then my perceptual state is accurate only if o is green and cubical.
Beliefs have truth-conditions, intentions have fulfillment-conditions, and perceptual states have accuracy-conditions. Truth, fulfillment, and accuracy are species of veridicality.
Representational properties are properties that contribute or potentially contribute to veridicality-conditions. For example, suppose I have a belief about Napoleon. The mere fact that my belief is about Napoleon does not determine whether my belief is true or false. Nevertheless, my belief depends for its truth or falsity on how things are with Napoleon (rather than some other person). That my belief is about Napoleon helps determine the belief’s truth-condition. So being about Napoleon is a representational property of my belief. Similarly, suppose I perceive some object as a green cube. The mere fact that my perceptual state represents green cubicality does not determine whether the state is accurate—accuracy also depends on which cube I am perceptually representing. Nevertheless, my perceptual state depends for its accuracy on whether the perceptually represented object is a green cube. That my perceptual state represents green cubicality helps determine the state’s accuracy-condition. So representing green cubicality is a representational property of my perceptual state.
I will argue that credal states posited within Bayesian cognitive science have representational properties, and I will elucidate the explanatory role played by these representational properties.
6.1 Representational Explanation
Bayesian cognitive science seeks to explain mental and behavioral outcomes. It frequently characterizes the outcomes in representational terms. Examples:
Perceptual psychology seeks to explain illusions. An illusion is a perceptual state that inaccurately represents the distal environment. So the science presupposes that perceptual states have accuracy-conditions.
Sensorimotor psychology seeks to explain how the motor system chooses motor commands that promote the agent’s goals. A goal may be fulfilled or thwarted. So the science presupposes mental states with fulfillment-conditions. These are conative states, i.e., mental states whose role is to initiate and sustain action. Often, the conative state is an intention (e.g. an intention to reach to a target). Burge (Reference Burge2022, pp. 502–530) argues that there also exist relatively low-level conative states lacking various features of intention, such as intention’s characteristic ties to theoretical and practical reasoning, and that these low-level conative states set goals for motor control. For present purposes, the key point is that sensorimotor psychology presupposes goal-setting by mental states with fulfillment-conditions.
Sensorimotor psychology seeks to explain why movement details vary more along task-irrelevant dimensions than task-relevant dimensions. The distinction between task-relevant and task-irrelevant dimensions presupposes a goal that may be fulfilled or thwarted.
Research on human dead reckoning seeks to explain overshooting. To overshoot a location, the subject must have that location as a goal. So the science presupposes a conative mental state with a fulfillment-condition. Since human dead reckoning typically interfaces with fairly sophisticated planning and decision-making, it seems likely that the conative state is typically an intention or something much like an intention. In some special cases, though, it may be a relatively low-level conative state more along the lines discussed by Burge.
As these examples illustrate, Bayesian cognitive science often characterizes explananda in representational terms.
The science also frequently characterizes explanantia, including credal states, in representational terms. Examples:
The Bayesian dead reckoning model explains overshooting by positing a “slow speed” prior over self-motion. The prior causes the navigation system to underestimate displacement. To encode a prior that favors slow speeds, the navigation system must be able to represent speed. So the explanation of overshooting presupposes that the navigation system can represent speed. The explanation hinges upon a credal allocation over possible speeds, leading to an inaccurate displacement estimate.
Bayesian perceptual psychology assumes that the perceptual system represents distal properties. It posits a prior regarding represented distal conditions (e.g. a prior that favors overhead lighting directions). When the prior is poorly calibrated to the perceiver’s environment, the resulting perceptual estimates tend to be inaccurate. For example, the “light from overhead” prior produces inaccurate shape estimates in deviant conditions where light comes from below.
To explain how the motor system promotes the agent’s goals, Bayesian sensorimotor psychology posits sequential updating of credal assignments regarding the distal environment and the subject’s own body. Credal assignments influence which motor commands are chosen. When credal assignments are poorly calibrated to the environment (e.g. the prior over shifts in finger position does not match actual finger shifts), the task goal tends to be thwarted.
Generally speaking, Bayesian cognitive science posits credal states regarding environmental conditions, including both distal properties (e.g. size, shape, color, location, density, etc.) and bodily properties (e.g. hand position). In describing credal states, researchers presuppose that the mind can represent the relevant environmental properties. Researchers characterize credal states by invoking representational relations to the environment. They cite these representationally-characterized credal states as explanantia.
Researchers in Bayesian cognitive science do not use the phrase “veridicality-condition.” They speak instead of random variables, probability distributions, pdfs, and other entities drawn from probability theory. Nevertheless, their theorizing assigns a central role to veridicality-conditions. They identify both explananda and explanantia by citing representational properties: either veridicality-conditions or properties that potentially contribute to veridicality-conditions.
Take the Bayesian dead reckoning model. The model seeks to explain overshooting, which presupposes a conative state with a fulfillment-condition. To explain overshooting, the model posits that the navigation system underestimates displacement—in other words, that the displacement estimate is inaccurate. So the model explains overshooting by positing a mental state (the displacement estimate) that is evaluable as veridical or nonveridical. To explain why the navigation system underestimates displacement, the model posits a prior that favors slow speeds. The prior assigns credences to hypotheses regarding the creature’s speed. For example, it assigns a credence to the hypothesis that the creature’s speed lies in the interval [a, b]. This hypothesis is individuated through representational relations to possible speeds (namely, speeds lying between a and b). By citing credal assignments to representationally-individuated hypotheses, the model depicts the navigation system as favoring slow speeds. It thereby explains overshooting. Explanation is laced at every stage with appeals to representational properties.
Similarly, consider Bayesian modeling of size perception (Ernst & Banks, Reference Ernst and Banks2002; Helbig & Ernst, Reference Helbig, Ernst and Grunwald2008). Here we posit a prior over possible distal sizes. The prior combines with sensory input (e.g. haptic or visual input) and a prior likelihood, yielding a posterior over possible distal sizes. On that basis, the perceptual system chooses a privileged size estimate, which goes into the final percept. The percept is veridical only if the perceived object has the estimated size. Thus, the final size estimate is individuated representationally—through its contribution to the percept’s veridicality-condition. The prior and posterior are also characterized representationally. These are credal states that allocate credences over hypotheses regarding distal size. Hypotheses are individuated through their representational properties—through the distal sizes that they represent. So the model posits mental states with representational properties, mediating between proximal sensory input and the (representationally-characterized) perceptual size estimate.
One could offer a similar analysis for virtually every other explanation found within Bayesian cognitive science. Bayesian researchers frequently characterize explananda in representational terms. They almost invariably characterize credal states in representational terms. For that reason, their research fits well with the representationalist paradigm espoused by Burge (Reference Burge2010; Reference Burge2022), Fodor (Reference Fodor1975; Reference Fodor1987; Reference Fodor2008), Peacocke (Reference Peacocke1994; Reference Peacocke1999), Pylyshyn (Reference Pylyshyn1984), Shea (Reference Shea2018), and many others.
6.2 Credal States Versus Mathematical Tools
I now develop my analysis by examining more closely the formal apparatus used by Bayesian modelers. The key point I wish to highlight is the distinction between credal states versus the mathematical tools used to specify credal states.
Look again at Figures 33 and 34. The green downward-sloping curve is a pdf: a function from the real numbers to the non-negative real numbers. The pdf induces a probability distribution over sets of real numbers. The pdf and the induced probability distribution are mathematical tools that theorists use to specify the “slow speed” prior. The “slow speed” prior is a credal state: an assignment of credences to hypotheses. We must sharply distinguish the credal state from the pdf and also from the induced probability distribution. Nothing about the pdf taken on its own suggests we are modeling a credal state that concerns speed. The same pdf could just as well specify a prior over possible sizes, or possible distances, or any other one-dimensional continuous physical magnitude. The pdf in itself does not indicate that we are modeling a “slow speed” prior as opposed to a “small size” prior, a “short distance” prior, or numerous other possible priors. The same goes for the induced probability distribution.
Similar remarks apply to most other Bayesian models. The modeler typically specifies credal states through a probability distribution over sets of real numbers, which in turn is typically specified through a pdf. The probability distribution taken on its own does not even begin to dictate the underlying credal state. The credal state is defined over hypotheses that are individuated through their representational relations to the environment. The probability distribution is a mathematical function individuated without regard to any such representational relations. The same probability distribution could just as well specify many different credal states.
To identify the credal state specified by a pdf, we must look beyond mathematical formalism and consider the broader enterprise to which the formalism contributes. We must first ask which psychological domain is being modeled: perception, or motor control, or navigation, and so on. We must also ask which aspects of the environment are represented by the credal state: shape, or size, or color, or speed, and so on. Usually, we can answer these questions by studying the text that accompanies the formalism. For example, Lakshminarasimhan et al. (Reference Lakshminarasimhan, Petsalis, Park, DeAngelis, Pitkow and Angelaki2018, p. 195) write that overshooting “can be explained by a model in which subjects maximized their expected reward under the influence of a slow-speed prior rather than by leaky integration of unbiased velocity estimates.” This passage and kindred passages show that the pdf from Figures 33 and 34 is intended to specify a credal state that favors slow speeds and that is deployed during dead reckoning. Analogous passages abound throughout Bayesian cognitive science. These passages are not idle prattle or disposable heuristic. They play a crucial theoretical role: they point us towards the credal states specified by Bayesian models.
Pdfs are indispensable mathematical tools. They allow us to specify credal states with mathematical precision, and they allow us to bring the calculus of real numbers to bear. Ultimately, though, they omit something crucial. They omit the representational properties that help individuate credal states.
To bring the distinction between credal states versus mathematical tools into sharper relief, it helps to reflect upon measurement units. Using measurement units, we can describe a physical magnitude (such as a speed) with a real number. For example, we can say that an object’s speed is 10 meters/sec. The physical magnitude is quite distinct from the number 10 that we use to measure it, as evidenced by the fact that a change in measurement units necessitates a change in the number used to specify the same physical magnitude. If we switch from meters/sec to feet/sec, we must now say that the object travels at approximately 32.8 feet/sec. We cite a different number to specify the same speed. Speeds are distinct from the numbers through which we measure speeds.Footnote 12
When we specify a credal state through a pdf, our choice of pdf depends upon a canonical choice of measurement units. A change in measurement units necessitates a change in the pdf we use to specify the credal state. Figure 37 illustrates. The blue pdf corresponds to meters/sec. The orange pdf corresponds to feet/sec. The pdfs are different, but they specify the same underlying probability assignment over possible speeds. They specify the same “slow speed” prior. Full technical details are given in Section A5, but the point should be intuitively clear even absent any technical details. A pdf is defined over real numbers, so it can describe a credal allocation over possible speeds only relative to measurement units that map speeds to real numbers. If we change the measurement units, then we must use a different pdf to model the same credal allocation over speeds. The different pdf will induce a different probability distribution over sets of real numbers, even while the underlying credal allocation remains fixed.
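The relation between the two pdfs can be stated explicitly (the notation here is mine; Section A5 gives the rigorous treatment). If x measures a speed in meters/sec and y = c·x measures the same speed in feet/sec, where c ≈ 3.28 is the number of feet per meter, then the feet/sec pdf p_ft and the meters/sec pdf p_m are related by

p_ft(y) = (1/c) · p_m(y/c),

which guarantees that corresponding intervals receive the same probability: the credence assigned to speeds between a and b meters/sec, computed from p_m, equals the credence assigned to speeds between c·a and c·b feet/sec, computed from p_ft.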
Our choice of measurement units reflects our societal conventions, not inherent features of the credal state itself. There is no reason to suspect that pre-theoretic human navigation employs our conventional measurement units. Indeed, it may not use any measurement units at all. (Cf. Peacocke, Reference Peacocke2019, p. 48.) The same credal state could just as well be specified by a different pdf. For example, there is no reason to regard the blue pdf from Figure 37 as privileged over the orange pdf. Neither pdf has more psychological reality than the other. Psychological reality resides in the underlying credal state—a credal allocation over hypotheses regarding possible speeds—rather than the pdf.
The “slow speed” prior is a credal state that allocates credences over hypotheses, where the hypotheses are individuated through the specific speeds that they represent. The pdf is a purely mathematical function that reflects a conventional choice of measurement units. The prior does not reflect any such conventional choice. The pdf is a useful tool for specifying the underlying credal state, but its mathematical elegance should not dazzle us into ascribing psychological reality to it. The credal state is psychologically real. The pdf is not psychologically real, and neither is the induced probability distribution over sets of real numbers.
6.3 Random Variables Revisited
We can clarify the distinction between credal states and mathematical tools by revisiting the notion of random variable.
Recall from Section 2.3 that a random variable X maps an outcome space Ω to the real numbers ℝ. For example, suppose the outcome space Ω contains possible speeds of an asteroid. Each outcome ω in Ω is a speed that the asteroid might have. Speeds are physical magnitudes, not real numbers. Assuming a canonical choice of measurement units, we can measure magnitudes using real numbers. Let X be a random variable that maps each speed to the corresponding real number, using meters/sec as canonical units. Thus,

X(ω) = x

when x specifies speed ω in meters/sec. X is a function from Ω (the set of possible speeds) to ℝ.
Given a random variable and an underlying outcome space Ω, we can use a probability distribution over sets of real numbers to specify a probability distribution over sets of outcomes. Continuing with the asteroid example, suppose we are given a probability distribution μ that assigns probabilities to sets of real numbers. Then we can use X and μ to assign probabilities to sets of speeds. For example, what probability should we assign to the event {ω : a ≤ X(ω) ≤ b}? This event codifies the hypothesis that the asteroid’s speed falls between a and b. If we are taking μ as a guide, we should assign the same probability to {ω : a ≤ X(ω) ≤ b} that μ assigns to the interval [a, b]. In other words, if P({ω : a ≤ X(ω) ≤ b}) is the probability assigned to {ω : a ≤ X(ω) ≤ b}, then we should have

P({ω : a ≤ X(ω) ≤ b}) = μ([a, b]).
As Figure 38 illustrates, we can use X to transfer the probability distribution μ defined over sets of real numbers into a probability distribution P defined over sets of speeds. More generally, and as discussed more rigorously in Section A3, we can always use a random variable to transfer a probability distribution over sets of real numbers into a probability distribution over sets with members drawn from the underlying outcome space.
Now consider a different random variable Y that maps each magnitude to the corresponding real number using feet/sec. Thus,

Y(ω) = y

when real number y specifies speed ω in feet/sec. Using the standard conversion from meters/sec to feet/sec, we obtain the following relation between X and Y:

Y(ω) ≈ 3.28 × X(ω).
The same magnitude ω is mapped to a different real number, depending on whether we are using meters/sec (corresponding to X) or feet/sec (corresponding to Y).
If we want to specify a fixed probability distribution P over sets whose members come from Ω itself, then X and Y mandate different probability distributions over sets of real numbers. Figure 37 illustrates. The pdf in blue generates one probability distribution over sets of real numbers. The pdf in orange generates a second probability distribution over sets of real numbers. Transferring the first probability distribution via X yields the same result as transferring the second probability distribution via Y. The very same probability distribution P over sets of speeds results if we use the blue pdf and X or if we use the orange pdf and Y.
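The invariance is easy to check numerically. The exponential "slow speed" prior below is a toy stand-in of my own, not the prior from the dead reckoning model:

```python
# Hypothetical numeric check of the point illustrated by Figure 37: the
# meters/sec pdf and the feet/sec pdf assign the same probability to
# corresponding intervals, so they specify one and the same credal state.
import numpy as np

C = 3.28                          # feet per meter (approximate)

def cdf_m(x, scale=2.0):          # toy exponential prior over speed, in meters/sec
    return 1.0 - np.exp(-x / scale)

def cdf_ft(y, scale=2.0 * C):     # the same prior, expressed in feet/sec
    return 1.0 - np.exp(-y / scale)

a, b = 1.0, 3.0                   # interval of speeds, in meters/sec
print(f"P(speed in [1.00, 3.00] m/s)  = {cdf_m(b) - cdf_m(a):.4f}")
print(f"P(speed in [{C*a:.2f}, {C*b:.2f}] ft/s) = {cdf_ft(C*b) - cdf_ft(C*a):.4f}")
# The two values coincide: different pdfs (blue vs orange in Figure 37),
# different random variables (X vs Y), same probability distribution over speeds.
```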
These observations complement my diagnosis from Section 6.2. Our choice of pdf reflects our choice of a random variable, which reflects our choice of measurement units. Different measurement units mandate a different pdf in order to specify the same probability distribution over sets with members drawn from the underlying outcome space. These facts, which are basic to probability theory, reflect the inherently arbitrary nature of measurement using real numbers. Many different measurement units are equally legitimate. Different units yield different pdfs and different probability distributions over sets of real numbers, but the underlying probability distribution over sets of outcomes remains fixed.
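To make the unit-invariance point concrete, here is a small computational sketch of my own (not drawn from any model in the literature). The Gaussian form of the prior, its parameters, and the interval endpoints are hypothetical choices made purely for illustration.

```python
# Illustrative sketch: one credal allocation over physical speeds, specified
# by two different pdfs corresponding to two choices of measurement unit.
# All numerical values are hypothetical.
from scipy.stats import norm

FT_PER_M = 3.28084                        # meters-to-feet conversion factor

# Hypothetical Gaussian prior over speed, expressed in meters/sec ...
prior_mps = norm(loc=0.0, scale=2.0)
# ... and the corresponding pdf when speed is expressed in feet/sec.
prior_fps = norm(loc=0.0, scale=2.0 * FT_PER_M)

# Credence that the speed lies between 1 and 3 meters/sec:
a_m, b_m = 1.0, 3.0
p_via_mps = prior_mps.cdf(b_m) - prior_mps.cdf(a_m)
# The same physical event, redescribed in feet/sec:
p_via_fps = prior_fps.cdf(b_m * FT_PER_M) - prior_fps.cdf(a_m * FT_PER_M)

print(p_via_mps, p_via_fps)               # equal, up to floating-point error
```

The two pdfs differ, but the credence assigned to the physical event (a speed between the two chosen magnitudes) is the same under either description.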
In the special case of Bayesian cognitive science, we seek to model credal allocations by an agent or an agent’s psychological subsystems (such as the perceptual system). The most convenient way to specify a credal allocation is usually through a pdf, as in Figures 33 and 34. The pdf is merely a tool for specifying a credal allocation over an underlying outcome space Ω. More precisely: the credal allocation assigns credences to sets whose members are drawn from Ω. The pdf depends upon an arbitrary choice of measurement units. The credal allocation does not. Psychological reality resides in the credal allocation, not the pdf.
6.4 The Objects of Credence
In Sections 6.2 and 6.3, I argued that the pdfs invoked by Bayesian cognitive scientists are mathematical tools for specifying credal states. A credal state assigns credences to sets of outcomes, where outcomes are drawn from an outcome space Ω. What are the outcomes? In other words, what are the elements of Ω? Answering this question is a large undertaking. I will broach a few preliminary considerations that should inform a more complete treatment.
Note first that Kolmogorov’s axiomatization does not tell us what outcomes are. Kolmogorov assigns probabilities to events: sets whose members belong to an outcome space Ω. He places no constraints whatsoever upon Ω’s members. Thus, the mathematical formalism of probability theory does not answer our question.
Explanatory practice within Bayesian cognitive science places some constraints upon Ω, but it does not dictate a unique answer. For example, the Bayesian dead reckoning model posits a prior over possible speeds, so outcomes must intimately relate somehow to speed. However, this constraint leaves room for various interpretations.
One interpretation is that Ω contains possible worlds. We would then construe events as sets of possible worlds. In the Bayesian dead reckoning model, the hypothesis that the creature moves with speed between a and b would be codified as the set of possible worlds where the creature moves with speed between a and b. In a Bayesian model of size perception, the hypothesis that the perceived object has size between a and b would be codified as the set of possible worlds where the perceived object has size between a and b. These codifications fit well with contemporary philosophical work, which often assigns credences to sets of possible worlds. More generally, they fit well with the longstanding philosophical tradition, mentioned in Section 2.1, of glossing propositions as sets of possible worlds.
A second interpretation is that Ω contains physical magnitudes. In the Bayesian dead reckoning model, the hypothesis that the creature moves with speed between a and b would be codified as the set of speeds between a and b. In a Bayesian model of size perception, the hypothesis that the perceived object has size between a and b would be codified as the set of sizes between a and b.
A third interpretation is that Ω contains mental representations. A mental representation is a mental item with representational properties. Mental representations are similar in key respects to the communal representations employed by human society, such as pictures, maps, or natural language sentences, but they are housed in the mind rather than the external world. They can be stored in memory, manipulated during mental activity, and combined to form complex representations. Appeal to mental representations is widespread in cognitive science theorizing (Carey, Reference Carey2009; Fodor, Reference Fodor1975; Fodor, Reference Fodor2008; Gallistel & King, Reference Gallistel and King2009; Pylyshyn, Reference Pylyshyn1984; Rescorla, Reference Rescorla, Smorthchkova, Schlicht and Dolega2020d). If we take Ω to contain mental representations, then we will construe events as sets of mental representations. In the Bayesian dead reckoning model, the hypothesis that the creature moves with speed between a and b would be codified as something like the set of mental representations that attribute speed between a and b. In a Bayesian model of size perception, the hypothesis that the perceived object has size between a and b would be codified as something like the set of mental representations that attribute size between a and b. Although these codifications may look odd to philosophers reared on the possible worlds interpretation, they fit nicely with the widespread cognitive science commitment to mental representations.
Each of the three interpretations is compatible with a realist perspective on Bayesian cognitive science. Moreover, each interpretation codifies hypotheses in representational terms. The first interpretation collects together those possible worlds where the hypothesis is veridical. The second interpretation collects together those physical magnitudes that are consistent with the veridicality of the hypothesis. The third interpretation collects together mental representations according to which the hypothesis is veridical. Thus, all three interpretations analyze credal states representationally—in terms of veridicality-conditions or representational properties that contribute to veridicality-conditions.
All three interpretations deserve detailed consideration, as no doubt do other interpretations. My own sympathies lie with the third interpretation, but I will not attempt to defend it here. My goal instead is to highlight the need for some interpretation. To understand the credal states posited by Bayesian cognitive science, we must identify the entities to which credences attach. We must identify the objects of credence. Assuming that credences attach to sets, our task is to identify which elements belong to the sets. By making progress on this task, we may hope to illuminate the representational nature of credal states.Footnote 13
6.5 How Many Outcomes?
In addition to studying what outcomes are, we must also consider how many outcomes there are. A set is countable when we can count its members using the numbers 0, 1, 2, 3, … . A set is uncountable when we cannot so count its members. A random variable is discrete when it has countably many possible values, nondiscrete when it has uncountably many possible values. The Bayesian dead reckoning model features a random variable whose possible values correspond to possible speeds of the navigator (specified through canonical measurement units). Even if we stipulate a maximum possible speed s, there are still uncountably many real numbers lying in the interval [0, s] and hence uncountably many possible speeds. So the random variable is nondiscrete, and the underlying outcome space Ω is uncountable. Similarly for Bayesian modeling of motion estimation (Weiss, Simoncelli & Adelson, Reference Weiss, Simoncelli and Adelson2002), size estimation (Ernst & Banks, Reference Ernst and Banks2002), motor control (Todorov & Jordan, Reference Todorov and Jordan2002), and numerous other tasks. In general, whenever cognitive scientists model Bayesian estimation of a physical magnitude that has uncountably many possible values (e.g. time, distance, speed, orientation, size), the resulting Bayesian model invokes a nondiscrete random variable X defined over an uncountable outcome space.
Taken literally, such a model attributes highly infinitary representational capacities. More specifically:
(i) The model posits credal states (a prior and a posterior) that assign probabilities to events of the form a ≤ X ≤ b. There are uncountably many such events, so the model posits a credal assignment over uncountably many events.
(ii) The model posits credal states drawn from among uncountably many possible options. This remains so even if we demand that credal assignments belong to a fixed parametric family, such as the family of Gaussian distributions.
(iii) The model posits a privileged estimate x* of X’s value, as in Figure 28. There are uncountably many possible values x*, so the model posits a privileged estimate selected from among uncountably many options.
Hence, the model attributes highly infinitary representational capacities when specifying both credal states and privileged estimates.
Some philosophers will bristle at these infinitary attributions. The attributions may look incompatible with obvious finitary limits on our representational or computational capacities. It might seem that we should dismiss (i)‒(iii) as mere idealizations, eventually to be obviated by a more plausible model that honors the finitary limits on human mental activity. Shouldn’t a plausible model restrict itself to a finite outcome space?
I agree that there are finitary limits of some sort on human representational and computational capacities. For example, we do not have infinite memory storage capacity: the mind cannot explicitly list infinitely many distinct pieces of information. Yet I wonder whether (i)‒(iii) flout any genuine finitary limits on human mental activity. As discussed in Section 5.4, computational neuroscience offers various theories of how the brain could, in principle, implement or approximately implement Bayesian inference. The theories are biologically plausible, and they fit well with diverse neurophysiological data. Several theories describe the brain as implementing a Bayesian model that satisfies (i)‒(iii). Those theories feature nondiscrete neural variables (e.g. membrane potential), which are taken to provide a substrate for credal states. Thus, (i)‒(iii) look compatible with lots of work in contemporary computational neuroscience.
The classical computational theory of mind (CTM) holds that mental activity is digital computation (Fodor, Reference Fodor1975; Reference Fodor1987; Reference Fodor2008; Gallistel & King, Reference Gallistel and King2009; Pylyshyn, Reference Pylyshyn1984; Rescorla, Reference Rescorla2020). A digital computing system has at most countably many possible computational states. Hence, CTM is incompatible with (ii) and (iii). However, CTM is compatible with (i). There is a well-developed framework—computable probability theory—that studies how digital computing systems can encode and compute over probability distributions (Ackerman, Freer & Roy, Reference Ackerman, Freer and Roy2019). In this framework, the computing system often satisfies (i) but not (ii) or (iii). The system encodes a credal assignment over uncountably many events, but there are only countably many possible credal assignments and privileged estimates x* available to the system. For example, the system may encode a Gaussian distribution, but there are only countably many distinct Gaussian distributions that it could have instead encoded (it can only encode a Gaussian whose mean and variance are drawn from a fixed countable set). In more practical terms, computer scientists and roboticists frequently program digital systems to compute over nondiscrete random variables (e.g. Thrun, Burgard & Fox, Reference Thrun, Burgard and Fox2005). These systems encode a wide range of probability distributions, including Gaussian distributions and many others besides. Their computations satisfy (i) though not (ii) and (iii). Thus, proponents of CTM can happily allow that the mind assigns credences to uncountably many events.
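The flavor of this point can be conveyed with a toy sketch of my own, far cruder than the constructions studied in computable probability theory. I assume, purely for illustration, that the system can encode only Gaussians whose mean and standard deviation lie on a small fixed grid.

```python
# Toy illustration: a digital system that can encode only countably many
# (here, finitely many) Gaussian credal states, yet each encoded state
# assigns a credence to every interval event a <= X <= b.
from scipy.stats import norm

# Fixed menu of encodable parameter values (hypothetical).
ENCODABLE_MEANS = [i / 10 for i in range(-50, 51)]   # -5.0, -4.9, ..., 5.0
ENCODABLE_SDS = [i / 10 for i in range(1, 31)]       # 0.1, 0.2, ..., 3.0

def credence(mean, sd, a, b):
    """Credence assigned to the event a <= X <= b by the encoded Gaussian."""
    assert mean in ENCODABLE_MEANS and sd in ENCODABLE_SDS
    dist = norm(loc=mean, scale=sd)
    return dist.cdf(b) - dist.cdf(a)

# Only countably many credal states are encodable, but any of the
# uncountably many interval events receives a definite credence:
print(credence(0.0, 2.0, 1.234567, 3.141592))
```

The sketch satisfies something like (i) without (ii) or (iii): interval events are uncountably many, but the encodable credal states form a fixed countable menu.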
Infinitary Bayesian models raise thorny questions at the intersection of philosophy, psychology, computation theory, and neuroscience.Footnote 14 I cannot hope to settle these questions here. For present purposes, the key point is that a realist representationalist perspective on Bayesian cognitive science admits several divergent reactions to an explanatorily successful Bayesian model defined over a nondiscrete random variable, including the following three reactions:
Accept the model at face value; embrace (i)‒(iii).
Guided by computable probability theory, emend the model by allowing only countably many of the credal states and estimates posited by the model; embrace (i) but not (ii) and (iii).
Try to replace the model with a purely finitary approximation; reject (i)‒(iii).
Each position is compatible with realism, which commits to credal states and transitions approximately like the ones posited by the model but does not insist that the model is literally true. Each position is compatible with representationalism, which champions the representational nature of credal states but does not mandate infinitary representational capacities.
7 Anti-representationalism
Anti-representationalists hold that we should expunge mental representation from rigorous scientific theorizing. They seek to explain mental and behavioral phenomena in strictly nonrepresentational terms. Different anti-representationalists favor different nonrepresentational paradigms:
Quine (Reference Quine1960) favors Skinnerian stimulus-response psychology.
Churchland (Reference Churchland1981) favors a neurophysiological paradigm.
Field (Reference Field2001) and Stich (Reference Stich1983) favor nonrepresentational computational description.
van Gelder (Reference van Gelder1992) favors dynamical system theory.
Despite these differences, anti-representationalists agree that mental representation makes no useful contribution to scientific theorizing about the mind.
Anti-representationalism conflicts with Bayesian cognitive science. As we have seen, Bayesian researchers routinely characterize explananda in representational terms. If we abjure representational discourse, then we cannot acknowledge those explananda. For example, anti-representationalists cannot replicate how the Bayesian dead reckoning model explains overshooting: overshooting is a representationally-characterized explanandum, because a subject can overshoot a location only when she has that location as a goal. Nor can anti-representationalists characterize a perceptual state as illusory: an illusion requires perceptual states with accuracy-conditions. Nonrepresentational theorizing ignores representational properties and hence cannot mention, let alone explain, representationally-characterized explananda. Since anti-representationalists cannot explain representationally-characterized explananda, they cannot replicate the explanatory benefits secured by Bayesian cognitive science.
Neither can anti-representationalists accept successful Bayesian explanations for nonrepresentational explananda. Suppose we characterize the results of dead reckoning in purely nonrepresentational terms. For example, we can identify the creature’s final position across various trials, without mentioning whether the position overshoots any target location. The Bayesian dead reckoning model explains the nonrepresentationally characterized explanandum. It does so by isolating causally relevant factors (including the “slow speed” prior) that influence position. This is a representational explanation: it explains a nonrepresentational explanandum (position) by citing a credal allocation over representationally-characterized hypotheses. Anti-representationalists cannot accept the explanation. Their anti-representationalist scruples forbid explanations that cite representational properties of mental states.
Anti-representationalists claim that we can replicate any purported benefits of representational explanation through alternative explanations couched in purely nonrepresentational terms. They claim that we can jettison mental representation while preserving the explanatory achievements enabled by representationally-characterized explanantia. The long, dismal history of anti-representationalist theorizing provides little basis for that claim. Anti-representationalists have consistently failed to match even the most elementary explanatory achievements of representationalist cognitive science. For example, Gibson’s (Reference Gibson1979) direct perception framework seeks to analyze perception in nonrepresentational terms, but it cannot explain a huge range of perceptual illusions and constancies (Fodor & Pylyshyn, Reference Fodor and Pylyshyn1981). Similar remarks apply to numerous other anti-representationalist theories that have flitted in and out of fashion over the past century.
In the present dialectical context, the key question is whether anti-representationalists can preserve the explanatory benefits of Bayesian models without invoking representational mental states. I doubt it. One cannot usually strip a scientific theory of its main theoretical concepts while retaining its explanatory benefits. For example, renouncing talk about subatomic particles would severely limit the explanatory power of physics. I see no reason to think that we can renounce talk about representational credal states while retaining the explanatory benefits provided by such talk. Consider the Bayesian dead reckoning model. It relies in an essential way upon the “slow speed” prior. By invoking this prior, the model achieves a much better fit with experimental data than the hitherto dominant “leaky integrator” model. The “slow speed” prior is characterized in representational terms. How, then, can we replicate its explanatory contribution while eschewing representational discourse? The principal explanatory advance made by the model was a posit of representational mental states.
To support my viewpoint, I will now critique two anti-representationalist interpretations of Bayesian cognitive science. The interpretations differ in various ways, but they agree that, contrary to what I suggested in Section 6, Bayesian models of the mind do not postulate representational mental states. I will explain why I think both interpretations are mistaken.
7.1 Function-theoretic Computation
Egan (Reference Egan2010; Reference Egan, Smorthchkova, Schlicht and Dolega2020) advocates a function-theoretic approach to mental computation: “The input of a computationally characterized mechanism represents the arguments and the outputs the values of a mathematical function that canonically specifies the task executed by the mechanism” (Reference Egan, Smorthchkova, Schlicht and Dolega2020, p. 33, fn. 7). A computational theory of a mechanism “comprises a specification of the function (in the mathematical sense) computed by the mechanism” (Reference Egan, Smorthchkova, Schlicht and Dolega2020, p. 33). Thus, computational psychology provides “an abstract mathematical description” that prescinds from representational properties of mental states (Egan, Reference Egan2010, p. 256). She admits that cognitive scientists frequently mention representational properties when describing mental states. She maintains that representational discourse “is best construed as a kind of gloss—an intentional gloss—on the computational theory” (Reference Egan, Smorthchkova, Schlicht and Dolega2020, p. 33). The intentional gloss plays a useful heuristic role in our theorizing: it helps us connect our computational description with representationally-characterized explananda; it helps us track how the computational mechanism responds to environmental events; and it can serve as a temporary placeholder until we discover an underlying computational mechanism. Representational properties do not figure in genuinely computational theories and are not necessary for good cognitive science explanations: “the computational theory proper can fully explain the interaction between organism and environment … without adverting to cognitive content” (Reference Egan, Smorthchkova, Schlicht and Dolega2020, p. 34).
Egan’s function-theoretic approach encompasses the following three doctrines, each of which I reject:
(a) Computational models of the mind mention inputs and outputs, but they do not mention internal states that mediate between inputs and outputs.
(b) Computational models describe inputs and outputs in purely mathematical terms, without mentioning any representational properties of the inputs or outputs.
(c) Representational discourse plays a purely heuristic role in cognitive science theorizing. It makes no genuine explanatory contribution.
I will critique doctrines (a)‒(c) in turn.
Doctrine (a) conflicts with huge amounts of cognitive science theorizing. Computational modeling by cognitive scientists routinely posits internal states that mediate between inputs and outputs. All the Bayesian models I have discussed above are examples. For example, Bayesian models of perception as encapsulated by Figure 28 posit three internal credal states: a prior probability, a prior likelihood, and a posterior (or approximate posterior). These credal states mediate between the input (proximal sensory stimulation) and the output (a privileged perceptual estimate of a distal property). More complex models, such as models of motor control, posit a sequence of credal states mediating between inputs and outputs. Evidently, Bayesian models commit to far more internal computational detail than (a) allows. The models do not merely describe a function from inputs to outputs. They say something informative about the internal states and transitions through which the system converts inputs into outputs.
Egan disagrees. She asserts that Bayesian models carry “no commitment to internal states or structures and causal processes defined on them” (Reference Egan, Smorthchkova, Schlicht and Dolega2020, p. 48) and that “Bayesian models, to the extent that they say anything about how the brain actually works, give … a function-theoretic characterization; they specify the function, in the mathematical sense, computed by the mechanism” (Reference Egan, Smorthchkova, Schlicht and Dolega2020, p. 50). She does not justify her analysis by adducing a single Bayesian model found in cognitive science. She does not attempt to reconcile her analysis with the commitment, apparently ubiquitous throughout Bayesian cognitive science, to credal states and transitions. She simply states, without evidence or argument, that Bayesian models are not committed to any internal states or processes.
Egan professes neutrality in the debate between realist and instrumentalist perspectives on Bayesian modeling (Reference Egan, Smorthchkova, Schlicht and Dolega2020, p. 49, fn. 22). Yet her analysis seems irreconcilable with even the most faintly realist perspective. On anything resembling realism, we should accept the existence of credal states and transitions mediating between inputs and outputs. Only if we adopt a strongly instrumentalist perspective may we regard a Bayesian model as specifying a mere function from inputs to outputs. I indicated in Section 5 why I favor realism over instrumentalism.
Doctrine (b) is also problematic, at least as applied to Bayesian modeling of the mind. Bayesian models routinely specify either inputs or outputs in representational terms:
Bayesian sensorimotor models specify a task goal as input to sensorimotor processing. The goal is set by a conative state with a fulfillment-condition. Hence, the Bayesian model presupposes a representationally-specified mental state.
Bayesian perceptual models usually yield as output a perceptual estimate of some distal property. Estimates can be accurate or inaccurate. An estimate that an object has a certain size is accurate only if the object has that size; an estimate that an object moves with a certain speed is accurate only if the object moves with that speed; and so on.
These representational descriptions are inherent to the computational model. They are all we have to go on when identifying the relevant inputs or outputs. For example, suppose a Bayesian model outputs an estimate of an object’s size. The model individuates the estimate through its representational relation to a specific distal size. If we abandon any reference to represented size, we abandon our only way of identifying the model’s outputs.
Doctrine (c) is similarly problematic. As I documented in Section 6, Bayesian models routinely individuate credal states in representational terms. Abandoning representational discourse leaves us with no way to identify the credal states postulated by the model and hence no way to replicate an explanation that cites those credal states. For example, suppose a Bayesian perceptual model posits a prior over distal size. If we refuse to mention sizes represented by the perceptual system, then we cannot identify the hypotheses to which the prior assigns credences, so we cannot cite the prior to explain anything. Accordingly, I disagree with Egan’s claim that “Bayesian models are typically not developed at a level of description that allows us to assess their representational commitments, in the relevant sense. They have no representational commitments, in the relevant sense” (Reference Egan, Smorthchkova, Schlicht and Dolega2020, p. 48). Once again, Egan does not provide any concrete examples to validate her assessment. She does not indicate, for even a single case, how we are to individuate credal states in nonrepresentational terms. The lack of detail is not surprising, since a nonrepresentational individuative scheme looks fundamentally incompatible with the core methodology of Bayesian cognitive science.
Egan is certainly correct that Bayesian cognitive science uses mathematical tools to characterize inputs, outputs, and mediating credal states. Inputs and outputs are typically described using real numbers. Mediating credal states are usually described using pdfs. So Bayesian modeling includes “abstract mathematical descriptions” somewhat along the lines favored by Egan. As explained in Sections 6.2 and 6.3, these mathematical descriptions reflect our arbitrary, conventional choice of measurement units. Psychological reality and explanatory power reside in the representational states specified by our mathematical descriptions, not in the mathematical descriptions themselves. For example, if we describe the perceptual system as estimating that an object has size s, the specific real number s reflects our arbitrary choice of units for measuring size. The number has no psychological reality. It explains nothing. What is psychologically real is that the perceptual estimate represents a specific size—a physical magnitude, not a real number. Similarly, if we describe a credal state using a pdf, the pdf reflects our arbitrary choice of measurement units. It has no psychological reality. It explains nothing. The credal state is what does the explaining.
The mathematical descriptions emphasized by Egan are artifacts of our measurement conventions. Different measurement units would yield a different mathematical description, including a different function from inputs to outputs, while leaving representational description the same. Representational description, not mathematical description, is the locus of psychological reality and explanatory power. For example, suppose we learn that a Bayesian dead reckoner estimates speed 5. Does that knowledge in itself help us explain overshooting? No. We need to specify measurement units! 5 meters/sec, or 5 feet/sec, or something else? The number 5 by itself is explanatorily irrelevant. What matters is the physical magnitude measured by 5—that is, the speed represented by the dead reckoner. The represented magnitude, not the number, is explanatorily important. Similar remarks apply to other mathematical descriptions found in Bayesian cognitive science, including specification of pdfs.
I critiqued Egan along these lines in previous work (Rescorla, Reference Rescorla and Matthen2015a). Egan deems my critique “very puzzling” (2020, p. 48) and retorts (2020, p. 50):
To think that commitment to Bayes’ theorem—a function defined on probability distributions—reflects an arbitrary choice of conventions is analogous to thinking that a claim that a device computes the addition function reflects a commitment to represent addends and sums in base 10. Contra Rescorla, to the extent that Bayesian models are to be construed realistically … such proposals should be construed as hypotheses about underlying psychological reality, committed, in particular, to the claim that the system is computing an approximation to Bayes’ theorem.
I respond as follows:
Bayes’s theorem is not “a function defined on probability distributions.” It is a theorem.
I agree with Egan that “commitment to Bayes’s theorem” does not “reflect an arbitrary choice of conventions.” Bayes’s theorem does not in any way depend for its truth upon our conventions.
As a realist about Bayesian cognitive science, I do indeed hold that the mind often computes an approximation to the posterior. I hold that at least some Bayesian models describe mental processes with at least approximate accuracy.
When we describe a device as computing the addition function, we are not committed to using base 10 notation. Nor are we committed to saying that the device uses base 10 notation. There are many possible numerical notations that a device might use to compute arithmetical functions.
When we describe a credal state using a pdf, our choice of pdf reflects our arbitrary measurement units for the represented environmental variable. Our description does not commit us to saying that the mind uses those particular measurement units. For example, it is highly unlikely that the human navigation system measures speed using meters/sec.
Egan claims that abstract mathematical description of mental computation has explanatory priority over representational description. This position is implausible because the mathematical description typically reflects an arbitrary choice of measurement units.
My conclusion: Egan’s response gives no reason to attribute any psychological reality to abstract mathematical descriptions or to question the explanatory centrality that I attribute to representational descriptions.
In summary, Egan’s function-theoretic conception does not fit well with Bayesian cognitive science because it neglects the crucial role that Bayesian modeling assigns to representational descriptions of explananda and explanantia. In place of representational descriptions, Egan commends abstract mathematical descriptions. Yet abstract mathematical descriptions reflect our own arbitrary measurement units and lack any psychological reality. Egan’s nonrepresentational approach cannot preserve the most basic explanatory achievements of Bayesian cognitive science.
7.2 Radical Enactivism
Hutto and Myin (Reference Hutto and Myin2017) espouse a radical enactivist approach to cognitive science. They view cognition as a dynamic interaction between an embodied brain and a changing environment. They “conceive of the basis of cognition in terms of extensive and dynamically loopy processes that are responsive to information in the form of environmental variables spanning multiple spatial and temporal scales” (p. 9). They also reject representationalism: they “construe cognition as unfolding, world-relating processes rather than as a series of content-bearing states and their interactions” (p. 9). They acknowledge that talk about veridicality-conditions is illuminating when applied to sophisticated symbolic communication (p. 90). They deny that it usefully contributes to theorizing about perception, motor control, or other relatively low-level psychological domains (pp. 12–13).
Hutto and Myin apply their radical enactivist approach to Bayesian cognitive science. Following Clark (Reference Clark2015) and Hohwy (Reference Hohwy2014), they focus almost exclusively upon a neural implementation framework called predictive coding. The basic idea behind predictive coding is that the brain generates a prediction regarding the sensory input it will receive. The brain compares its prediction with actual sensory input, computing a prediction error term. Prediction error informs subsequent computation, shaping future expectations so as to minimize future prediction error. Many predictive coding models have a hierarchical structure: higher levels of the network compute predictions about lower-level activity, and the lower level computes a prediction error term that is transmitted back to the higher level. There is nothing inherently Bayesian about predictive coding models, but when set up in the right way they can implement an approximation to Bayesian inference. This can be done either through parametric encoding (Friston, Reference Friston2010) or through sampling encoding (Lee & Mumford, Reference Lee and Mumford2003). Hutto and Myin use the label the Predictive Processing account of Cognition (PPC) to describe theories that implement approximate Bayesian inference through predictive coding.
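To fix ideas about the prediction-error loop, here is a schematic single-level sketch of my own. It is not a reproduction of any model from the predictive coding literature, and the signal value and learning rate are arbitrary.

```python
# Schematic single-level prediction-error loop (illustrative only).
# The "prediction" is repeatedly adjusted so as to reduce prediction error.
sensory_input = 5.0       # hypothetical incoming signal
prediction = 0.0          # initial prediction
learning_rate = 0.1       # arbitrary step size

for step in range(50):
    prediction_error = sensory_input - prediction    # compare prediction with input
    prediction += learning_rate * prediction_error   # update to reduce future error

print(prediction)         # approaches 5.0: prediction error has been minimized
```

Hierarchical PPC models iterate this kind of comparison across levels, with each level predicting activity at the level below.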
Hutto and Myin offer a radical enactivist interpretation of PPC. Their core interpretive claim is that we need not gloss talk about “prediction” and “expectation” in representational terms. They write: “Having expectations about what we will experience sensorily need not be thought of as involving the making of any kind of contentful claim about the state of the world. Nor need we think of sensory perturbations that are involved in such matches and mismatches as supplying rich contentful messages that contradict the content of our expectation” (pp. 70–71). Accordingly, we need not interpret PPC models representationally: “our expectations can fail to match incoming sensory experience without this activity being construed as a content-based operation” (p. 71). They conclude that PPC provides no support for representationalism.
I agree with Hutto and Myin that, in many cases, we should not interpret PPC talk about “prediction” and “expectation” in representational terms. I agree that, in many cases, we should not describe the mind as “representing” expected experiences. As Burge (Reference Burge2010, pp. 367–463) notes, there is no evidence that the perceptual system represents proximal sensory stimulations. The perceptual system converts nonrepresentational sensory stimulations into perceptual representations, without representing the stimulations. There is no explanatory benefit to saying that the perceptual system represents its own sensory input. When a PPC neural network compares predicted sensory input with actual sensory input, we usually should not describe the comparison in representational terms. We should instead say that the network compares an input signal with a feedback signal generated by a higher level of the network. We may describe both signals in neural terms, e.g., as firing rates, and we may describe the “prediction error” computation as a neurophysiological operation on those signals. Representational properties play no role in characterizing the “prediction error” computation.
A similar diagnosis applies to higher levels in hierarchical PPC models, such as the celebrated Rao and Ballard (Reference Rao and Ballard1999) model. Each level compares neural activity with a feedback “prediction” signal received from a higher level, computing an “error” term subsequently transmitted to the higher level. The feedback signal and the “error” computation can again be described in nonrepresentational, neurophysiological terms.
Typically, then, we should not describe a PPC neural network as representing its inputs or its own neural activity. The network receives but does not represent proximal sensory inputs. It instantiates but does not represent neural activity. Talk about “prediction” and “expectation” may be harmless enough for some purposes, but I agree with Hutto and Myin that we achieve no explanatory gain by glossing this talk in representational terms.Footnote 15
However, the nonrepresentational interpretation of prediction talk is doubly irrelevant to representationalism about Bayesian cognitive science.
First, we should not focus exclusively on PPC models. Most Bayesian modeling is not tied to the PPC research program. Most Bayesian models found in cognitive science are neutral about neural implementation mechanisms. Many promising implementation schemes, such as the schemes discussed in Ma et al. (Reference Ma, Beck, Latham and Pouget2006) and Orbán et al. (Reference Orbán, Berkes, Fiser and Lengyel2016), do not feature anything like predictive coding (Rescorla, Reference Rescorla2017; Rescorla, Reference Rescorla, Cheng, Sato and Hohwy2024). Thus, the interpretation of PPC modeling is distinct from the interpretation of Bayesian modeling more generally. Hutto and Myin give no reason for focusing narrowly on PPC to the exclusion of generic Bayesian modeling. Indeed, their exposition tends to elide the difference between PPC and Bayesian cognitive science (e.g. pp. 150–151). Although predictive coding has received considerable recent attention in the philosophical community, empirical support for it remains equivocal (Aitchison & Lengyel, Reference Aitchison and Lengyel2017). In my opinion, we currently have no reason to suspect that approximate Bayesian inference is typically implemented in PPC fashion.
Second, and more importantly, representationalists about Bayesian cognitive science do not claim that the mind represents either sensory input or neural activity. Representationalists claim that the mind represents environmental conditions, including both distal conditions and bodily state. Assume for the sake of argument that a neural system implements approximate Bayesian inference through a predictive coding implementation mechanism. We will describe the system using a Bayesian model, which posits credal states and transitions, and a PPC model, which specifies how the Bayesian model is neurally implemented by a predictive coding mechanism. I agree with Hutto and Myin that there is no reason to think that the system represents its own inputs or own neural activity. Nevertheless, we have strong reason to think that the system represents the environment. We have strong reason to describe the system’s credal states in representational terms, as allocating credences over representationally-individuated hypotheses. Only then can we preserve explanations that rely on representationally-characterized credal states. For example, how can we explain overshooting in dead reckoning unless we posit a prior that favors slower speeds? I have no idea how enactivists would interpret the “slow speed” prior in nonrepresentational terms, let alone how the ensuing explanations would work.
Hutto and Myin (pp. 151–155) express skepticism about my representationalist interpretation of Bayesian models. They do not provide a developed alternative interpretation. They do not indicate how to gloss credal states and transitions in nonrepresentationalist enactivist terms. In fact, they barely discuss credal states: they mention priors a mere handful of times, and they do not mention posteriors at all. They do not analyze a single specific Bayesian model of mental activity, even in the most schematic way. Their treatment gives no hint how enactivists might eschew representational vocabulary while preserving the explanatory power of Bayesian modeling.
7.3 Interpreting Bayesian Cognitive Science
When philosophers interpret a scientific theory, they often employ theoretical notions (such as veridicality-condition) that play no explicit role in scientific discourse. Inevitably, there is a gap between philosophical interpretation and scientific texts. Still, some interpretations usually fit much better with scientific practice than others.
In the present case, a representationalist interpretation fits much better with Bayesian cognitive science than the function-theoretic interpretation offered by Egan or the enactivist interpretation offered by Hutto and Myin. The representationalist interpretation describes how to interpret the priors and posteriors that figure so prominently in Bayesian theorizing. The function-theoretic and enactivist conceptions say virtually nothing about how to interpret priors and posteriors, save perhaps to dismiss them in instrumentalist fashion as useful fictions. The representationalist interpretation analyzes in quite precise detail the explanations offered by Bayesian cognitive scientists, such as the explanations embodied by Figures 33 and 34. The function-theoretic and enactivist interpretations have little if anything to say about those explanations. Absent more compelling anti-representationalist interpretations, the representationalist interpretation looks secure.
8 Conclusion
The mind operates amid constant uncertainty stemming from multiple sources, including noise, ambiguous input, and conflicting sensory cues. Bayesian cognitive science postulates that the mind grapples with uncertainty by implicitly encoding credal assignments over hypotheses. The encoded credences influence inference and decision-making, roughly in accord with Bayesian norms. The Bayesian program draws support from strong empirical evidence across a range of psychological domains.
I have analyzed Bayesian modeling from a realist representationalist perspective that takes seriously the postulation of credal states and transitions. Realists hold that, when a Bayesian model is explanatorily successful, we have good reason to accept the existence of credal states and transitions roughly like those posited by the model. Representationalists hold that the posited credal states assign credences to hypotheses individuated through their representational properties. The realist representationalist interpretation fits much better with scientific practice than do rival instrumentalist or anti-representationalist interpretations.
Throughout my discussion, I have highlighted foundational questions raised by the Bayesian paradigm. Which mental processes approximately conform to Bayesian norms, and which do not? How do nature and nurture jointly influence priors employed by the mind? How are credal states neurally implemented? How does the brain transition from one credal state to another? What computational strategies does it use to approximate intractable Bayesian inferences? What is it to attach a credence to a hypothesis? Given that hypotheses are sets of outcomes, what exactly are the outcomes? How literally should we construe an infinitary Bayesian model built atop an uncountable outcome space? Ongoing research into these and other foundational questions promises to illuminate how the representational mind, by approximating rational norms, copes with perpetual uncertainty.
Appendix: Foundations of Probability Theory
This appendix presents some key probabilistic concepts as they relate to Bayesian modeling. It serves as a more mathematically rigorous complement to the informal exposition from Sections 2 and 3.
A few preliminary definitions are in order. Let A and B be sets. A function f from A to B is an injection iff f(a) and f(b) are distinct whenever a and b are distinct. A function f from A to B is a surjection iff, for each b in B, there exists a in A such that f(a) = b. A bijection is a function that is an injection and a surjection. ℕ is the set of natural numbers: {0, 1, 2, 3, …}. A is infinite iff there exists an injection from ℕ to A. A is countably infinite iff there exists a bijection from ℕ to A. A is countable iff it is finite or countably infinite. A is uncountable iff it is infinite but not countably infinite. ℝ is the set of real numbers. [a, b] is the closed interval {x ∈ ℝ : a ≤ x ≤ b}. ℝⁿ is the set of n-tuples drawn from ℝ, that is,

ℝⁿ = {(x1, x2, …, xn) : each xi belongs to ℝ}.

Recall that ℝ is uncountable, as is ℝⁿ, and that [a, b] is uncountable whenever a ≠ b.
A1 Measurable Spaces
In Kolmogorov’s axiomatization, probabilities attach to sets whose members are drawn from an outcome space Ω. The powerset of Ω is the set containing all subsets of Ω. We notate it as ℘(Ω). When Ω is finite, we can assign probabilities to all members of ℘(Ω). When Ω is uncountable, it is often impossible to assign intuitively plausible probabilities to all members of the powerset (Proschan & Shaw, Reference Proschan and Shaw2016, pp. 17–35). Instead, probability theorists assign probabilities to certain privileged members of ℘(Ω). The privileged members, called events, form a σ-field over Ω. A σ-field over Ω is a subset F of ℘(Ω) such that:
Ω belongs to F.
If H belongs to F, then its complement Ω − H belongs to F.
If H1, H2, H3, … belong to F, then their union H1 ∪ H2 ∪ H3 ∪ … also belongs to F.
The union H1 ∪ H2 ∪ H3 ∪ … is the set containing all elements that belong to at least one of the sets Hn. There may be a countable infinity of sets Hn.
We typically choose a σ-field that arises organically from our interests. For example, suppose we are modeling an asteroid’s speed using outcome space Ω = ℝ. A natural question is whether the asteroid’s speed x falls in the interval [a, b]. We would like to assign probabilities to all these intervals. At the very least, then, our σ-field should contain every interval [a, b]. Consider the minimal σ-field containing all intervals [a, b]. Call it B. Intuitively: we throw just enough sets into B to ensure that B contains each interval [a, b] and is closed under complementation and countable union. B’s members are called the Borel sets. B usually serves as the most natural σ-field when the outcome space is ℝ. Similarly, suppose that the outcome space is ℝ², i.e., the set of ordered pairs of real numbers. Consider the minimal σ-field containing all rectangles. Elements of this σ-field are again called Borel sets. The same construction generalizes to ℝⁿ, for arbitrary n.
An outcome space Ω along with a σ-field F form a measurable space, typically notated as (Ω, F).
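For readers who want to see the closure conditions at work, the following sketch checks them by brute force on a toy finite outcome space; the outcome space and the candidate family of sets are invented for illustration.

```python
# Brute-force check that a candidate family of sets is a sigma-field over a
# small finite outcome space (illustrative example only).
from itertools import combinations

OMEGA = frozenset({1, 2, 3, 4})
# Candidate family: generated by the two-cell partition {1, 2} / {3, 4}.
F = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), OMEGA}

def is_sigma_field(family, omega):
    if omega not in family:                               # the outcome space itself
        return False
    if any(omega - h not in family for h in family):      # closure under complement
        return False
    for r in range(2, len(family) + 1):                   # closure under union
        for sets in combinations(family, r):
            if frozenset().union(*sets) not in family:
                return False
    return True       # for a finite family, finite unions settle countable unions

print(is_sigma_field(F, OMEGA))   # True
```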
A2 Probability Measures
We now consider a function P that assigns probabilities to events belonging to F. For each H in F, P(H) is the probability assigned to H. As indicated in Section 2.2, Kolmogorov places three axiomatic constraints on P. Here are the first two axioms:

0 ≤ P(H) ≤ 1, for every H in F.

P(Ω) = 1.

As for the third axiom (additivity), recall my formulation from Section 2.2:

P(H1 ∪ H2) = P(H1) + P(H2)

when H1 and H2 are disjoint. This formulation is called finite additivity. Assuming finite additivity, one can easily prove:

P(H1 ∪ H2 ∪ … ∪ Hn) = P(H1) + P(H2) + … + P(Hn) (5)

when H1, H2, …, Hn is a finite list of pairwise disjoint events. See Figure 39. Kolmogorov assumes a stronger axiomatic constraint that generalizes (5) to a potentially infinite list of pairwise disjoint events H1, H2, H3, … . The stronger constraint, called countable additivity, demands that:

P(H1 ∪ H2 ∪ H3 ∪ …) = P(H1) + P(H2) + P(H3) + …

where H1 ∪ H2 ∪ H3 ∪ … is the countable union of the Hn. Kolmogorov’s axiomatization employs countable additivity as opposed to mere finite additivity.
Countable additivity offers an important advantage over mere finite additivity: it constrains the probabilities assigned to many more events. Using countable additivity, we can extrapolate probability assignments from elementary events (e.g. intervals of real numbers) to numerous complex events left unaddressed by mere finite additivity (e.g. countable unions of disjoint intervals). Accordingly, countable additivity is widely assumed within probability theory (Billingsley, Reference Billingsley1995). Some versions of Bayesian decision theory employ only finite additivity (e.g. de Finetti, Reference de Finetti1972; Savage, Reference Savage1972), but most versions assume countable additivity (e.g. DeGroot, Reference DeGroot1970; Easwaran, Reference Easwaran2013; Ghosal & van der Vaart, Reference Ghosal and van der Vaart2017; Gelman et al., 2014). In a Bayesian context, the dispute between finite and countable additivity is a normative one. It concerns the norms governing rational allocation of credence over a hypothesis space. Proponents of countable additivity claim that rational credences should be countably additive, while opponents maintain that rational credences need only be finitely additive. For discussion of finite versus countable additivity in the Bayesian context, see Liu (Reference Liu2020).
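Countable additivity is easy to illustrate numerically. In the following sketch (my own example), the pairwise disjoint events Hn have geometrically decreasing probabilities, and the probability of their countable union is fixed by the infinite sum.

```python
# Illustration of countable additivity: H_n is the event "first success occurs
# on trial n" of a fair coin, with P(H_n) = (1/2) ** n. The H_n are pairwise
# disjoint, and their countable union is the event "a success eventually
# occurs", whose probability is the sum of the series, namely 1.
total = 0.0
for n in range(1, 60):
    total += 0.5 ** n      # partial sums of the series

print(total)               # approaches 1, the probability of the countable union
```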
When a probability assignment P satisfies all three axioms (including countable additivity), it is called a probability measure (or a probability distribution), and (Ω, F, P) is called a probability space.
A3 Random Variables Defined Rigorously
Let X be a function from Ω to ℝ. To assign probabilities to hypotheses regarding X’s possible values, we must ensure that our σ-field F contains all the hypotheses. For each set B of real numbers, let

X⁻¹(B) = {ω ∈ Ω : X(ω) ∈ B}.

X is a random variable on the probability space (Ω, F, P) iff

X⁻¹(B) ∈ F, for every Borel set B.

This condition ensures that, for each Borel set B, F contains the hypothesis that X’s value falls within B. For example, F includes each event X⁻¹([a, b]). For any real number x, the event

X⁻¹({x}) = {ω ∈ Ω : X(ω) = x}

is typically notated as

X = x.

We may write P(X = x) for the probability that X has value x (e.g. the probability that the asteroid has speed x). Similarly, the event

{ω ∈ Ω : X(ω) ≠ x}

is typically notated as

X ≠ x.

We may write P(X ≠ x) for the probability that X does not have value x.
Given a probability space (Ω, F, P) and a random variable X, we can define a probability measure μ over the measurable space (ℝ, B):

μ(B) = P(X⁻¹(B)), for every Borel set B.

Figure 38 illustrates the construction for a special case. μ is called X’s distribution. It is often easier to work with probability measures over (ℝ, B) than with probability measures over (Ω, F), especially when Ω is complicated.

These definitions generalize from ℝ to ℝⁿ. The definitions are the same, except that we consider Borel sets over ℝⁿ rather than ℝ.
Given a function X from Ω to ℝ and a probability measure μ over (ℝ, B), we can use X and μ to define a probability space with Ω as the outcome space. Define σ(X), the σ-field generated by X, by

σ(X) = {X⁻¹(B) : B is a Borel set}.

Define a probability measure P over σ(X) by

P(X⁻¹(B)) = μ(B), for every Borel set B.

Then (Ω, σ(X), P) is a probability space, and X is a random variable defined on (Ω, σ(X), P). This procedure generalizes from ℝ to ℝⁿ.
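The transfer construction can be seen in miniature on a toy discrete outcome space, as in the following sketch; the outcomes, their probabilities, and the measurement values are invented for illustration.

```python
# Illustrative transfer of a probability assignment through a random variable.
# Outcomes are toy "speeds" represented as labels; X maps each outcome to a
# real number (hypothetical measurement in some canonical unit).

P = {"slow": 0.5, "medium": 0.3, "fast": 0.2}    # probabilities over outcomes
X = {"slow": 1.0, "medium": 2.0, "fast": 4.0}    # the random variable

def mu(indicator):
    """Distribution of X: mu(B) = P(X^{-1}(B)), with B given by an indicator."""
    return sum(prob for outcome, prob in P.items() if indicator(X[outcome]))

# Probability that X falls in the interval [0, 2.5]:
print(mu(lambda x: 0.0 <= x <= 2.5))             # 0.8 = P({"slow", "medium"})
```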
A4 Discrete and Nondiscrete Random Variables
A discrete random variable has countably many possible values. Many random variables encountered in scientific applications are nondiscrete, that is, they have uncountably many possible values.
Here is a fundamental constraint on random variables: at most countably many values x of random variable X can receive positive probability. In other words, there are at most countably many real numbers x such that

P(X = x) > 0.
To prove this statement, let us for each natural number n ≥ 1 define An as follows:

An = {x : P(X = x) > 1/n}.

Suppose for purposes of reductio that An has at least n members x1, x2, …, xn. The events X = xi and X = xj are disjoint when i ≠ j. By finite additivity,

P((X = x1) ∪ (X = x2) ∪ … ∪ (X = xn)) = P(X = x1) + P(X = x2) + … + P(X = xn).

Each individual term P(X = xi) is greater than 1/n, so the sum on the right is greater than

n · (1/n) = 1,

which contradicts our axiomatic assumption that 1 is the maximal probability. By reductio, each set An contains fewer than n members. See Figure 40. Using set theory, one can then show that the union A1 ∪ A2 ∪ A3 ∪ … is at most countably infinite. Every x such that P(X = x) > 0 must belong to some set An and hence must belong to A1 ∪ A2 ∪ A3 ∪ …. Therefore, there are at most countably many x such that P(X = x) > 0. Note that our proof uses only finite additivity, with no need for countable additivity.
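The constraint can be checked concretely for any particular discrete assignment, as in the following sketch with invented probabilities.

```python
# Illustrative check: for a hypothetical probability assignment over values of X,
# fewer than n values can each receive probability greater than 1/n (here n = 4).
probs = {0.1: 0.35, 0.2: 0.30, 0.3: 0.20, 0.4: 0.10, 0.5: 0.05}   # P(X = x), invented
n = 4
A_n = [x for x, p in probs.items() if p > 1 / n]   # values with probability > 1/n

print(len(A_n), len(A_n) < n)   # only 2 values exceed 1/4, and 2 < 4
```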
Many philosophers endorse the doctrine, sometimes called Regularity, that agents should assign credence 0 only to impossible hypotheses (Kemeny, Reference Kemeny1955; Skyrms, Reference Skyrms1995; Stalnaker, Reference Stalnaker1970). The idea is that, if H is in some sense possible, then a rational agent will acknowledge its possibility by allocating it at least some nonzero credence. The foregoing proof shows that Regularity dramatically conflicts with the probability calculus axioms, no matter how exactly we gloss “possibility.” The axioms demand that at most countably many values of a random variable receive nonzero probability. When a random variable is nondiscrete, uncountably many of its possible values must receive probability 0. This remains so even if one favors finite additivity over countable additivity. In response, Skyrms (Reference Skyrms1980) recommends that we preserve Regularity by revising the probability calculus axioms. The recommendation has not found much uptake within probability theory or its scientific applications, including Bayesian applications. Scientific practitioners of the Bayesian framework routinely set P(X = x) = 0 for uncountably many possible values x. So Regularity conflicts not just with orthodox probability theory but also with scientific practice.
These observations prompt us to reflect upon the meaning of extremal credences 0 and 1. Let X be a nondiscrete random variable, such as asteroid speed, and suppose that an agent sets P(X = x) = 0. The probability calculus axioms demand that the agent also set P(X ≠ x) = 1. Certainty in the event X ≠ x does not entail that the agent regards value x as impossible. The agent fully realizes that the asteroid may have speed x. By assigning probability 0 to speed x, the agent does not completely reject the possibility of speed x. She merely regards this possibility as so negligible that it merits no positive credence. Assuming the agent’s credences conform to the probability calculus axioms, she must similarly regard uncountably many other values of X as negligible possibilities.
A random variable is said to be continuous when P(X = x) = 0 for all x. A continuous random variable violates Regularity in a very extreme way: every event X = x receives probability 0. Note that some random variables are neither discrete nor continuous (Billingsley, Reference Billingsley1995, pp. 257–258): such a variable has uncountably many possible values x, and certain values x receive positive probability.
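The contrast between point events and interval events can be displayed with a Gaussian example (parameters arbitrary), as in the following sketch.

```python
# Illustration: for a continuous random variable, every point event X = x
# receives probability 0, while interval events receive positive probability.
from scipy.stats import norm

X = norm(loc=0.0, scale=1.0)          # hypothetical standard Gaussian variable

p_point = X.cdf(1.0) - X.cdf(1.0)     # probability of the point event X = 1
p_interval = X.cdf(1.1) - X.cdf(0.9)  # probability of the event 0.9 <= X <= 1.1

print(p_point, p_interval)            # 0.0 and a positive number
```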
A5 Probability Density Functions
A probability density function (pdf) is a nonnegative function p(x) from ℝ to ℝ such that the area under the curve is 1:

∫ p(x) dx = 1,

where the integral is taken over the whole real line. A pdf p(x) induces a probability measure μ over (ℝ, B). The probability assigned by μ to the interval [a, b] is the area under p(x) stretching from a to b:

μ([a, b]) = ∫_a^b p(x) dx.

Probability assignments to the intervals [a, b] determine unique probability assignments to all Borel sets. Thus, each pdf induces a unique probability measure μ.
Because probability density determines probability via integration, changes to the pdf do not affect probabilities when they do not affect integration. Compare Figure 9 with Figure 41. These are two different pdfs: they assign different densities to c. Nevertheless, they induce the same probabilities, because a change in density at a single point does not affect integration. More generally: when a pdf p(x) induces probability measure μ, there are infinitely many distinct pdfs that induce the same measure μ.
Suppose random variable X is defined on probability space (Ω, F, P) with distribution μ. Suppose μ is induced by a pdf p(x). It is not hard to show that X is a continuous random variable: P(X = x) = 0 for all x (Billingsley, Reference Billingsley1995, p. 212). Equivalently, μ({x}) = 0 for all x. The converse is not true: in some cases, the distribution of a continuous random variable is not induced by any pdf (Proschan & Shaw, Reference Proschan and Shaw2016, pp. 94–95). See Figure 42.
Given a random variable X defined on probability space (Ω, F, P), define a new random variable Y resulting from multiplication by a constant k:

Y(ω) = k · X(ω).

Suppose that X’s distribution has a pdf p(x). One can show that Y’s distribution has a pdf q(y) given by

q(y) = (1/|k|) · p(y/k). (6)

See Ma, Kording & Goldreich (Reference Ma, Kording and Goldreich2023, pp. 333–336). The change in variable (from X to Y) necessitates a change in pdf.
To illustrate, let Ω contain possible speeds for an object. Each outcome ω is a particular speed that the object might have. Assume that Ω is endowed with an appropriate σ-field F. Assume also an underlying probability measure P defined on F. Speeds are physical magnitudes and hence are distinct from real numbers (Peacocke, Reference Peacocke2019). We can describe physical magnitudes with real numbers by choosing measurement units, such as meters/sec or feet/sec. The first choice of measurement unit corresponds to one random variable X from Ω to ℝ. The second choice corresponds to a second random variable Y from Ω to ℝ, where

Y(ω) ≈ 3.28 · X(ω).

The underlying probability measure P induces different distributions for X and Y. If the pdf for X is given by p(x), then the pdf for Y is given by (6), taking k ≈ 3.28. Figure 37 illustrates. The blue pdf corresponds to meters/sec. The orange pdf corresponds to feet/sec. The two pdfs are associated with the same underlying probability measure over possible speeds.
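Formula (6) can be checked numerically for the meters/sec versus feet/sec example. In the following sketch, the Gaussian pdf for X and its parameters are hypothetical choices made purely for illustration.

```python
# Numerical check of the change-of-variable formula (6), with X measured in
# meters/sec and Y = K * X measured in feet/sec (illustrative parameters).
from scipy.stats import norm

K = 3.28084                           # feet per meter (positive, so |K| = K)
p = norm(loc=2.0, scale=0.5).pdf      # hypothetical pdf for X

def q(y):
    """Pdf for Y = K * X, obtained from p via formula (6)."""
    return (1.0 / K) * p(y / K)

# Y is Gaussian with mean K * 2.0 and standard deviation K * 0.5, so formula (6)
# should agree with that Gaussian's pdf at any point:
y = 7.0
print(q(y), norm(loc=K * 2.0, scale=K * 0.5).pdf(y))   # equal
```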
A6 Conditional Density and Beyond
Using conditional densities, we can extend the notion of conditional probability well beyond the elementary case where the ratio formula prevails.
Suppose we are given a two-dimensional pdf . We want to define a new pdf over y conditional on X having value a. So we want to define a one-dimensional conditional density over y, which we may notate as:
Intuitively: this is a density over y given that X has value a. To define , we confine attention to points such that . We consider p’s values on those points alone:
One might hope to set the conditional density equal to , where we hold a fixed and allow y to vary. The only hitch is that , viewed as a function of y, may not be a pdf: the area under the curve may not be 1. We must settle for proportionality rather than equality:
Intuitively, $p(y \mid X = a)$ confines attention to outcomes where $X = a$ and then allocates probability density in proportion to the original density function $p(x, y)$. To obtain $p(y \mid X = a)$ from $p(a, y)$, we need merely divide by a constant to ensure that the area under the curve is 1. This constant is called a normalization constant.
More formally, we may define conditional density as follows. Take $p(x, y)$ as given and define $p_X(x)$, the marginal pdf for X:
$$p_X(x) = \int_{-\infty}^{\infty} p(x, y)\,dy.$$
$p_X(x)$ is computed by holding $x$ fixed and integrating over all possible values of $y$. Assuming that $p_X(x) > 0$, we may define the conditional density of Y given $X = x$ by the equation
$$p(y \mid X = x) = \frac{p(x, y)}{p_X(x)}. \tag{7}$$
$p_X(x)$ is the normalization constant: it ensures that the conditional density integrates to 1. When it is clear which random variable X is at issue, we may notate (7) more compactly as
$$p(y \mid x) = \frac{p(x, y)}{p(x)}. \tag{8}$$
$p(y \mid x)$ results from $p(x, y)$ by holding $x$ fixed and then normalizing. See Figures 18, 19, 20, and 21. These definitions generalize to higher dimensions.Footnote 16
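The normalization recipe is easy to mimic numerically. The sketch below is my own illustration: an arbitrary correlated bivariate Gaussian stands in for $p(x, y)$; the code fixes $x = a$, slices the joint density, and divides by the slice’s area to obtain an approximation of $p(y \mid X = a)$.

import numpy as np

# Illustrative joint density p(x, y): bivariate Gaussian with unit variances and correlation rho.
def p(x, y, rho=0.6):
    z = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

ys = np.linspace(-6.0, 6.0, 2001)
dy = ys[1] - ys[0]
a = 1.0  # condition on X = a

slice_ = p(a, ys)                  # p(a, y): not yet a pdf in y
normalizer = (slice_ * dy).sum()   # approximates the marginal value p_X(a)
cond = slice_ / normalizer         # approximate conditional density p(y | X = a)

print((cond * dy).sum())           # ~1.0 after normalization
print(ys[np.argmax(cond)])         # peak near rho * a = 0.6 for this Gaussian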
It is often most natural to regard $p(x)$ and $p(y \mid x)$ as primitive rather than defined. For example, $p(x)$ might be a pdf for asteroid speed and $p(y \mid x)$ might be the conditional density of measuring speed $y$ given that the asteroid has speed $x$. Taken together, $p(x)$ and $p(y \mid x)$ determine a joint density $p(x, y)$: we simply view equation (8) as a definition of $p(x, y)$ rather than of $p(y \mid x)$. In practice, we need not usually consider the joint density. It lies in the background of our theorizing, but we only explicitly consider $p(x)$, $p(y \mid x)$, and $p(x \mid y)$.
The ratio formula and the theory of conditional densities suffice for most applications of Bayesian decision theory. However, there are situations where we would like to define conditional probabilities yet neither the ratio formula nor the theory of conditional densities applies. To illustrate with a cognitive science example, consider the Bayesian causal inference model given by Kording et al. (Reference Kording, Beierholm, Ma, Quartz, Tenenbaum and Shams2007). The model evaluates whether visual input $v$ and auditory input $a$ derive from a single distal source. C is a binary random variable that registers the number of sources: $C = 1$ registers a single source, and $C = 2$ registers two distinct sources. Upon receiving inputs $v$ and $a$, the model computes the posterior probability that those inputs derive from a single distal source. Assume that there are uncountably many possible inputs, as Kording et al. (Reference Kording, Beierholm, Ma, Quartz, Tenenbaum and Shams2007) do and as is standard in Bayesian perceptual psychology. Then we cannot define conditional probabilities using either the ratio formula or conditional densities. The ratio formula does not apply because there is probability zero of any given input pair $(v, a)$, except perhaps for countably many such pairs. Nor does the theory of conditional densities apply: C is discrete, so no joint density exists. As this example illustrates, a general theory of conditional probability must look beyond both the ratio formula and conditional densities.
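In practice, modelers compute such mixed discrete/continuous posteriors by weighting the discrete prior over C with likelihood densities and normalizing, a recipe vindicated by the more general theory discussed next. The following sketch shows only that arithmetic; the Gaussian likelihood densities and all parameter values are stand-ins of my own, not the actual parameterization of Kording et al. (Reference Kording, Beierholm, Ma, Quartz, Tenenbaum and Shams2007).

import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-(x - mean)**2 / (2 * sd**2)) / (sd * np.sqrt(2 * np.pi))

# Received inputs and prior (arbitrary illustrative values).
v, a = 1.0, 2.5      # visual and auditory inputs
prior_common = 0.5   # prior probability that C = 1 (a single source)

# Stand-in likelihood densities: under C = 1, both inputs are treated as noisy
# readings of one source; under C = 2, as readings of two independent, more
# dispersed sources. These are placeholders, not the published model.
like_common = normal_pdf(v, 0.0, 1.0) * normal_pdf(a, 0.0, 1.0)
like_separate = normal_pdf(v, 0.0, 3.0) * normal_pdf(a, 0.0, 3.0)

# Posterior probability of a single source: discrete prior weighted by densities.
numerator = like_common * prior_common
posterior_common = numerator / (numerator + like_separate * (1 - prior_common))
print(posterior_common)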
The most successful general theory traces back to the same treatise where Kolmogorov (Reference Kolmogorov1933/1956) codified the probability calculus axioms. The ratio formula and the theory of conditional densities are special cases of Kolmogorov’s theory (Billingsley, Reference Billingsley1995, p. 432; Rescorla, Reference Rescorla2015c). Kolmogorov’s theory is general enough to handle the Bayesian causal inference model, along with countless other applications. Perhaps because Kolmogorov’s theory is forbiddingly technical, it was long neglected by the philosophical community. Recently, it has begun to receive sympathetic attention from philosophers (Easwaran, Reference Easwaran, Bandyo-padhyay and Forster2011b; Huttegger, Reference Huttegger2015; Meehan & Zhang, Reference Meehan and Zhang2020; Nielsen, Reference Nielsen2021; Rescorla, Reference Rescorla2018a; Rescorla, Reference Rescorlaforthcoming). Easwaran (Reference Easwaran, Pettigrew and Weisberg2019) gives a detailed introduction, with comparisons to alternative theories of conditional probability.
A7 Proof of Bayes’s Theorem
Suppose that $P(H) > 0$ and $P(E) > 0$. The ratio formula determines conditional probabilities $P(H \mid E)$ and $P(E \mid H)$:
$$P(H \mid E) = \frac{P(H \cap E)}{P(E)}, \qquad P(E \mid H) = \frac{P(H \cap E)}{P(H)}.$$
Algebraic manipulation yields
$$P(H \mid E)\,P(E) = P(E \mid H)\,P(H),$$
which immediately entails Bayes’s Theorem:
$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}. \tag{9}$$
It is remarkable that this theorem follows almost trivially from the ratio formula yet offers such profound insight into rational inference.
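To see (9) at work, take a stock illustrative example: a disease with base rate $P(H) = 0.01$ and a test with $P(E \mid H) = 0.9$ and $P(E \mid \neg H) = 0.05$, where $E$ is a positive result. Expanding $P(E)$ over the two hypotheses,
$$P(E) = P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H) = (0.9)(0.01) + (0.05)(0.99) = 0.0585,$$
so (9) gives
$$P(H \mid E) = \frac{(0.9)(0.01)}{0.0585} \approx 0.154.$$
Despite the positive result, the hypothesis remains fairly improbable, because the low prior tempers the evidence.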
Now consider the case where we have a two-dimensional pdf $p(x, y)$. Define conditional densities and marginals as in Section A6:
$$p(x) = \int_{-\infty}^{\infty} p(x, y)\,dy, \qquad p(y) = \int_{-\infty}^{\infty} p(x, y)\,dx,$$
$$p(y \mid x) = \frac{p(x, y)}{p(x)}, \qquad p(x \mid y) = \frac{p(x, y)}{p(y)},$$
where the third definition presupposes $p(x) > 0$ and the fourth presupposes $p(y) > 0$. From these latter two definitions,
$$p(y \mid x)\,p(x) = p(x, y) = p(x \mid y)\,p(y).$$
By algebra,
$$p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)}, \tag{10}$$
which is Bayes’s theorem for pdfs. Note that $p(y)$ does not depend upon $x$. It figures solely as a normalization constant. Although (9) and (10) look similar and have similar proofs, they are distinct: (9) concerns conditional probabilities, while (10) concerns conditional densities.
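Equation (10) underwrites a simple computational recipe: multiply the prior density by the likelihood density pointwise, then normalize. The sketch below is my own illustration with arbitrary Gaussian choices; it approximates the posterior on a grid and checks the result against the known closed-form Gaussian posterior.

import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-(x - mean)**2 / (2 * sd**2)) / (sd * np.sqrt(2 * np.pi))

# Illustrative choices (assumptions): Gaussian prior over x, Gaussian likelihood
# for an observation y centered on x, and one observed value of y.
prior_mean, prior_sd = 0.0, 2.0
noise_sd = 1.0
y_obs = 3.0

xs = np.linspace(-10.0, 10.0, 4001)
dx = xs[1] - xs[0]

prior = normal_pdf(xs, prior_mean, prior_sd)          # p(x)
likelihood = normal_pdf(y_obs, xs, noise_sd)          # p(y_obs | x), as a function of x
unnormalized = likelihood * prior                     # numerator of (10)
posterior = unnormalized / (unnormalized * dx).sum()  # divide by p(y_obs), the normalizer

# Closed-form check: with a Gaussian prior and Gaussian likelihood, the posterior is Gaussian.
post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / noise_sd**2)
post_mean = post_var * (prior_mean / prior_sd**2 + y_obs / noise_sd**2)
print(xs[np.argmax(posterior)], post_mean)  # both near 2.4
print((posterior * dx).sum())               # ~1.0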
Bayes’s theorem generalizes beyond the formulations given here, using Kolmogorov’s theory of conditional probability (Ghosal & van der Vaart, Reference Ghosal and van der Vaart2017, p. 7). There are also some situations where no analogue to Bayes’s theorem is available (Ghosal & van der Vaart, Reference Ghosal and van der Vaart2017, pp. 7–8). In those situations, one can still conform to Conditionalization: one can respond to new evidence by replacing the prior with the posterior. Unfortunately, one can no longer use anything like (9) or (10) to compute the posterior.
Acknowledgments
I presented portions of this material at the 2019 Norwegian Summer Institute on Language and Mind; a fall 2020 graduate seminar at UCLA; three sessions of a spring 2022 graduate seminar led by Roberto Casati at the Institut Jean Nicod; a spring 2024 Princeton University cognitive science colloquium; and a spring 2024 workshop on bounded rationality at the University of California, Berkeley. I am grateful to all participants in these events, especially Tyler Brooke-Wilson, Roberto Casati, Kenny Easwaran, Adam Elga, Verónica Gómez Sánchez, Steven Gross, Elizabeth Harman, Geoffrey Lee, Sarah-Jane Leslie, John MacFarlane, Alonso Molina, Nico Orlandi, Jiarui Qu, Georges Rey, Paul Talma, David Thorstad, Alejandro Vesga, Francesca Zaffora Blando, and Snow Zhang for their helpful feedback. I also thank Cosmo Grant, Thomas Icard, Keith Frankish, and two anonymous referees for their comments on an earlier draft of the manuscript. Finally, I thank Olivia Bollinger, who prepared Figures 31 and 42, and Jiarui Qu, who prepared all the other original figures.
Keith Frankish
The University of Sheffield
Keith Frankish is a philosopher specializing in philosophy of mind, philosophy of psychology, and philosophy of cognitive science. He is the author of Mind and Supermind (Cambridge University Press, 2004) and Consciousness (2005), and has also edited or coedited several collections of essays, including The Cambridge Handbook of Cognitive Science (Cambridge University Press, 2012), The Cambridge Handbook of Artificial Intelligence (Cambridge University Press, 2014) (both with William Ramsey), and Illusionism as a Theory of Consciousness (2017).
About the Series
This series provides concise, authoritative introductions to contemporary work in philosophy of mind, written by leading researchers and including both established and emerging topics. It provides an entry point to the primary literature and will be the standard resource for researchers, students, and anyone wanting a firm grounding in this fascinating field.