1. The crisis
“Don’t trust everything you read in the psychology literature. In fact, two thirds of it should probably be distrusted” (Baker 2015, 1). Thus opens a report in the journal Nature, commenting on the findings of the Open Science Collaboration project, which conducted replication attempts of 100 psychology experiments and reported that only “39% of effects were subjectively rated to have replicated the original result” (Open Science Collaboration 2015, 943). Such claims lie at the foundation of a crisis in confidence in the field, whereby the failure of findings to replicate is often taken to imply (tacitly or otherwise) that they are false. The characterization of this mass failure of reproducibility of psychological findings as a “crisis” rests on the assumption that “replication is one of the most important tools for the verification of facts within the empirical sciences” (Schmidt 2009, 90). Under such a characterization, those findings that can be repeated by different researchers in different laboratories can be considered verified facts, and those that cannot are dismissed as coincidental or the result of bad scientific practice (Loscalzo 2012; McNutt 2014; Nosek et al. 2022; Simons 2014). Consequently, those fields that have a higher rate of failed replications are considered less trustworthy than those that have lower rates. Thus, the high rate of replication failure in psychology constitutes, in this diagnosis, a crisis, in that the work produced by its researchers is considered to be unreliable.
The assumption that the successful replication of experiments distinguishes “trusted” from “untrusted” science has not gone unchallenged by philosophers, many of whom have argued that a high rate of replication failure can be perfectly compatible with responsibly conducted, high-quality science (Bird 2021; Feest 2019; Fletcher 2021; Irvine 2021; Lavelle 2022; Leonelli 2018; Schickore 2011). This article offers a new addition to this counteroffensive. It argues that when researchers are working in fields that we don’t yet know very much about, failed replications are not only to be expected but are necessary to furthering our understanding. I demonstrate this by a novel application of Hasok Chang’s (2004, 2012) framework of “epistemic iteration” to a very live and controversial puzzle in infant cognition, namely, whether babies can attribute false beliefs to others. Chang’s aim is to show how progress can be made even when our starting point is shrouded in uncertainty, and I argue that the unfolding of the infant false-belief research program exemplifies this. Furthermore, Chang’s notion of scientific progress gives a front-and-center place to the idea that there are always multiple epistemic goals in play. Although this is not a new idea, its emphasis helps us to see how even though failed replications may not be informative about the hypothesis under consideration, they nevertheless contribute to other epistemic aims, such as the validation or calibration of measurements or the refinement of concepts (see sec. 3; see also Van Dongen et al. 2022).
Finally, the article uses the case study to illustrate one of Chang’s most important contributions: that our scientific inquiries have to start somewhere. With hindsight, that starting point may look terribly bad. But in order for hindsight to occur, the starting point needs to be there. This is why failed replications are a necessary and expected part of good science: they are needed in order for the epistemic gains to be made that move us forward. A narrative of failed replications centered around “distrust” not only masks these gains but also runs the risk of losing them altogether by casting dismissive doubt on the value of those fields currently experiencing high rates of failed replication.
2. Anticipatory looking: A case study
2.1. Children, babies, and the false-belief task
The field of infant psychology is one that, I believe, is currently experiencing a large amount of uncertainty in some of its methods of measurement while also grappling with conceptual questions about how to characterize the phenomena such methods are intended to measure. Nowhere is this more manifest than in research examining infants’ abilities to attribute psychological states to other agents. On the one side, there are high-stakes debates about the nature of the psychological states that infants attribute to others and, in particular, whether they can attribute false beliefs to them. On the other side, there is a growing awareness that the methods of measurement, in particular, those that rely on infants’ spontaneous looking behaviors, are not as well understood as previously thought. Much of the key work in this field concerns preverbal infants Footnote 1 who have “limited attention spans, processing capacities and fine and gross motor skills” (Kominsky et al. 2022, 1). Consequently, most experimental paradigms rely on indirect measures to explore infants’ cognitive capacities, for example, by measuring how long a baby looks at a particular event or where the baby looks. Established causes of low replication rates in the psychological sciences include small sample sizes leading to low statistical power, the specialized nature of the equipment required, and a lack of standardization across measurements (Asendorpf et al. 2013; Collins 1985; Nosek et al. 2022).
Infant psychology is a field afflicted by all these factors, plus the additional problem of incredibly sensitive and temperamental participants (Byers-Heinlein et al. 2020; Frank et al. 2017; Lavelle 2022; Peterson 2016). It is therefore unsurprising that there have been multiple studies in the field that researchers have had trouble replicating. This case study focuses on one such replication project concerning infants’ understanding of other people’s psychological states.
For decades, it was widely accepted that children could not successfully attribute false beliefs to other people until around their 4th birthday. This was due to their performance on elicited-response false-belief tasks. In the original elicited-response false-belief task (Wimmer and Perner 1983), children watch a puppet, Maxi, hide some chocolate in one of two cupboards. Maxi leaves the chocolate in cupboard X and goes out to play. In his absence, his mother enters and moves the chocolate from cupboard X to cupboard Y. She leaves and Maxi returns, and then the child is asked where Maxi will look for his chocolate. Three-year-olds overwhelmingly respond that Maxi will look in cupboard Y, that is, where the chocolate really is and not where Maxi believes the chocolate to be. Around 4 years of age, children correctly answer that he will look in cupboard X. The authors explained their result with the hypothesis that 3-year-old children are limited in their ability to attribute psychological states to other people and are unable to attribute false beliefs to others, whereas 4-year-old children have developed this ability. This task, and those like it, is an elicited-response task because it requires the child to respond to a question asked by the experimenter: “Where will Maxi look for his chocolate?”
This result for the elicited-response false-belief task has been replicated hundreds if not thousands of times. It was therefore groundbreaking when Kristine Onishi and Renée Baillargeon published an article in 2005 arguing that 15-month-olds showed evidence of attributing false beliefs to others. Because 15-month-olds cannot participate in elicited-response tasks, the researchers used a spontaneous-response paradigm that measured how long an infant looked at an event in which an agent acted in a way that matched with their (the agent’s) belief, in contrast to events in which the agent acted in a way that did not match with their belief. This is the violation-of-expectation paradigm, which works on the premise that infants look longer at events that surprise them (i.e., that violate their expectations of what they predict will happen) than they do at events that match their expectations. They reported that infants would look longer at those test trials where the actor did not act in accordance with her (the actor’s) belief about a toy’s location, regardless of whether that belief was true or false, making the following claim:
‘Whether the actor believed the toy to be hidden in the green or the yellow box and whether this belief was in fact true or false, the infants expected the actor to search on the basis of her belief about the toy’s location. These results suggest that 15-month-old infants already possess (at least in a rudimentary and implicit form) a representational theory of mind: They realize that others act on the basis of their beliefs and that these beliefs are representations that may or may not mirror reality.’ (Onishi and Baillargeon 2005, 257)
Naturally, this article caused quite a stir, disrupting the “developmental dogma” of the previous 20 years that children below the age of 4 years could not attribute false beliefs to others (Rakoczy 2017). Until this point, the dominant conceptual frameworks had been designed to explain the developmental dogma; now these theories were hastily reconfigured to explain the new “developmental gap” in performance between infants’ responses on spontaneous-response tasks and children’s performance on elicited-response tasks. Onishi and Baillargeon’s work was succeeded by a slew of research using a variety of spontaneous-response methods to test infants’ understanding of false beliefs, with a recent statement from Rose Scott and colleagues that “over thirty reports, using eleven different behavioral and neural methods, have yielded positive evidence of early false-belief understanding in non-traditional [i.e., spontaneous] tasks” (Scott et al. 2022, 258). This article follows the replication attempts of a spontaneous-response task originally created by Victoria Southgate and colleagues (2007). Footnote 2 This task uses the “anticipatory looking” (AL) paradigm, which is based on the premise that babies will look to where they expect an agent to go before they see that agent’s movements. Therefore, if babies expect agents to behave in ways that are congruent with their (the agent’s) beliefs, they should look to where an agent will look for an object based on where that agent believes the object to be. The AL paradigm forms the basis of my case study because there are multiple documented replication attempts, many of which use Southgate’s stimuli.
At this point, an important disclaimer is in order. Onishi and Baillargeon, Southgate et al., and many others take the results of spontaneous-response false-belief tasks to support the hypothesis that infants can attribute false beliefs to others. This is a controversial explanation of the data. Other hypotheses abound: that infants’ looking behavior evidences the ability to track behavioral patterns in other agents, but they do not attribute psychological states to them (Heyes 2014a, 2014b; Santiesteban et al. 2014), or that infants attribute psychological states to others that are similar to beliefs but that differ by being nonrepresentational (Apperly and Butterfill 2009; Butterfill and Apperly 2013; Low et al. 2016). This article will not evaluate these hypotheses. Footnote 3 Instead, it focuses on the existence of a phenomenon: whether infants anticipate that an actor will behave in a way that accords with her (the actor’s) psychological states. I will refer to this as the anticipation phenomenon. The anticipation phenomenon describes a certain pattern of infant looking behavior, but it remains neutral on its causes; that is, it makes no claims about whether the infant displays this looking behavior because she is attributing psychological states to the agent, because she is tracking some behavioral pattern, or for any other reason. Because the anticipation phenomenon is distinct from the diverse hypotheses evoked to explain it, should it turn out not to exist, then each of the hypotheses just mentioned would require significant revision. Whether the anticipation phenomenon exists is the central question of this replication debate.
2.2. The anticipatory looking false-belief task
In 2007 Victoria Southgate and colleagues published a study that used the AL paradigm to examine 2-year-olds’ understanding of false beliefs. In this paradigm, participants watch a video showing a puppet; two boxes, each with a window above it; and a human actor. First, the baby watches the familiarization trials: the puppet puts a ball in a box while the actor watches; a chime sounds, and two windows above the boxes flash; and then the actor reaches through the window above the box with the ball in it, placing her hand in the box. The baby watches this sequence twice (once for each box). The aims of the familiarization trials are to show the baby that the actor wants the ball and for the experimenters to check that the baby’s looking behavior demonstrates that the baby expects the actor to reach for where the ball is; that is, when the chime sounds, the baby looks to the box where the ball is (more on this follows). Next, the babies watch one of two test conditions. In the first false-belief condition (false-belief 1), the actor watches as the puppet puts the ball in the left-side box, then moves it to the right-side box and closes the lid of the left-side box. The actor then turns away, distracted by a phone ringing. The puppet takes the ball out of the right-side box and leaves the scene, taking the ball with it. The actor turns back to the scene, the chime sounds, and the windows above the boxes flash. In this trial, babies should expect the actor to reach through the right window, with this expectation manifesting through (a) the babies looking first to the right window as soon as they perceive the chime and flashing cues (first-look measurement) and (b) their looking longer at the right window than the left window.
The puppet’s behavior in the other test condition—false-belief 2—is the same as in false-belief 1, but the actor is distracted as soon as the puppet places the ball in the left box and does not turn back to the scene until the puppet has left, meaning that she should reach through the left window when she turns back to the scene.
Southgate and colleagues (2007) reported that 9/10 infants in false-belief 1 looked to the correct window when they perceived the cues, and 8/10 did so in false-belief 2. Regarding how long infants looked at the correct window, they write, “As the infants were familiarized to a delay of 1750ms between the onset of illumination and the opening of a window, we coded only the first 1750ms after onset of illumination on the test trial. The infants spent almost twice as long Footnote 4 focusing on the correct window as the incorrect window” (Southgate et al. 2007, 590).
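To make the two measures concrete, the following is a minimal sketch of how a single test trial might be coded from gaze data. It is not Southgate et al.’s coding procedure: the sample format, the 20 ms sampling interval, and the names `code_trial`, `correct`, and `incorrect` are assumptions introduced purely for illustration. Only the 1750 ms coding window comes from the passage quoted above.

```python
from typing import Dict, List, Optional, Tuple

# One gaze sample: (time in ms since the windows lit up, region looked at).
# This format is hypothetical; real eye-tracking or video-coded data differ.
Sample = Tuple[float, str]


def code_trial(
    samples: List[Sample],
    correct: str,
    incorrect: str,
    window_ms: float = 1750.0,  # coding window reported in the quoted passage
    sample_ms: float = 20.0,    # assumed fixed interval between samples
) -> Tuple[Optional[str], Dict[str, float]]:
    """Return (first look, looking time per window) for one test trial."""
    first_look: Optional[str] = None
    looking_time = {correct: 0.0, incorrect: 0.0}
    for t, region in sorted(samples):
        if not 0 <= t < window_ms:
            continue  # ignore gaze outside the coding window
        if region not in looking_time:
            continue  # gaze at neither window (e.g., the actor's face)
        if first_look is None:
            first_look = region  # measure (a): first window fixated after the cues
        looking_time[region] += sample_ms  # measure (b): accumulated looking time
    return first_look, looking_time


# Toy trial: the last sample falls outside the 1750 ms window and is ignored.
trial = [(0, "actor"), (20, "right"), (40, "right"), (60, "left"), (1800, "left")]
first, times = code_trial(trial, correct="right", incorrect="left")
print(first, times)
```

On this toy trial the first look goes to the right window, which also accumulates more looking time than the left. Keeping the two measures separate matters for what follows, because replication attempts can succeed on one measure and fail on the other.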
As mentioned earlier, one of the roles of the familiarization trials is to ascertain that infants show the right looking behaviors. Infants who did not look toward where the actor should reach for the ball by the end of the second familiarization trial were excluded from the study. This is because of two assumptions in the methodology:
1. The baby’s gaze direction indicates that they anticipate something to happen at that location.
2. The baby’s anticipation is caused by some kind of cognitive mechanism that tracks the actor’s movements and predicts what she will do next.
These assumptions should be uncontroversial. Footnote 5 If infants do not show the right pattern of gaze in the familiarization trials, this suggests either that they are not able to track simple goal-directed actions or that their ability to do so is not revealed by the methodology. Because both of these explanations for their behavior mean that the AL methodology is not appropriate for examining that infant’s understanding of false beliefs, those who showed this behavior were excluded from the study. An additional 11 babies were excluded from the study for failing to meet this criterion.
2.3. Replicating the anticipatory looking false-belief task
Southgate et al.’s (2007) AL false-belief task has faced mixed replication success. Sebastian Dörrenberg and colleagues (2018) tested 66 2-year-olds with Southgate’s stimuli and found that participants looked longer at the correct window only in false-belief 1. Similarly, infants’ first looks upon perceiving the cues were to the correct window in false-belief 1, but they more often went to the incorrect window in false-belief 2. Tobias Schuwerk and colleagues (2018) also used Southgate’s stimuli, but they had to exclude 58% of participants (28 out of 48 children) for failing to look in the correct direction at the end of the familiarization period. Of the 20 participants who remained, only 7 looked first toward the correct window, and there was no difference in how long they looked at the correct and incorrect windows across both trials. In the same year, Louisa Kulke and Hannes Rakoczy (2018) collected data on both published and unpublished attempts to replicate Southgate et al.’s experiment, showing that of the 20 researchers who responded to their call for data, only 5 managed to successfully replicate Southgate et al.’s findings (see Table 1 for their criteria for evaluating replications).
Publication status | Replication | Partial replication | Nonreplication |
---|---|---|---|
Unpublished | 0 | 7 | 5 |
Published | 5 | 3 | 0 |
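The figures reported in this section can be checked with a few lines of arithmetic. The sketch below simply re-enters the counts from Table 1 and the exclusion numbers reported for Schuwerk and colleagues. The variable names are illustrative, and nothing here goes beyond the numbers already given in the text.

```python
# Counts re-entered from Table 1 (replication attempts collected by Kulke and Rakoczy).
table_1 = {
    "unpublished": {"replication": 0, "partial": 7, "nonreplication": 5},
    "published": {"replication": 5, "partial": 3, "nonreplication": 0},
}

# Total number of replication attempts in the table.
total = sum(sum(row.values()) for row in table_1.values())

# Attempts per outcome, summed over published and unpublished work.
by_outcome = {
    outcome: sum(row[outcome] for row in table_1.values())
    for outcome in ("replication", "partial", "nonreplication")
}
shares = {outcome: n / total for outcome, n in by_outcome.items()}

# Exclusion figures reported in the text for Schuwerk and colleagues.
excluded, tested = 28, 48
exclusion_rate = excluded / tested
remaining = tested - excluded

print(total, by_outcome, shares)
print(round(exclusion_rate * 100), remaining)
```

On these counts, a quarter of the 20 attempts in the table replicated fully, half replicated partially, and a quarter did not replicate. Every full replication was published and every nonreplication was unpublished. The exclusion arithmetic likewise matches the 58% and the 20 remaining participants reported above.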
What can be gleaned from this collection of replication data? Taking the more upbeat news first, it appears that more participants succeed in false-belief 1 than in false-belief 2 (Baillargeon et al. 2018). If robust, this pattern is something that theories of mind reading could reasonably accommodate. For example, infants need to hold in mind the actor’s false belief for longer in false-belief 2 in contrast to false-belief 1, requiring a greater demand on their limited processing capacity and resulting in their forgetting the actor’s belief and defaulting to reality. This would be in keeping with prominent accounts of why 3-year-olds fail elicited-response tasks (Carruthers 2013, 2018, 2020; Scott and Baillargeon 2009, 2017).
More worrying, however, is the lack of a pattern in how many infants fail the familiarization trials, with exclusion rates at this stage ranging from over 50% of participants (Schuwerk et al. 2018) to just 4% in other studies (Dörrenberg et al. 2018). On the basis of these data alone, one might question the AL paradigm’s suitability for measuring infants’ anticipation of another’s goal-directed movement, and this problem is made all the more pressing because we do not understand why it works for some babies and not others. These data serve to highlight lacunae in our understanding of this methodology.
In their response to this and other replication work concerning different false-belief tasks, Baillargeon et al. (2018) wrote the following:
We do not agree with claims in some of the special-issue papers that these negative findings cast doubt on the conclusion that some capacity for belief understanding is already present in infants and toddlers…. [T]he non-replications stand in contrast to a large body of positive and convergent findings: as was mentioned earlier, over 30 published reports, using 11 different methods, have now provided evidence of false belief understanding in children under 3-years of age. (123)
Notably, these authors each support theories of mind reading that predict that infants should be able to attribute false beliefs and other psychological states to other people. Yet researchers whose theoretical commitments lead them to be less confident that infants’ understanding of psychological states stretches to false belief take quite a different interpretation of the replication data, claiming that we are not yet in a position to know whether infants attribute false beliefs to others (Poulin-Dubois et al. 2018). Footnote 6
Allow me to reiterate that the focus of this article is the anticipation phenomenon (sec. 2.1), not whether infants can attribute false beliefs to others. One can reasonably reframe the debate just discussed to reflect this: one side believes that the data support the existence of the anticipation phenomenon, whereas the other does not; one side believes that a particular effect—infants looking toward where an agent will act—has been replicated, whereas the other does not. What makes the debate more intractable are new doubts, revealed by this replication work, about how the AL paradigm works. This yields a double uncertainty. First, there is uncertainty about the phenomenon: we do not know whether infants expect an agent to act in accordance with her (the agent’s) psychological states, which is why we are conducting the experiments in the first place. But additionally, there is also uncertainty about our methods of measurement: we do not know if the AL paradigm is a reliable method, so when infants’ looking behavior suggests they have not correctly anticipated the agent’s behavior, we don’t know if this is because they have not done so or if they have but it somehow has not been captured by the constraints of the AL paradigm. These uncertainties about the measurement and the phenomenon in turn fuel interpretation of the replication data in different ways, dependent on one’s prior theoretical leanings. Those who think infants can attribute psychological states to others will suggest there is something amiss with how the AL paradigm has been implemented, whereas those on the other side of the debate are more likely to accept the suitability of the AL paradigm but question the existence of the phenomenon. 
This comes out particularly fiercely in an exchange about the suitability of the violation-of-expectation method for measuring infants’ understanding of false beliefs, with Renée Baillargeon et al. (2018) suggesting that small differences in how the paradigm was implemented were responsible for the failure to replicate her work. By contrast, Paula Rubio-Fernández (2019) has expressed concerns that researchers are adjusting how they implement the paradigm until it yields results supportive of the view that infants can attribute false beliefs to others (see also Peterson 2016). And yet, if the phenomenon does exist (as many researchers believe it does), then calibrating our methods of measurement such that they can detect it could be a perfectly reasonable thing to do. The problems arise when, as here, there are doubts about the existence of the phenomenon.
This section has reviewed an ongoing debate about how to interpret attempts to replicate Southgate et al.’s (2007) experiment using the AL paradigm to ascertain if infants can discriminate between belief-congruent and belief-incongruent behaviors. Thanks to these replication endeavors, an important gap in our knowledge about the AL methodology has become apparent: we do not understand why a significant number of babies fail the familiarization trial. This leads to more pressing questions in our application of the paradigm: What needs to be in place for us to be confident that it is suited to tracking infants’ anticipations about events? And when infants’ looking behavior fails to support the existence of the anticipation phenomenon (sec. 2.1), is this because they have not made this discrimination or because it has not been detected by the AL method?
The next section turns to work by Hasok Chang (2004, 2012) that argues that even when a field faces a conundrum such as the one outlined here, it is still able to yield epistemic goods. This is due to the process of “epistemic iteration,” wherein by repeating experiments and keeping a variety of different theoretical options open, researchers are able to meet their epistemic goals and, in so doing, make progress with their discoveries. I will argue that replication is an essential part of the epistemic iterative process and that therefore, fields that experience high rates of failed replications can nevertheless be seen as producing important knowledge.
3. Epistemic iteration
3.1. Imperfect ingredients and the “principle of respect”
The structure of the puzzle outlined in section 2.3 is by no means unique to infant psychology. Every scientific field will, at various points in its history, have faced a problem where the current standards of measurement were inadequate for examining the phenomena researchers were interested in. Yet despite these uncertain foundations, the scientists involved were able to progress toward their epistemic goals: calibrating a widely agreed new standard, improving theoretical unity or explanatory power, improving quality and quantity of evidence, or some other epistemic virtue (Chang 2004, 227). This movement, argues Chang, occurs thanks to the process he calls epistemic iteration:
Epistemic iteration is a process in which successive stages of knowledge, each building on the preceding one, are created in order to enhance the achievement of certain epistemic goals. In each step, the later stage is based on the earlier stage, but cannot be deduced from it in any straightforward sense. Each link is based on the principle of respect and the imperative of progress, and the whole chain exhibits innovative progress within a continuous tradition. Iteration provides a key to understanding how knowledge can improve without the aid of an indubitable foundation. What we have is a process in which we throw very imperfect ingredients together and manufacture something just a bit less imperfect. (Chang 2004, 46)
Progress begins when a community acknowledges that its current system of knowledge is imperfect. In Chang’s example, scientists realized that our sensations of hot and cold were insufficient to permit the investigation of the phenomena they were interested in. In our case, we could say that prior to Onishi and Baillargeon’s pioneering work, we lacked a method to investigate infants’ understanding of false beliefs because the only methods available were designed for children over 36 months. Moving forward to the debate as it stands today: replications of Southgate et al.’s (2007) work have served to spotlight “imperfections” in our understanding of the AL paradigm, for example, our lack of knowledge of why performance in the familiarization trials is so variable. This is one of the most valuable functions of replications: highlighting gaps in our knowledge of which we were previously unaware (see sec. 3.3).
How do we move on from this state of uncertainty? Here, Chang (2004) argues that we should develop a new standard, whose relation to the old one is captured by the “principle of respect.” Our first iteration of thermoscopes needed to respect our folk sensations of temperature, showing that the things we reliably perceive as hot show a higher temperature than those that we reliably perceive as cold. But while guided by our sensations, the thermoscopes were not constrained by them because, in being more accurate than our sensations, they could later be used to correct judgments of temperature based on sensation alone: a hand that has been in the snow will feel a bucket of tepid water as warm, and one that has been snug in a mitten will feel it as cold, but the thermoscope will reveal that the water is a uniform temperature (Chang 2004, 43).
We see the principle of respect in action in the ongoing multilaboratory Many Babies 2 collaboration, which is conducting a large-scale replication project concerning whether babies expect an agent to look for something based on the agent’s knowledge of where that thing is (Schuwerk et al. 2022). The study uses the AL paradigm. One of the “imperfect” foundations upon which we set the AL paradigm is our acceptance that babies can attribute goals to other agents and expect them to act on these goals. There are multiple lines of support for this acceptance. First, we know that adults cannot help but see certain movements as goal directed, as was shown most famously by Heider and Simmel’s (1944) work. Second, it is a feature widely observed in the nonhuman animal kingdom, from a pride of lions hunting an impala to Sarah the chimpanzee recognizing the various outcomes her trainer’s behavior was aimed toward (Woodruff and Premack 1978). Third, there are strong evolutionary arguments for the ability to recognize goal-directed movements early in development as a critical means of enhancing survival. Fourth, there are a number of experiments, using a range of different methods (e.g., the visual habituation paradigm), yielding evidence to support the claim that by 8 months, infants reliably distinguish goal-directed from non-goal-directed movements. Footnote 7 And last, but by no means least, caregivers through the ages have treated their babies as though they can recognize goal-directed actions.
Taken as a whole, this collection of reasons from a range of disciplinary perspectives, although imperfect, nevertheless gives a foundation against which to calibrate an instance of the AL paradigm: if babies do not respond to a particular set of stimuli in ways that indicate that they have attributed a goal to the protagonist, then those stimuli need to be reconfigured until such a response is reliably procured. The epistemic iteration framework explains why this kind of calibration is acceptable: we are calibrating to an imperfect starting point, but provided we keep an open mind about how the next iteration of measurement might change this (see following discussion), it will be good enough. From their pilot work, the researchers on the Many Babies 2 team are confident that their implementation of the AL paradigm is able to track babies’ expectations of the goal-directed movements of others, with 68% of toddlers (n = 65; 18–25 months) and 69% of adults (n = 42) looking to where a chaser (a bear) would go in order to catch a chasee (a mouse) (Schuwerk et al. 2022, 19). Footnote 8
3.2. Enrichment, correction, and contradiction
In the previous section, I loosely used the phrase “keep an open mind” about how iteration could change our imperfect starting point. I now draw on three more concepts from Chang to explain what this entails.
First, our new measurements may contradict our previous ones in some ways (see the earlier example of the tepid bucket of water). Some contradiction can be tolerated: after all, the whole point of developing a new system of measurement is that the previous one is in some way inadequate, so we should expect some differences in their outputs. But if every instantiation of the new system leads to a contradiction with the old, then this gives us good reason to abandon the new system. For example, if we could not generate any stimuli that caused babies to look to where a goal-directed agent should go, then this would raise questions about the suitability of the AL method for this age group. Such doubt would be compounded if other methods did show that babies anticipate other people’s goal-directed actions. But there is also a more subtle manifestation of this problem peculiar to infant cognition. Babies have very limited cognitive and motor abilities, and in adjusting the stimuli until participants show AL behaviors, one can end up with images and situations that are very far removed from the everyday reality that babies typically encounter. For example, the Many Babies 2 stimuli are a simple cartoon bear and mouse, an upside-down Y-shaped tunnel through which the bear chases the mouse, and a box at each end of the “Y” where the mouse hides. But generalizability is inherent to the nature of the cognitive capacity we are studying: if babies only show looking behaviors consistent with goal attribution in a very specific circumstance and no other, then this is insufficient to support the claim that they anticipate the goal-directed behaviors of others because this ability is meant to underpin all (or most) perceptions of goal-directed actions, not just a tiny subset of them. Footnote 9 If babies’ looking behavior were specific to just one set of stimuli, this would contradict the hypothesis at the center of our imperfect foundation and lend support to abandoning the AL paradigm.
The second virtue of the iterative process is “enrichment,” wherein “the initially affirmed system is not necessarily negated but refined, resulting in the enhancement of some of its epistemic virtues” (Chang 2004, 228). The researchers on the Many Babies 2 team are confident that their stimuli reliably cause babies to look where they expect the bear to chase the mouse. This places them in a position to extend their method from collecting data about a phenomenon about which we are reasonably confident (babies’ ability to anticipate goal-directed action) to one about which we are less certain: babies’ ability to anticipate what someone will do based on their epistemic states (knowledge vs. ignorance). This work is currently underway, using the same stimuli as described for the earlier study but with a minimal adjustment: whether the bear sees which box the mouse enters upon leaving the tunnel. If the babies’ looking patterns do not show that they expect another to act on their knowledge states, the researchers can be reasonably confident that this is due to the babies’ cognitive limitations rather than quirks of the stimuli or measurement window because these remain the same as in the pilot. This process instantiates the principle of respect and also illustrates how the iterative process can lead to progress in allowing methods of experimentation to extend to new domains.
The last virtue of epistemic iteration that Chang (2004) discusses is “self-correction.” This occurs when a new standard gives us reason to adjust our hypotheses that were based on data from the old standard. In this case, one could call the Many Babies 2 stimuli a step toward a new standard. However, the stimuli themselves cannot be the standard, for the reasons explained at the start of this section. Instead, we need to develop our understanding of why these stimuli are more successful at eliciting goal-directed AL behavior. Once this has been done, the principles can be applied to the creation of new stimuli that give more uniform data concerning false beliefs than Southgate et al.’s (2007) data. Whether a self-correction is required depends on how these data turn out. Another form of self-correction is evident in the calibration process described earlier as the researchers on the Many Babies 2 team developed their stimuli. The adjustment made to the stimuli to get the effect of AL behavior is itself a process of self-correction and can only occur through repeatedly testing different participants.
3.3. Multiple epistemic goals
Central to Chang’s framework is the idea that there are always multiple goals at work in scientific research, and his emphasis on this aspect is helpful for understanding the epistemic gains made in our case study and through replication work more generally. More often than not, the stated goal of an experiment is to provide data for or against a specific hypothesis. If this is one’s only goal, then failed replications are certainly problematic. Popper (1959) famously argued that replicating results is necessary for distinguishing data that support a hypothesis from “mere isolated coincidence” (45). Later, Collins (1985) articulated the problem of the “experimenters’ regress,” namely, how different research teams decide which experimental outcome is the “correct” one: that of the original or of the failed replication (see Feest [2016] for further discussion). Returning to our case study: the data from the replications are insufficient to allow us to evaluate the anticipation hypothesis; thus, they fail to meet this epistemic goal. Yet despite failure on this front, the previous analysis shows how progress has been made toward achieving other epistemic goals: improving our understanding of how the AL paradigm works and, in so doing, making it a more reliable measure of infants’ expectations. This view of progress seems to capture the epistemic gains that come from replication work better than a single-minded focus on whether the results support the hypothesis under consideration.
One worry with this characterization of progress is that it does not match up with how experimenters view their own work. Southgate and colleagues’ (2007) aim was to test their false-belief hypothesis; the aim of those conducting the replication work was to test the anticipation hypothesis; none of these parties succeeded in attaining these ends. Is it fair to argue for progress on the grounds that different epistemic goals have been achieved when it is not at all clear that anyone involved in the work has these goals in mind? Footnote 10 I think this question can be addressed by revisiting part of the quotation cited in section 3.1: “Epistemic iteration is a process in which successive stages of knowledge, each building on the preceding one, are created in order to enhance the achievement of certain epistemic goals. In each step, the later stage is based on the earlier stage, but cannot be deduced from it in any straightforward sense” (Chang 2004, 46; emphasis added). This investigation of infants’ understanding of false beliefs began with an imperfect foundation: the assumption that the AL methodology would be able to provide evidence for or against the false-belief hypothesis. From this beginning, it could not be deduced in a straightforward sense that the next step would be to disassemble the methodology. That this would be a productive step only became apparent later in the research journey, when the failed replications came in. It seems uncontroversial to say that improving our understanding of the AL methodology is an epistemic gain. But it is not one that could have been foreseen from the starting point and thus could not have been a goal. Crucially, without the imperfect starting point, these gains would not have been possible. This is a liberal view of scientific progress, but I do not think it is too liberal. It sets limits on when more experiments are unhelpful: when they fail to meet any of the epistemic goals mentioned earlier. But it is nevertheless healthy, for science and philosophy, to consider the exclusion of wrong answers to be a form of progress.
3.4. Uncertainty revisited
This section has argued that epistemic iteration offers a way of understanding how infant psychology can make epistemic gains despite the dual doubts at its foundation: about the reliability of the AL method and about the existence of the anticipation phenomenon. By using the principle of respect and building out from our initial assumption that infants can attribute goals to others, we can begin to calibrate the AL methodology, which in turn increases our confidence in its reliability when applied to phenomena we are less certain of, such as anticipating another’s actions based on their knowledge states. Crucially, this iterative process can be applied to the other spontaneous methodologies that face the same double uncertainties about measurement and the existence of a phenomenon (e.g., Buttelmann et al.’s [2009] spontaneous-helping paradigm or the violation-of-expectation paradigm). Calibrating and standardizing spontaneous methodologies is a key epistemic goal for infant psychology, and the earlier discussion outlines how this is possible even when we are uncertain about the phenomena in question.
One worry about this application of epistemic iteration is that the cases of infant cognition and temperature are disanalogous. Footnote 11 Those developing the first instruments to measure temperature knew that there was a phenomenon “out there” to be measured; they were just unsure how to go about measuring it. In contrast, the central question of the infant psychology debate is whether babies expect people to act in ways that are congruent with their psychological states, and if so, what the limits of this ability might be (goal states, Footnote 12 knowledge states, belief states, etc., as well as the content of these states). In other words, it’s not clear that a phenomenon exists to be measured, unlike the case of temperature. As observed by Kenneth Kendler (2012), one cannot iterate “towards a target that isn’t there” (308).
I think this concern can be mitigated from two different angles. First, Chang himself is clear that epistemic iteration is valuable in helping us achieve our epistemic goals, even when we are unsure about whether our inquiries are targeting the phenomena we are after (see also Schaffner 2012):
It [epistemic iteration] differs crucially from mathematical iteration in that the latter is used to approach the correct answer that is known, or at least in principle knowable, by other means. In epistemic iteration that is not so clearly the case. (Chang 2004, 45)
A null result is nevertheless an epistemic gain. If, after numerous iterative attempts at calibration and standardization across all spontaneous methodologies, babies do not show looking behaviors consistent with the hypothesis that they anticipate the actions of other agents, then we accept that the research program has contradicted its core hypothesis and that babies do not have this cognitive ability. As mentioned earlier, being able to exclude a wrong answer can be useful.
A more likely scenario is that after several iterative processes aimed at improving calibration and standardization for each spontaneous methodology, there is no consensus about the nature of infants’ anticipation of other agents’ actions. This brings me to the second angle from which to address the worry. Just as results from the AL methodology alone are insufficient to support the claim that babies anticipate other agents’ actions, so too would the results from all spontaneous methodologies be insufficient to support this claim. Spontaneous methodologies are but one way of exploring and investigating infant cognition. Babies and their carers have, quite literally, always been a part of human history, and there is a vast, messy, and contradictory body of folk knowledge about their abilities. I am reminded here of a passage from Jennifer Nansubuga Makumbi’s (2020) novel The First Woman where a Ugandan trainee nurse writes home with news about her first days at medical school:
We have two orphan babies. I am not lying. Real breathing human babies, donated to the school by Ssanyu Babies’ Home, to learn how to look after babies – winding and bathing them, tying nappies and diet. I said, but these Europeans know how to waste time. Who taught our mothers to bring up children?
We are not yet in a position to know what infants know about other people’s actions. But what we do know is that infants grow into preschoolers who can track false beliefs in others and recognize when someone is hiding their true emotions (Wellman 2014) and eventually into adults who can track three or four levels of deceit in Shakespearean-style plots. Caregivers do not notice a seismic change in their children when they go from failing to passing false-belief tasks, nor when they pass any other purportedly significant mind-reading milestone Footnote 13 in tracking psychological states. We assume that infants know something about the actions of others, and as such, there is a phenomenon there to be explored, no matter how crudely outlined. Footnote 14 Folk knowledge and evidence from other sources (see sec. 3.1) combined with the principle of respect are sufficient to ensure we start our investigations in broadly the right ballpark, and even if the phenomenon under investigation is even less well understood than temperature was prior to the first thermometers, this does not foreclose the prospect of epistemic iteration leading to the fulfillment of our epistemic goals.
4. Imperfection, not falsehood
This article opened with a quote from an editorial in the journal Nature stating that two-thirds of what we read in psychology journals should not be trusted. This section reviews this sentiment in the light of the discussion in section 3.
First, the aim of Chang’s framework is to show how we can make epistemic inroads in a scientific investigation, be our starting point ever so bad. From an imperfect starting point and with imperfect methodologies, we can nevertheless end up with a better understanding of a phenomenon than that with which we started. Critically, the knowledge we gain would not have been possible had we not started somewhere: the imperfect starting point is necessary to attaining the goods that follow. This position stands in contrast to those who perceive a large number of failed replications to indicate untrustworthy science. A large number of failed replications should be expected when the starting point is bad because there is so much uncertainty about the concepts under investigation and the methods used to find out about them. The problems arise when researchers fail to acknowledge their work for what it is: a process of building outward from an uncertain foundation. An important lesson being learned from the replication crisis is that this starting point needs to be made more explicit (Bringmann et al. 2022; Feest 2022; Sikorski and Andreoletti 2023).
Second, one cannot build the foundation of a scientific research program on distrust. Footnote 15 But epistemic iteration shows that one can build such a foundation upon imperfection. This is not simply an issue of petty wordplay. Inherent in the distrust narrative is the sense that one would be irrational to continue in a field where so many findings fail to be replicated. Indeed, this is expressed with some force by Tal Yarkoni (Reference Yarkoni2020), who exhorts psychology graduates to go do something else with their lives. Epistemic iteration, on the contrary, shows such a starting point to be acceptable because it implies that there is considerable scope for improvement and plenty of work for scientists to do.
One may object that this is an overly Pollyanna-ish interpretation of a field with many failed replications. Sometimes, so the criticism goes, we should take a slew of failed replications to indicate that a hypothesis or research program ought to be abandoned. How can we distinguish between a foundation that is imperfect but has scope for improvement and one that is hopeless? We distinguish it through the system’s ability to achieve the epistemic goals it sets, and those that consistently face self-contradiction in the pursuit of these goals can be abandoned (see sec. 3.2). Getting the same data from the same methods is one epistemic goal, but it is not the only one; consequently, a large number of failed replications should not be the only reason to abandon a research program.
5. Conclusions
Infant psychology is a field with a high number of failed replications. Yet it is also, as argued in this article, a field where significant epistemic gains are being made in our understanding of the methods used to investigate infant cognition. This is the case despite the high degree of uncertainty in the field regarding both the phenomena under investigation and the reliability of the methods used to examine them. This article offers an explanation for how this can be in the form of epistemic iteration. Epistemic iteration offers the tools to see how we can progress toward our epistemic goals even when our starting point, both in terms of the phenomena under examination and the methods used to examine them, is imperfect. When a field is in this stage of having relatively few affirmed foundations, it is unsurprising that it also has many instances of failed replications because there is so little to build on (Irvine 2021). Importantly, we need to start somewhere, and without the messy data generated by these imperfect concepts, we would not be in a position to work out how we might advance our epistemic goals. It is as we start building on these data that we come closer to creating experiments that can be replicated.
There are several big issues that have been skirted in this piece, which I defer to later articles. The biggest is how we should view progress within infant psychology or even psychology as a more general field. The article accepts, without much defense, Chang’s proposal of progress as characterized by meeting epistemic goals, which gives a very localized view of progress because the goals of most epistemic import will vary from field to field and from time to time within a field. Future work could offer further defense of this view of progress, and of the coherentist approach more broadly endorsed by Chang, as appropriate for psychology. Another question is that raised in section 3.2 regarding the balance between making stimuli that are appropriate for infants and concerns about ecological validity and generalizability. The concern from ecological validity is that the stimuli are so different from life as encountered in the real world that one needs to carefully justify the claim that they are tapping into the same cognitive abilities that babies use “in the wild.” The concern from generalizability is that the stimuli may be testing a very specific cognitive ability (e.g., an infant’s theory of cartoon bears and mice) rather than the indefinitely flexible ability to track goals, which is the real target of investigation (Feest 2022; Packer and Moreno-Dulcey 2013).
Through this survey of replications of Southgate et al.’s (2007) AL false-belief task and the Many Babies 2 project, we see research that, far from being untrustworthy, exemplifies progress through the iterative processes of self-correction and enrichment. Research into infants’ abilities to attribute psychological states to others has very few certain foundations, and I have shown how the progress made to date is based on the most stable, but still imperfect, of these: infants’ ability to anticipate goal-directed actions. Thus, because failed replications are compatible with flourishing, progressive science, it is time to sever the connection between “does not replicate” and “untrustworthy” and instead recognize the necessity of this work for the epistemic iterative cycle of accumulating knowledge.
Acknowledgments
This research was supported by a Humboldt Experienced Researcher fellowship and by a British Academy (BA) grant, “Replication: Crisis or Opportunity” (SRG2000688), which funded a series of workshops where these ideas were developed. The author gratefully acknowledges these funders. Thanks also to participants at the BA workshops, Enno Fischer, Barry Maguire, the Consciousness and Cognition research group at Ruhr Universität Bochum, and two anonymous reviewers for their time and invaluable feedback.