1. Introduction
It has become a commonplace that psychology entered a crisis sometime in the second decade of this century. The crisis was triggered by the recognition that seemingly established experimental results could not be replicated, a fact that has given rise to a great deal of stimulating methodological self-reflection within psychology and has attracted philosophical attention as well. Roughly, we can distinguish between two types of responses to the replication crisis, both of which see the ubiquity of replication failures as symptomatic of a deeper problem. The first views the replication crisis as rooted in the prevalence of questionable research practices (e.g., p-hacking and retrospective hypothesis fitting), which give rise to non-replicable results. Scholars in this debate, sometimes associated with the meta-science movement, have focused on ways in which psychological research can be regulated, e.g., by calling for the preregistration of experiments.Footnote 1 Another group of scholars takes the narrow focus on (the replicability of) experimental effects itself to be part of a larger problem, namely a relative sparsity of sustained theoretical work in psychology. In turn, this has given rise to efforts to develop methodologies of theory construction and to think more generally about what theoretical work in psychology might look like.
Both of these discussions expose problems with the epistemic practices of psychology, but they locate the root of the crisis in different places. This motivates different answers to the question of what the replication crisis is a crisis of, and (consequently) what kinds of measures should be taken to resolve it. This paper argues that the two diagnoses are mutually compatible, but that there is a deeper question at stake: Rigorous methods of experimental design, data analysis, and theory construction will only be fruitful if applied to the right questions about the right (kinds of) objects. I will explore questions about the “right” questions and objects in psychology by taking as a point of departure Morawski’s (2021) suggestion that differing responses to the replication crisis are rooted in different conceptions of the psychological subject matter.
Section 2 analyzes Morawski’s (2021) characterization of the difference between “reformers” and “challengers” to consider her suggestion that they differ (among other things) with regard to the ways in which they construe the psychological subject matter: in terms of effects versus in terms of complexity and context sensitivity. Highlighting that replicability is about generating data that allow inferences to specific phenomena, I will argue that both “effect seekers” and “complexity mongers” are confronted with similar epistemic problems. Section 3 argues that psychology needs a more sustained look at what psychological theories are about. Section 4 presents an answer to this question, which highlights the importance of studying the context sensitivity of psychological objects in its own right.
2. Reformers and challengers: Competing takes on the replication crisis
Morawski (2021) has recently suggested that differing assessments of the gravity of the replication crisis may be due to differing background assumptions about the psychological subject matter. She divides the community into two groups, “reformers” and “challengers,” and argues that reformers emphasize the importance of uncovering stable effects, whereas challengers view the subject matter of psychology as complex and context sensitive. I refer to proponents of the first position as “effect seekers” and to proponents of the second as “complexity mongers.” While these are, of course, caricatures, they are useful for my analytical purposes in this paper. This section disambiguates the notions of “effect” and “complexity” to get an analytical grip on some issues underlying the replication crisis.
2.1. Disambiguating “effect”: Data and (two kinds of) phenomena
Picking up the notion that some researchers (typically those more concerned about replication failures) construe the psychological subject matter in terms of stable effects, it will be helpful to begin by distinguishing between two usages of the term “effect”: the first refers to experimental effects (i.e., data), while the second refers to effects that are inferred from experimental effects (i.e., phenomena).Footnote 2 Debates about replicability mostly turn on the former, i.e., on the replicability of experimental effects, given reasonably similar experiments. Experimental effects, qua data, are used to make inferences to statements about phenomena. Such statements are best understood as the results of experiments.
I maintain that experimental psychologists need their experimental data not only to be replicable, but also to support the intended conclusion about a given phenomenon. Addressing this latter point first, we can say that researchers aim at experimental effects that serve as reliable evidence for the intended result. I understand the term “reliability” as referring to a situation where there is “the right sort of pattern of counterfactual dependence between the data and the conclusions investigators reach on the phenomena themselves” (Woodward 2000, S163). I interpret this to mean that data are reliable evidence for a specific claim only if they stand in the right kind of relationship to the phenomenon that we draw inferences about.Footnote 3 We will need to say more about what “in the right kind of relationship” means, but it seems clear that reliability is a stronger requirement than “mere” replicability, though presumably replicability is a necessary condition for reliability.
Given what was just argued, it seems that both effect seekers and complexity mongers ought to be concerned if they fail to generate replicable experimental effects. So, what are we to make of the suggestion that complexity mongers are less worried about replication failures than effect seekers? To address this question, let’s consider the nature of the phenomena that psychologists try to make inferences to. Here, a second distinction becomes relevant, namely, between two kinds of phenomena that psychologists are interested in, and thus between two kinds of experimental results they might wish to establish. The first concerns the existence of real-world behavioral (stimulus–response) effects, which are similar to the ones found in the experiment. The second concerns the existence of some feature of the psychological subject matter that cannot be immediately observed in the lab and that is not similar to the experimental effect. Feest (2011) refers to such unobservable effects as “hidden phenomena.”
Examples of the former are alleged effects such as social priming, power posing, or the Mozart effect. In those cases, researchers attempt to create experimental effects and treat those effects as evidence for a similar effect that exists outside the lab. An example of the latter is provided by facial feedback research, i.e., the (putative) phenomenon that there is a feedback mechanism between smiling and experienced positive emotions. The hypothesis that this phenomenon is real was tested by Strack et al. (1988), in an experiment that required subjects to hold a pencil in their mouths (in a way that engaged the facial muscles required for smiling) and subsequently measured the intensity of humorous emotions experienced when reading a funny cartoon. The resulting data seemed to confirm the facial feedback hypothesis. Clearly, though, researchers who perform the latter kind of experiment do not intend the circumstances under which the data are generated to be similar to situations under which facial feedback might be triggered in the real world.
The crucial point here is that in both kinds of cases researchers make inferences from experimental effects (data) to the effects of interest (phenomena). The difference is that the effects of interest are located in different places: In the former kind of scenario, the effects of interest are stimulus–response effects; in the latter, they are effects internal to the organism (see figures 1 and 2, respectively).
2.2. Disambiguating “complexity”
If my above analysis is right, it seems that my initial labels (“effect seekers” and “complexity mongers”) are misplaced, since both groups of scholars are interested in effects (both at the experimental level and as targets of their inferences). Nonetheless, I think that the distinction between effect seekers and complexity mongers points to an issue worth exploring. To this end, the current section takes a closer look at the notion of complexity.
Why do complexity mongers (even if forced to recognize the importance of replicable experimental effects) resist the suggestion that the replication crisis can be fixed by forcing researchers to implement stricter standards of hypothesis testing? The answer is that while failure to replicate an experimental effect is reason for concern, complexity mongers are less inclined to attribute such failures to questionable research practices alone. Instead, they emphasize the possibility of other contributing factors. Specifically, researchers who use their data to make inferences to internal phenomena (see figure 2) bring to the table a heightened sensitivity to the difficulty of generating data that are not only replicable, but also reliable. As already indicated, for experimental effects to function as reliable data for a specific experimental inference, they need to stand “in the right kind of counterfactual relationship” to the phenomenon described by the conclusion. We can unpack this to mean that data can only be regarded as reliable evidence for a specific claim (e.g., that there is a feedback mechanism between smile muscles and experienced emotions) if the experimental manipulation in fact triggered the effect of interest, and if the experimental data in fact measure the effect of interest.
It is clear that the requirement of reliability calls for an undistorted causal path between experimental manipulation and experimental measurement, such that the data are not confounded. It also seems very plausible that evidence about “internal phenomena” can easily be confounded by other internal phenomena, which are not easily controlled or even recognized (figure 3).
The distinction between replicability and reliability may be counterintuitive to readers immersed in the methodological literature in psychology, where the term “reliability” is sometimes equated with the ability to achieve the same effect when rerunning a test or experiment. However, this usage misses an important distinction, namely that between having replicable data and having data that support the inference to an effect of interest. On a charitable interpretation, complexity mongers are sensitive to this difference, because appreciating the internal complexity of biological organisms makes them aware of the many ways in which experimental data might not be reliable vis-à-vis the intended experimental results.
Even though I have explained the problem of data reliability in relation to the internal complexity of the organism, the problem also arises for those who “only” aim to make inferences from experimental stimulus–response effects to the existence of stimulus–response effects in the real world. Confounders do not have to be internal to the organism, as figure 4 illustrates: When an experimenter manipulates an organism, they treat the data as the effect of that manipulation. However, there might be uncontrolled variables in the experimental environment. Furthermore, the experimental stimulus might be described in a way that does not pick out the causally efficacious aspect. In such cases, the resulting data are unreliable vis-à-vis the intended conclusion. In other words, those interested in (mere) stimulus–response effects need to be just as worried about unreliable data as those interested in hidden effects.Footnote 4
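To make the distinction between replicable and reliable data more concrete, consider the following minimal simulation sketch. It is not part of the original argument, and all variable names and effect sizes are hypothetical; it merely illustrates how a confounded causal path (whether internal to the organism or located in the experimental environment) can yield an experimental effect that shows up in run after run while providing no evidence for the target phenomenon.

```python
# A minimal sketch, assuming purely hypothetical numbers and labels;
# it is not a model of any actual study discussed in the text.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def run_experiment(n_per_group=80):
    """One simulated experiment with a confounded causal path."""
    treated = np.repeat([0, 1], n_per_group)
    # Intended causal path: deliberately set to zero, i.e., the
    # hypothesized target phenomenon contributes nothing to the data.
    target_effect = 0.0 * treated
    # Confounding path: the manipulation also changes an uncontrolled
    # variable (internal or environmental) that does raise the scores.
    confounder_effect = 0.7 * treated
    scores = target_effect + confounder_effect + rng.normal(0.0, 1.0, 2 * n_per_group)
    _, p_value = stats.ttest_ind(scores[treated == 1], scores[treated == 0])
    return p_value

# The experimental effect is highly replicable across reruns ...
p_values = [run_experiment() for _ in range(20)]
print(sum(p < 0.05 for p in p_values), "of 20 simulated runs are significant")
# ... yet the data are unreliable as evidence for the target phenomenon,
# because the stable effect is produced entirely by the confounder.
```

In this toy setup, stricter replication standards would flag nothing wrong; only attention to the causal structure that generates the data, i.e., to reliability in the sense used here, would reveal the problem.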
3. Theory to the rescue?
The analysis in the previous section has established that both effect seekers and complexity mongers need to be concerned with stable effects (on the level of data and on the level of phenomena; section 2.1). It has also revealed that both need to reckon with the complexity that can threaten data reliability (section 2.2). In other words, effect seekers rightly emphasize the importance of replicable data. Complexity mongers rightly point to the difficulties of generating such data. I have unpacked this latter point to refer to both (a) the difficulty of generating experimental data that can be reproduced, and (b) the difficulty of generating data that allow for the intended inferences (i.e., data that are reliable as evidence for specific hypotheses about a given phenomenon). I side with the complexity mongers in arguing that this second point, in particular, cannot be resolved by improving replicability alone.
My analysis converges with recent methodological writings in psychology, which have also pointed out that replicable effects in and of themselves, even if they could be achieved more easily, would not be sufficient for claims about phenomena: while “methodological and statistical solutions to the replication crisis will … help ensure solid stones … they don’t help us build the house” (Muthukrishna and Henrich 2019, 1–2). One conclusion from this seems to be that one needs something of a blueprint for “the house,” i.e., a theory, or at least a sketch of a theory. These and other writings thus point to the fact that an adequate response to the replication crisis will require theoretical work in addition to methodological reforms. Relevant recent work includes attention to theory-building methodology (Borsboom et al. 2021), discussions of what theories might look like in psychology (e.g., van Rooij and Baggio 2021), as well as the role of formal methods as a way to constrain conceptual vagueness in hypothesis testing (Fried 2020a; Devezer et al. 2021).
Relatedly, Scheel et al. (2021) have pointed out that when psychologists test hypotheses by means of experiments, it is often hard for different researchers to agree on their correct interpretation, because the “derivation chain” (Meehl 1990) between theory, hypotheses, and data is underspecified. This observation fits well with recent attention to the problem of underdetermination in psychological experiments (Uygun Tunç and Tunç 2023; Oude Maatman 2021). It also speaks directly to my concern about data reliability, as I have been using the concept here. Addressing the underdetermination of phenomena by data amounts to attempting to improve data reliability. It seems clear that this is closely related to understanding—and physically implementing—the derivation chain between theory and data. The question is what kind of research is needed to accomplish this.
I agree with the suggestion by Scheel et al. (2021) that the research in question needs to focus on concept formation and exploratory research (including both exploratory experiments and formal modeling), while noting that this does not commit me to a clear-cut distinction between exploratory and confirmatory research (see also Devezer et al. 2021; Rubin and Donkin 2022). However, I argue that a focus on theoretical and exploratory work highlights the more fundamental question of what psychological theories, models, and concepts are actually about. As I have shown above, questions about data reliability are a concern for both effect seekers and complexity mongers. This suggests that there are peculiarities of the psychological subject matter that both sides have to grapple with, independently of their specific theories or research interests.
4. The context sensitivity of the units of psychological analysis
In search of a peculiarity of the psychological subject matter, let’s begin with Fried’s contention that psychological theories are about phenomena, not about data (Fried 2020b, section 3), followed, a few pages later, by the assertion that “psychological constructs can be thought of as target systems” that “are represented via a theory’s structure, which, like the target system, features components and relations among them” (ibid., section 3.2). I would like to point out that there is an important difference between phenomena and systems: Theories can explain individual phenomena, but systems can exhibit multiple phenomena.Footnote 5 Thus, I argue that while any given hypothesis derived from a theory can be about a specific phenomenon, psychological theories are usually not just about one specific phenomenon, but about a set of interrelated phenomena that are assumed to jointly constitute the object of research (or, as Fried calls it, the “target system”). Think, for example, of a psychological object like emotion. Clearly, this object has multiple phenomena associated with it, including a great variety of behavioral responses to stimuli (stimulus–response effects), but also internal/hidden phenomena, such as the facial feedback mechanism and emotional experiences.
Once we recognize that objects of psychological investigation are often systems of phenomena—or “clusters of phenomena” (Feest 2017)—this adds to our appreciation of the complexity of the psychological subject matter (figure 3). It also raises the question of how best to conceptualize the units that contain, or constitute, such systems, such that they can be differentially affected by environmental factors (figure 4). In this vein, I distinguish questions about the objects of psychological research (e.g., emotion) from questions about the units of analysis psychologists are interested in. Moreover, I suggest that we understand objects of psychological research as complex cognitive, behavioral, and experiential capacities (Feest 2022b), which are exhibited by individual organisms.Footnote 6
This brings to the fore another claim that Morawski (2021) attributes to “challengers,” namely that “psychology’s objects are not only sensitive to material conditions of the world, including the laboratory, but also affected by the shifting meanings that individuals derive from contexts both inside and outside the lab” (Morawski 2021, 4). Darwin’s facial feedback hypothesis is a case in point: while Strack et al.’s (1988) study seemed to confirm Darwin’s hypothesis, a later replication study did not. More recently still, it was found that the replication study had in fact introduced a small change that made the effect go away and thereby confounded the data: participants had been filmed during the replication study (see Feest 2022a for details). The example illustrates the way in which specific phenomena associated with an object can be context sensitive. It also illustrates that it is not obvious that data reliability could be improved by an improved theory of the object (emotion) alone, since it seems that the confounder had to do with the awareness of being filmed, not with emotion, narrowly construed as the “target system.” This is crucial here since it suggests that what makes data unreliable is an ineliminable part of psychological objects (complex cognitive, behavioral, and experiential capacities) and of the units that exhibit the phenomena peculiar to the subject matter (organisms as a whole).
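The same kind of toy simulation can illustrate the dual reading of the filming episode; the numbers below are invented for illustration and do not represent the actual Strack et al. or replication data. A small design change is modeled as an interaction: the effect of the manipulation is present without a camera and absent with one, so that the change acts as a confounder relative to the original inference, and as a moderator revealing context sensitivity when examined in its own right.

```python
# A minimal sketch with invented numbers; "filmed" and the interaction
# term are hypothetical stand-ins, not estimates from the real studies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

def run_study(filmed, n_per_group=60):
    """One simulated study; the manipulation works only when not filmed."""
    treated = np.repeat([0, 1], n_per_group)
    # Assumed moderation: the effect (0.7 SD) vanishes when a camera is present.
    effect = 0.7 * treated * (0.0 if filmed else 1.0)
    ratings = effect + rng.normal(0.0, 1.0, 2 * n_per_group)
    _, p_value = stats.ttest_ind(ratings[treated == 1], ratings[treated == 0])
    return p_value

print("original study (not filmed): p =", round(run_study(filmed=False), 3))
print("replication (filmed):        p =", round(run_study(filmed=True), 3))
# Relative to the original inference, the camera confounds the replication;
# read differently, it shows that the phenomenon under study is context sensitive.
```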
The conclusion I want to draw from the above is that the context sensitivity of the psychological subject matter needs to be at the center of both theoretical and empirical work, not merely as a way of controlling for confounders but also because, in the long run, it is the experiences, cognitions, and behavior of complex organisms in complex environments that psychology needs to focus on.
5. Going forward (instead of a conclusion)
The preceding analysis laid out why I (like many others) don’t think that the replication crisis is merely a crisis of failure to apply stringent methodological rules to practices of hypothesis testing and data analysis. While I agree with “complexity mongers” that the crisis is (at least in part) due to the sheer complexity of the subject matter, I have tried to unpack what this means in more specific terms by pointing to the problems of context sensitivity (as a feature of the psychological subject matter) and data reliability (as a feature of experimental evidence). Crucially, the latter is closely related to the former.
I have proposed a specific account of what the replication crisis is a crisis of (unreliable data in conjunction with a lack of reflection on what the units and objects of psychological analysis are). Let me add two disclaimers. First, I am not claiming that the underlying issue I have identified (concerning the subject matter of psychology) is the only issue worth exploring. Second, I have not presented an easy solution to the crisis. I do, however, think my analysis points in two directions. Even though the search for effects will continue to be an important part of psychological research, more efforts are needed to (1) think about how these effects contribute to our overall understanding of the objects of psychological research, and (2) explore how their manifestations are affected by variables internal and external to the organism. Small changes in the experimental design can be confounders relative to a specific intended experimental inference. But looked at from a different perspective, they can also indicate ways in which the object under investigation is context sensitive and thus moderated by the change in question.Footnote 7 In this vein, I would urge that we regard the context sensitivity of the psychological subject matter as a feature, not only as a bug: The very question of how organisms respond to environmental variations (but also how members of different populations respond differentially to similar environmental conditions) should be central to psychological research efforts.
While I agree with Eronen and Brinkman (2021, 785) that the way forward will include attention to stabilizing phenomena and “strengthening the conceptual basis of psychological theories,” the crucial question is how to delineate the corresponding objects and their component phenomena in the first place, and how to ensure that the data (experimental effects) that are generated in support of claims about robust phenomena are reliable. In this regard, I push for a macro-level perspective that takes the behavior of the whole organism into view first. My outlook here is sympathetic to earlier functionalist and ecological approaches to the psychological subject matter (from figures like James and Dewey to Gibson, Brunswik, and Gestalt psychology). Unlike those earlier approaches, however, I acknowledge internal phenomena and mechanisms as integral to the psychological subject matter, and thus recognize the importance of integrating a solid empirical understanding of (what I have called) stimulus–response effects (figure 1) with mechanistic theorizing about how they are brought about (figure 2) (Hatfield 2021). Individuating the phenomena that are context sensitive in this way is going to be far from trivial (Wajnerman-Paz and Rojas-Libano 2022). However, I concur with de Houwer (2011) that it is important to distinguish the search for, and characterization of, stimulus–response effects from that of cognitive (i.e., hidden) effects, and to direct conceptual, empirical, and theoretical work at the question of how they are related.Footnote 8 In conclusion, I argue that such simultaneous attention to the shape of the psychological subject matter and to the reliability of data is likely to be a crucial component of our response to the replication crisis.
Acknowledgements
I would like to thank the audience and other members of the PSA symposium for helpful questions and suggestions. I am also particularly grateful to Bart Penders for creating the flow diagrams that appear in this publication.