
Some problems with zooming out as scientific reform

Published online by Cambridge University Press:  05 February 2024

Jessica Hullman*

Affiliation: Computer Science, Northwestern University, Evanston, IL, USA. jhullman@northwestern.edu

*Corresponding author.

Abstract

Integrative experimentation will improve on the status quo in empirical behavioral science. However, the results integrative experiments produce will remain conditional on the various assumptions used to produce them. Without a theory of interpretability, it remains unclear how viable it is to address the crud factor without sacrificing explainability.

Type: Open Peer Commentary

Copyright © The Author(s), 2024. Published by Cambridge University Press

When faced with social science research, why is it so hard to answer the question: What did we learn from this experiment? A core problem is that many experimenters have come to equate theories with predicting directional associations, which can neither formally ground expectations of when data are surprising nor yield strong experimental tests. Any scientific reform proposal that starts from data generated by sampling a design space and expects to get to good theory misconstrues the role of theory in learning from experiments: to propose a data-generating mechanism with testable implications (Fiedler, 2017; Muthukrishna & Henrich, 2019; Oberauer & Lewandowsky, 2019).

At the same time, behavioral science is unlikely to change the world if we do not start taking heterogeneity of effects more seriously (Bryan, Tipton, & Yeager, 2021). Integrative experiment design (target article) elevates heterogeneity by rendering explicit a larger design space from which experiments are sampled. By applying predictive modeling to test the generalization of surrogate models learned on portions of the space, it addresses the pervasive illusion that models chosen for their explanatory power also predict well (Yarkoni & Westfall, 2017). If adopted, integrative modeling seems well positioned to improve on the status quo of knowledge generation in many domains.
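To make concrete the kind of surrogate-model test this implies, here is a minimal Python sketch. The design-space dimensions (group size, incentive level, anonymity), the simulated outcome, and the choice of scikit-learn's random forest as the surrogate are all hypothetical assumptions of mine, not particulars from the target article; the split into "regions" merely stands in for whatever sampling scheme an integrative design would actually use.

```python
# Minimal sketch: fit a surrogate model on experiments sampled from one region
# of a design space and test how well it predicts outcomes in a held-out region.
# Dimensions, outcome, and model choice are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Hypothetical design space: each row is one experimental condition.
n = 500
design = np.column_stack([
    rng.integers(2, 10, n),   # group size
    rng.uniform(0, 1, n),     # incentive level
    rng.integers(0, 2, n),    # anonymity (0/1)
])
# Simulated outcomes standing in for observed effects.
outcome = 0.3 * design[:, 1] - 0.1 * design[:, 0] + rng.normal(0, 0.2, n)

# Train on one region of the space (small groups), then test generalization
# to another (large groups): a crude out-of-distribution check.
train = design[:, 0] < 6
surrogate = RandomForestRegressor(random_state=0).fit(design[train], outcome[train])
oos_error = mean_squared_error(outcome[~train], surrogate.predict(design[~train]))
print(f"Out-of-region MSE: {oos_error:.3f}")
```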

However, like related proposals that attempt to debias data-driven inferences by "zooming out," integrative design occupies an in-between territory in which gestures of completeness have conceptual value but struggle to find their footing in the form of stronger guarantees. Here I consider challenges that arise in (1) trying to separate the results sampled from a design space from the assumptions that produce them and (2) trying to achieve a balance between reducing confounds from the crud factor (Meehl, 1990) and drowning in complexity.

No such thing as unconditional data

A presupposition behind integrative experiment design – and related proposals like multiverse analysis, which attempts to amend the limitations of a single analysis by rendering explicit a design space to sample from (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016) – is that by zooming out from a narrow focus (on just a few variables, or a single analysis path) and sampling results from a larger space, they will produce unbiased evaluations of a claim. In integrative modeling, tests of surrogate models take the form of prediction problems in a supervised learning paradigm, with the added implied constraint that they must "accurately explain the data researchers have already observed."
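By way of contrast, here is a minimal sketch of the multiverse idea under the same hedges: the two analysis choices (an outlier rule and a transform) and the simulated data are hypothetical, and the only point is that alternative analysis paths are enumerated explicitly and all computed, rather than one path being chosen implicitly.

```python
# Minimal multiverse sketch: enumerate analysis choices explicitly and compute
# the same estimate under every combination. Choices and data are hypothetical.
from itertools import product
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.2 * x + rng.normal(size=200)

outlier_rules = {
    "none": lambda v: np.ones_like(v, dtype=bool),
    "2sd": lambda v: np.abs(v - v.mean()) < 2 * v.std(),
}
transforms = {
    "raw": lambda v: v,
    "rank": lambda v: v.argsort().argsort().astype(float),
}

# One correlation estimate per path through the (tiny) multiverse.
for (o_name, keep), (t_name, f) in product(outlier_rules.items(), transforms.items()):
    mask = keep(y)
    r = np.corrcoef(f(x[mask]), f(y[mask]))[0, 1]
    print(f"outliers={o_name}, transform={t_name}: r={r:.3f}")
```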

But the theories that arise from integrative experiment design will be conditional on more than just the features used to train them. How to interpret the "tests" of surrogate models is an important degree of freedom, for example. Measures like sample complexity can tell us how many observations are needed to resolve prediction accuracy within a chosen error bound, but not what bound should constitute sufficient predictive performance, or how that bound should differ across domains. There is a chicken-and-egg problem in attempting to separate the experimental findings from the definition of the learning problem and sampling approach.
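For illustration only, consider the textbook PAC-learning bound for a finite hypothesis class in the realizable case (a standard result, not one from the target article). It fixes the number of samples m needed to reach error ε with probability at least 1 − δ, but it says nothing about which ε should count as sufficient in a given behavioral domain:

```latex
% Illustrative textbook bound (finite hypothesis class, realizable case).
% It specifies m for a chosen (\epsilon, \delta), not which \epsilon suffices.
m \;\geq\; \frac{1}{\epsilon}\left(\ln\lvert\mathcal{H}\rvert + \ln\frac{1}{\delta}\right)
```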

If integrative experiment design also incorporates explanatory methods, and the explanations take the form of causal mechanisms proposed to operate in different regions of the design space, then this explanatory layer may very well make it easier for experimenters to draw on domain knowledge, helping retain predictive accuracy when moving out-of-distribution relative to a “pure prediction” approach. But this is difficult to conclude without defining what makes a surrogate model interpretable.

Goldilocks and the crud factor

Both multiverse analysis and integrative experiment design can seem to presuppose that our prior knowledge can take us just far enough to produce results more complex than current results sections, but not so complicated that we get overwhelmed. The "new kinds of theories" associated with integrative experimentation are meant to "capture the complexity of human behaviors while retaining the interpretability of simpler theories." This may be possible, but we should be careful not to assume that we can always zoom out to a dimensionality considerably greater than that implied by the status quo single experiment, yet not so great as to be incomprehensible to a human interpreter.

If we take seriously Meehl's notion of the crud factor, we might easily list hundreds of potential influences on, for example, group performance, from interpersonal attractions among group members to their religious orientations to recent current events. Even if the prior literature boils some of these down to encompassing unidimensional summaries (e.g., religious homogeneity), there will be many ways to measure each, and many ways to analyze the results, each of which might have its own consequences. How do we guarantee that the choices that matter will yield interpretable explanations? To take seriously the promises of approaches like integrative experimentation, we must contextualize them within a theory of interpretability.
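A back-of-the-envelope sketch, using entirely hypothetical counts, shows how quickly such a space outgrows what any human interpreter could inspect cell by cell:

```python
# Back-of-the-envelope illustration with hypothetical counts: a modest slice of
# Meehl-style crud already yields an astronomically large design space.
n_influences = 20            # candidate influences retained from a much longer list
measures_per_influence = 3   # ways to operationalize each influence
analysis_paths = 5           # defensible ways to analyze the results

design_cells = measures_per_influence ** n_influences * analysis_paths
print(f"{design_cells:.2e} combinations")  # ~1.7e+10
```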

Multiverse and integrative experiment design provide solutions that are more relative than precise: Sampling from the larger space better captures our fundamental ontological uncertainty about the true data-generating model than not defining and sampling from it, but cannot eliminate that uncertainty. By prioritizing data over theory, both approaches gesture toward completeness, but cannot provide guarantees. Under philosophical scrutiny, their clearest value seems to be rhetorical. Consequently, the completeness that such methods seem to promise can be misleading.

These points should not discourage adoption of integrative experimentation, which is likely to improve learning from experiments by addressing many important criticisms raised with the status quo. However, as confident but often informal proposals for scientific reform abound, it is always worth considering deeply what problems are addressed, and what promises, if any, can be made (Devezer, Navarro, Vandekerckhove, & Buzbas, 2021). Integrative experiment design is one way of improving learning from experiments; it can complement, but cannot replace, the need to clarify what we learn from any experiment – single or integrative – in the first place. To reform science we will also need to "zoom in" by formalizing our expectations within a theoretical framework and foregrounding the conditionality of our inferences.

Acknowledgment

The author thanks Andrew Gelman for comments on a draft.

Financial support

This work was supported by the National Science Foundation (CISE Nos. 2211939 and 1930642) and a Microsoft Research Faculty Fellowship.

Competing interest

None.

References

Bryan, C. J., Tipton, E., & Yeager, D. S. (2021). Behavioural science is unlikely to change the world without a heterogeneity revolution. Nature Human Behaviour, 5(8), 980–989.
Devezer, B., Navarro, D. J., Vandekerckhove, J., & Buzbas, E. O. (2021). The case for formal methodology in scientific reform. Royal Society Open Science, 8(3), 200805.
Fiedler, K. (2017). What constitutes strong psychological science? The (neglected) role of diagnosticity and a priori theorizing. Perspectives on Psychological Science, 12(1), 46–61.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244.
Muthukrishna, M., & Henrich, J. (2019). A problem in theory. Nature Human Behaviour, 3(3), 221–229.
Oberauer, K., & Lewandowsky, S. (2019). Addressing the theory crisis in psychology. Psychonomic Bulletin & Review, 26, 1596–1618.
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712.
Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100–1122.