Lack of coordination limits the accumulation and integration of claims, as well as the efficient falsification of theories. How is the field to deal with this problem? We expect that consensus meetings (Fink, Kosecoff, Chassin, & Brook, 1984), where researchers come together to discuss their theoretical viewpoints, prioritize the factors they all agree are important to study, standardize their measures, and determine a smallest effect size of interest, will prove to be a more efficient solution to the lack of coordination and integration of claims in science than integrative experiments. We provide four reasons.
First, design spaces are simply an extension of the principles of multiverse analysis (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016) to theory-building. Researchers have recognized that any specified multiverse is just one of many possible multiverses (Primbs et al., 2022). The same is true for design spaces. People from different backgrounds and fields are aware of different literatures and might therefore construct different design spaces. Therefore, in practice a design space does not include all factors that members of a scientific community deem relevant – it merely includes one possible subset of these factors. While any single design space can lead to findings that can be used to generate new hypotheses, it is not sufficient to integrate existing hypotheses. Designing experiments that inform the integration of disparate findings requires that members of the community agree that the design space contains all factors relevant to corroborating or falsifying their predictions. If any such factor is missing, members of the scientific community can more easily dismiss the conclusions of an integrative experiment for lacking a crucial moderator or including a damning confound. Committing a priori to the consequences of the outcome – for example, in a consensus meeting – makes it more difficult to dismiss the conclusions.
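To make the "one possible subset" point concrete, here is a minimal sketch in Python (with purely hypothetical factor names) of how two groups, drawing on different literatures, would enumerate different design spaces for the same phenomenon; neither grid covers all factors the wider community might deem relevant.

```python
from itertools import product

# Hypothetical factors two labs consider relevant for the same phenomenon.
# Neither list is "the" design space; each is one possible subset.
lab_a_factors = {
    "incentive":     ["flat fee", "performance pay"],
    "time_pressure": ["low", "high"],
    "stimulus_set":  ["words", "images"],
}
lab_b_factors = {
    "incentive": ["flat fee", "performance pay"],
    "anonymity": ["identified", "anonymous"],
    "sample":    ["students", "online panel"],
}

def design_space(factors):
    """Enumerate all cells (factor-level combinations) of a design space."""
    names, levels = zip(*factors.items())
    return [dict(zip(names, combo)) for combo in product(*levels)]

space_a = design_space(lab_a_factors)
space_b = design_space(lab_b_factors)

print(len(space_a), len(space_b))               # 8 cells each
print(set(lab_a_factors) & set(lab_b_factors))  # only 'incentive' is shared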
We believe that consensus meetings will be required to guarantee that people from different backgrounds, fields, and convictions are involved in the creation and approval of the design space. During these consensus meetings, researchers will need to commit in advance to the consequences that the results of an integrative experiment will have for their hypotheses. Examples in the psychological literature show how initial versions of such consensus-based tests of predictions can efficiently falsify predictions (Vohs et al., 2021) and exclude competing hypotheses (Coles et al., 2022). Furthermore, because study-design decisions always predetermine the types of effects that can be identified in the design space, varying operationalizations may result in multiple versions of a study outcome that are not pro forma comparable. To reduce the risks of a “methodological imperative” (Danziger, 1990), we need a consensus among experts on the theory and construct validity of the variables being tested.
Second, many of the observed effects in a partial design space will be either too small to be theoretically interesting, or too small to be practically important. Determining when effect sizes are too small to be theoretically or practically interesting can be challenging, yet it is essential for falsifying predictions, as well as for showing the absence of differences between experiments (Primbs et al., 2023). Due to the combination of “crud” (Orben & Lakens, 2020) and large sample sizes, very small effect sizes could be statistically significant in integrative experiments. Without specifying a smallest effect size of interest, the scientific literature will be polluted with a multitude of irrelevant and unfalsifiable claims. For integrative experiments, which require a large investment of time and money, discussions about which effects are large enough to matter should happen before data are collected. Many fields that have specified smallest effect sizes of interest have used consensus meetings to discuss this important topic.
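As a rough illustration of this point, the Python sketch below (the true effect, the smallest effect size of interest of d = 0.1, and the sample sizes are all assumed for illustration) shows how a trivially small effect becomes statistically significant at integrative-experiment sample sizes, while two one-sided tests against an agreed smallest effect size of interest would flag it as too small to matter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

# Assumed numbers, for illustration only: a tiny true effect (d = 0.03)
# and a very large sample per condition.
n, true_d, sesoi = 50_000, 0.03, 0.10
control = rng.normal(0.0, 1.0, n)
treatment = rng.normal(true_d, 1.0, n)

# Standard significance test: the tiny effect is "significant" at this n.
t, p = stats.ttest_ind(treatment, control)
print(f"NHST: t = {t:.2f}, p = {p:.2g}")

# Two one-sided tests (TOST) against the smallest effect size of interest.
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / n + control.var(ddof=1) / n)
df = 2 * n - 2
p_lower = stats.t.sf((diff + sesoi) / se, df)   # H0: diff <= -SESOI
p_upper = stats.t.cdf((diff - sesoi) / se, df)  # H0: diff >= +SESOI
print(f"TOST: p_lower = {p_lower:.2g}, p_upper = {p_upper:.2g}")
# Both TOST p-values are tiny: the effect is statistically significant,
# yet also significantly smaller than the smallest effect size of interest.
```

The equivalence bounds only have meaning once a smallest effect size of interest has been agreed upon, which is precisely the kind of decision a consensus meeting can settle before data collection.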
Third, it is important to note that, due to the large number of comparisons made in integrative experiments, some significant differences might not be due to crud (i.e., true effects caused by uninteresting mechanisms) but due to false positives. Strictly controlling the Type 1 error rate when comparing many variations of studies will lower the statistical power of the tests as the number of comparisons increases. Not controlling for multiple comparisons will require follow-up replication studies before claims can be made. Such is the cost of a fishing expedition. Consensus meetings, one goal of which is to reach collective agreement on which research questions should be prioritized while coordinating measures and manipulations across studies, might end up being more efficient.
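A small sketch of this trade-off in Python (the effect size, per-cell sample size, and numbers of comparisons are illustrative assumptions, and Bonferroni stands in for whichever correction would actually be used):

```python
from statsmodels.stats.power import TTestIndPower

power_solver = TTestIndPower()

# Assumed for illustration: a small true effect and a fixed per-cell sample size.
effect_size, n_per_cell, alpha = 0.2, 200, 0.05

for n_comparisons in (1, 10, 100, 1_000, 10_000):
    corrected_alpha = alpha / n_comparisons  # Bonferroni correction
    power = power_solver.power(effect_size=effect_size,
                               nobs1=n_per_cell,
                               alpha=corrected_alpha,
                               alternative="two-sided")
    print(f"{n_comparisons:>6} comparisons: alpha = {corrected_alpha:.1e}, "
          f"power = {power:.2f}")
```

At a fixed per-cell sample size, power to detect the same effect drops steadily as the corrected alpha shrinks, which is the cost the paragraph above refers to.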
Fourth, identifying variation in effect sizes across a range of combinatorial factors is not sufficient to explain this variation. To make generalizable claims and distinguish hypothesized effects from confounding variables, one must understand how design choices affect effect sizes. Here, we consider machine-learning (ML) approaches a toothless tiger. Because these models exploit all kinds of stochastic dependencies in the data, ML models are excellent at identifying predictors in nonexplanatory, predictive research (Hamaker, Mulder, & Van IJzendoorn, 2020; Shmueli, 2010). Even if there is a true causal model explaining the influence of a set of design choices and variables on a study outcome, an algorithm will find all statistical relations – including those due to confounding, collider bias, or crud (Pearl, 1995). Algorithms identify predictors only relative to the variable set – the design space – so even “interpretable, mechanistic” (target article, sect. 3.3.1, para. 3) ML models cannot by themselves license causal conclusions. Achieving causal understanding through ML tools (e.g., through causal discovery algorithms) requires researchers to make strong assumptions and engage in a priori theorizing about causal dependencies (Glymour, Zhang, & Spirtes, 2019). Here again, we believe it would be more efficient to debate such considerations in consensus meetings.
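A minimal simulation of this last point in Python (the data-generating model and variable roles are invented for illustration): a predictive model assigns a manipulation a clearly nonzero coefficient for the outcome as soon as a collider is added to the feature set, even though the manipulation has no causal effect on the outcome at all.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 100_000

# Assumed data-generating model: the manipulation X has NO effect on the
# outcome Y, but both X and Y feed into a collider C (e.g., dropout).
x = rng.normal(size=n)
y = rng.normal(size=n)                      # independent of x by construction
c = x + y + rng.normal(scale=0.5, size=n)   # collider

# Model 1: predict Y from X alone -> coefficient near zero, as it should be.
m1 = LinearRegression().fit(x.reshape(-1, 1), y)

# Model 2: include the collider in the feature set -> X now "predicts" Y,
# even though there is no causal path from X to Y.
m2 = LinearRegression().fit(np.column_stack([x, c]), y)

print("X coefficient without collider:", round(m1.coef_[0], 3))  # ~0.00
print("X coefficient with collider:   ", round(m2.coef_[0], 3))  # ~-0.8
```

Which predictors the model "finds" depends entirely on which variables were included, that is, on how the design space was constructed in the first place.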
We believe integrative experiments may be useful when data collection is cheap and the goal is to develop detailed models that predict variation in real-world factors. Such models are most useful when they aim to explain variation in naturally occurring combinations of factors (as effect sizes for combinations of experimental manipulations could quickly become nonsensical). For all other research questions where a lack of coordination causes inefficiencies, we hope researchers studying the same topic will come together in consensus meetings to coordinate their research.
Competing interest
None.