
Consensus meetings will outperform integrative experiments

Published online by Cambridge University Press: 05 February 2024

Maximilian A. Primbs, Behavioural Science Institute, Radboud University, Nijmegen, The Netherlands. max.primbs@ru.nl, https://max-primbs.netlify.app/

Leonie A. Dudda, Department of Otorhinolaryngology, Head and Neck Surgery, University Medical Center, Utrecht, The Netherlands; University Medical Center Utrecht Brain Center, University Medical Center Utrecht, Utrecht, The Netherlands. l.a.dudda@umcutrecht.nl

Pia K. Andresen, Department of Methodology & Statistics, Utrecht University, Utrecht, The Netherlands. p.k.andresen@uu.nl

Erin M. Buchanan, Harrisburg University of Science and Technology, Harrisburg, PA, USA. ebuchanan@harrisburgu.edu, https://www.aggieerin.com/

Hannah K. Peetz, Behavioural Science Institute, Radboud University, Nijmegen, The Netherlands. hannah.peetz@ru.nl

Miguel Silan, Annecy Behavioral Science Lab, Menthon Saint Bernard, France; Développement, individu, processus, handicap, éducation (DIPHE), Université Lumière Lyon 2, Bron Cedex, France. MiguelSilan@gmail.com

Daniël Lakens (corresponding author), Human–Technology Interaction Group, Eindhoven University of Technology, Eindhoven, The Netherlands. D.Lakens@tue.nl, https://sites.google.com/site/lakens2

Abstract

We expect that consensus meetings, where researchers come together to discuss their theoretical viewpoints, prioritize the factors they agree are important to study, standardize their measures, and determine a smallest effect size of interest, will prove to be a more efficient solution to the lack of coordination and integration of claims in science than integrative experiments.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2024. Published by Cambridge University Press

Lack of coordination limits both the accumulation and integration of claims and the efficient falsification of theories. How is the field to deal with this problem? We expect that consensus meetings (Fink, Kosecoff, Chassin, & Brook, 1984), where researchers come together to discuss their theoretical viewpoints, prioritize the factors they all agree are important to study, standardize their measures, and determine a smallest effect size of interest, will prove to be a more efficient solution to the lack of coordination and integration of claims in science than integrative experiments. We provide four reasons.

First, design spaces are simply an extension of the principles of multiverse analysis (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016) to theory-building. Researchers have recognized that any specified multiverse is just one of many possible multiverses (Primbs et al., 2022). The same is true for design spaces. People from different backgrounds and fields are aware of different literatures and might therefore construct different design spaces. In practice, then, a design space does not include all factors that members of a scientific community deem relevant – it merely includes one possible subset of these factors. While any single design space can lead to findings that can be used to generate new hypotheses, it is not sufficient to integrate existing hypotheses. Designing experiments that inform the integration of disparate findings requires that members of the community agree that the design space contains all factors relevant to corroborating or falsifying their predictions. If any such factor is missing, members of the scientific community can more easily dismiss the conclusions of an integrative experiment for lacking a crucial moderator or containing a damning confound. Committing a priori to how the outcome will be interpreted – for example, in a consensus meeting – makes it more difficult to dismiss the conclusions.
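
To make the subset problem concrete, consider the following minimal sketch in Python; the factors, levels, and moderator are hypothetical, chosen purely for illustration. Two teams studying the same phenomenon, drawing on different literatures, construct different design spaces, and neither space is "the" design space.

```python
from itertools import product

# Hypothetical design space for an ego-depletion study: each factor is one
# design choice, and each combination of levels is one possible experiment.
design_space_a = {
    "depletion_task": ["Stroop", "e-crossing"],
    "outcome_task": ["handgrip", "anagrams"],
    "sample": ["students", "online panel"],
}

# A second team, drawing on a different literature, adds a moderator the
# first team never considered. Both spaces are defensible; neither is complete.
design_space_b = {**design_space_a, "trait_self_control": ["low", "high"]}

for name, space in [("A", design_space_a), ("B", design_space_b)]:
    n_designs = len(list(product(*space.values())))
    print(f"Design space {name}: {n_designs} possible experiments")
# Design space A: 8 possible experiments
# Design space B: 16 possible experiments
```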

We believe that consensus meetings will be required to guarantee that people from different backgrounds, fields, and convictions are involved in the creation and approval of the design space. During these consensus meetings, researchers will need to commit in advance to the consequences that the results of an integrative experiment will have for their hypotheses. Examples in the psychological literature show how initial versions of such consensus-based tests of predictions can efficiently falsify predictions (Vohs et al., 2021) and exclude competing hypotheses (Coles et al., 2022). Furthermore, because study-design decisions always predetermine the types of effects that can be identified in the design space, varying operationalizations may result in multiple versions of a study outcome that are not pro forma comparable. To reduce the risks of a "methodological imperative" (Danziger, 1990), we need a consensus among experts on the theory and the construct validity of the variables being tested.

Second, many of the observed effects in a partial design space will be either too small to be theoretically interesting or too small to be practically important. Determining when effect sizes are too small to be theoretically or practically interesting can be challenging, yet it is essential for falsifying predictions, as well as for demonstrating the absence of differences between experiments (Primbs et al., 2023). Due to the combination of "crud" (Orben & Lakens, 2020) and large sample sizes, very small effects could be statistically significant in integrative experiments. Without a specified smallest effect size of interest, the scientific literature will be polluted with a multitude of irrelevant and unfalsifiable claims. For integrative experiments, which require a large investment of time and money, discussions about which effects are large enough to matter should happen before data are collected. Many fields that have specified smallest effect sizes of interest have used consensus meetings to discuss this important topic.
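
One way to operationalize a smallest effect size of interest (SESOI) is an equivalence test, such as the two one-sided tests (TOST) procedure. The simulation below is a minimal sketch with made-up numbers: it assumes a crud-sized true effect of d = 0.04, a very large sample, and a hypothetical SESOI of d = 0.1, which in our proposal would be fixed in advance at a consensus meeting. The effect comes out statistically significant in a standard test, yet it is also statistically smaller than the SESOI, and so would not support a theoretically interesting claim.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated "integrative experiment" cell: a crud-sized true effect
# (d = 0.04) with a very large sample per condition.
n = 100_000
control = rng.normal(0.00, 1.0, n)
treatment = rng.normal(0.04, 1.0, n)

# A standard t test declares this trivial effect statistically significant.
t, p = stats.ttest_ind(treatment, control)
print(f"NHST: t = {t:.2f}, p = {p:.1e}")

# Two one-sided tests (TOST) against the SESOI of d = 0.1
# (raw difference 0.1, given SD = 1).
sesoi = 0.1
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / n + control.var(ddof=1) / n)
df = 2 * n - 2
p_lower = stats.t.sf((diff + sesoi) / se, df)   # H0: diff <= -SESOI
p_upper = stats.t.cdf((diff - sesoi) / se, df)  # H0: diff >= +SESOI
print(f"TOST: p = {max(p_lower, p_upper):.1e} (effect smaller than SESOI)")
```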

Third, due to the large number of comparisons made in integrative experiments, some significant differences might be due not to crud (i.e., true effects caused by uninteresting mechanisms) but to false positives. Strictly controlling the Type 1 error rate when comparing many variations of studies will lower the statistical power of the tests as the number of comparisons increases. Not controlling for multiple comparisons will require follow-up replication studies before claims can be made. Such is the cost of a fishing expedition. Consensus meetings, one goal of which is to reach collective agreement on which research questions should be prioritized while coordinating measures and manipulations across studies, might end up being more efficient.
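
The power cost of strict error control can be made concrete with a back-of-the-envelope calculation. The sketch below approximates a two-sample t test with a z test; the numbers (a true effect of d = 0.2, n = 500 per cell, Bonferroni correction) are illustrative assumptions, not figures from the target article.

```python
from scipy import stats

# Power of a two-sample test (normal approximation) for a true effect of
# d = 0.2 with n = 500 per cell, under Bonferroni-corrected alpha levels.
d, n = 0.2, 500
se = (2 / n) ** 0.5            # SE of the standardized mean difference
for m in [1, 100, 10_000]:     # number of comparisons in the design space
    alpha = 0.05 / m
    z_crit = stats.norm.ppf(1 - alpha / 2)
    power = stats.norm.sf(z_crit - d / se)  # negligible lower tail ignored
    print(f"{m:>6} comparisons: alpha = {alpha:.1e}, power = {power:.2f}")
# Power falls from roughly .89 (one comparison) to roughly .08 (10,000).
```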

Fourth, identifying variation in effect sizes across a range of combinatorial factors is not sufficient to explain this variation. To make generalizable claims and distinguish hypothesized effects from confounding variables, one must understand how design choices affect effect sizes. Here, we consider machine-learning (ML) approaches a toothless tiger. Because these models exploit all kinds of stochastic dependencies in the data, ML models are excellent at identifying predictors in nonexplanatory, predictive research (Hamaker, Mulder, & Van IJzendoorn, 2020; Shmueli, 2010). But even if there is a true causal model explaining the influence of a set of design choices and variables on a study outcome, the algorithm will pick up all statistical dependencies – including those due to confounding, collider bias, or crud (Pearl, 1995). Algorithms identify predictors only relative to the variable set – the design space – so even "interpretable, mechanistic" (target article, sect. 3.3.1, para. 3) ML models cannot simply grant indulgence in causal reasoning. Achieving causal understanding through ML tools (e.g., through causal discovery algorithms) requires researchers to make strong assumptions and engage in a priori theorizing about causal dependencies (Glymour, Zhang, & Spirtes, 2019). Here again, we believe it would be more efficient to debate such considerations in consensus meetings.
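
A minimal simulation makes the collider problem concrete. In the sketch below, with entirely hypothetical variables, Z has no causal effect on the outcome Y, yet a purely predictive regression that includes the collider C assigns Z a substantial coefficient; an intervention on Z would nonetheless do nothing to Y.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Causal ground truth: Z does NOT cause Y. Both Y and Z cause the collider C.
y = rng.normal(size=n)
z = rng.normal(size=n)
c = y + z + rng.normal(scale=0.5, size=n)

# Regressing Y on {Z, C} "identifies" Z as a predictor: conditional on the
# collider C, Z carries information about Y, despite having no causal effect.
X = np.column_stack([z, c, np.ones(n)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"coefficient on Z: {beta[0]:+.2f}")  # about -0.80, not 0
print(f"coefficient on C: {beta[1]:+.2f}")  # about +0.80
```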

We believe integrative experiments may be useful when data collection is cheap and the goal is to develop detailed models that predict variation in real-world factors. Such models are most useful when they aim to explain variation in naturally occurring combinations of factors (as effect sizes for combinations of experimental manipulations could quickly become nonsensical). For all other research questions where a lack of coordination causes inefficiencies, we hope researchers studying the same topic will come together in consensus meetings to coordinate their research.

Competing interest

None.

References

Coles, N. A., March, D. S., Marmolejo-Ramos, F., Larsen, J. T., Arinze, N. C., Ndukaihe, I. L. G., … Liuzza, M. T. (2022). A multi-lab test of the facial feedback hypothesis by the Many Smiles Collaboration. Nature Human Behaviour, 6(12), 1731–1742. https://doi.org/10.1038/s41562-022-01458-9
Danziger, K. (1990). Constructing the subject: Historical origins of psychological research. Cambridge University Press. https://doi.org/10.1017/CBO9780511524059
Fink, A., Kosecoff, J., Chassin, M., & Brook, R. H. (1984). Consensus methods: Characteristics and guidelines for use. American Journal of Public Health, 74(9), 979–983. https://doi.org/10.2105/AJPH.74.9.979
Glymour, C., Zhang, K., & Spirtes, P. (2019). Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10, 524. https://doi.org/10.3389/fgene.2019.00524
Hamaker, E. L., Mulder, J. D., & Van IJzendoorn, M. H. (2020). Description, prediction and causation: Methodological challenges of studying child and adolescent development. Developmental Cognitive Neuroscience, 46, 100867. https://doi.org/10.1016/j.dcn.2020.100867
Orben, A., & Lakens, D. (2020). Crud (re)defined. Advances in Methods and Practices in Psychological Science, 3(2), 238–247. https://doi.org/10.1177/2515245920917961
Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4), 669–688. https://doi.org/10.2307/2337329
Primbs, M. A., Pennington, C. R., Lakens, D., Silan, M. A. A., Lieck, D. S. N., Forscher, P. S., … Westwood, S. J. (2023). Are small effects the indispensable foundation for a cumulative psychological science? A reply to Götz et al. (2022). Perspectives on Psychological Science, 18(2), 508–512. https://doi.org/10.1177/17456916221100420
Primbs, M. A., Rinck, M., Holland, R., Knol, W., Nies, A., & Bijlstra, G. (2022). The effect of face masks on the stereotype effect in emotion perception. Journal of Experimental Social Psychology, 103, Article 104394. https://doi.org/10.1016/j.jesp.2022.104394
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-sts330
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712. https://doi.org/10.1177/1745691616658637
Vohs, K. D., Schmeichel, B. J., Lohmann, S., Gronau, Q. F., Finley, A. J., Ainsworth, S. E., … Albarracín, D. (2021). A multisite preregistered paradigmatic test of the ego-depletion effect. Psychological Science, 32(10), 1566–1581. https://doi.org/10.1177/0956797621989733