Almaatouq et al. argue that the “one-at-a-time” approach to scientific research has led to collections of atomized findings of unclear relevance to each other. They advocate for an integrative approach in which stimuli are varied systematically across theoretically important dimensions. This allows for strong inferences (Platt, 1964) regarding which theory holds the most explanatory power across diverse contexts, as well as the identification of meaningful moderators.
Our research group has addressed this challenge by examining the analytic and design choices that naturalistically emerge across independent investigators, as well as the implications of those choices for the empirical results (Landy et al., 2020; Schweinsberg et al., 2021; Silberzahn et al., 2018). These crowdsourced many-analysts and many-designs initiatives reveal dramatic dispersion in estimates due to researcher choices, empirically demonstrating the limitations of the one-at-a-time approach (see also Baribault et al., 2018; Botvinik-Nezer et al., 2020; Breznau et al., 2022; Menkveld et al., 2023). At the same time, we have sought to further increase the already high theoretical value of replications by leveraging them for competitive theory testing. Rather than test the original theory against the null hypothesis, we include new conditions and measures that allow us to simultaneously examine the preregistered predictions of different theoretical accounts (Tierney et al., 2020, 2021). In this manner, we can start to prune the dense theoretical landscape (Leavitt, Mitchell, & Peterson, 2010) found in areas of inquiry characterized by many atomized findings and narrow theories.
In contrast, a striking and unexpected lack of variability has emerged when many laboratories collect data using the same methods. In such crowd replication initiatives, cross-site heterogeneity in estimates is far below what one would expect based on intuition and theory (Olsson-Collentine, Wicherts, & van Assen, 2020). From a perspectivist standpoint (McGuire, 1973), psychological phenomena should emerge in some contexts and be nonexistent or even reversed in others (see also Henrich, Heine, & Norenzayan, 2010). And yet, effects seem either to fail to replicate across all populations sampled or to emerge again and again (see also Delios et al., 2022).
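To make the notion of cross-site heterogeneity concrete, the sketch below shows one standard way it is quantified: the DerSimonian–Laird estimate of the between-site variance (τ²), computed from per-site effect estimates and their standard errors. This is a minimal illustration only; the per-site numbers are invented and are not drawn from any of the initiatives cited above.

```python
# Minimal sketch: DerSimonian-Laird estimate of between-site variance
# (tau^2) from per-site effect estimates and their standard errors.
def dersimonian_laird(estimates, std_errors):
    w = [1 / se ** 2 for se in std_errors]  # inverse-variance weights
    ybar = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, estimates))
    df = len(estimates) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - df) / c)  # tau^2, floored at zero

# Invented per-site effects (e.g., standardized mean differences) and
# standard errors: tightly clustered estimates yield tau^2 near zero,
# the pattern crowd replication initiatives keep finding.
site_effects = [0.21, 0.18, 0.25, 0.19, 0.22, 0.20]
site_ses = [0.08, 0.09, 0.07, 0.10, 0.08, 0.09]
print(dersimonian_laird(site_effects, site_ses))  # 0.0: negligible heterogeneity
```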
Bringing many designs, analyses, theories, and data collection teams together, we recently completed a crowdsourced initiative that qualifies as the type of comprehensive integrative test Almaatouq et al. envision. Tierney et al. (2023) systematically re-examined the relationships between anger expression, target gender, and status conferral. In the original research, women who displayed anger in professional settings suffered steep drops in the status and respect accorded to them by social perceivers (Brescoll & Uhlmann, 2008). Those original investigations employed only a single set of videos featuring one female and one male target as stimuli, and all participants were from Connecticut. In contrast, the crowdsourced replication project featured 27 experimental designs, a multiverse capturing many defensible analytic approaches, and 68 data collection sites in 23 countries. We further tested the original prescriptive stereotype account against competing theories predicting that anger signals status similarly for women and men, that anger has vastly different status implications in Eastern and Western cultures, and that feminist messaging has successfully reduced or even reversed gender biases. As Almaatouq et al. recommend, we probed the dose–response relationship between anger and status conferral by both experimentally manipulating and measuring the extremity of emotion expressions across different designs.
Aggregating across a wide range of research approaches and populations, the crowd initiative finds that anger increases status by signaling dominance and assertiveness, while also diminishing it by projecting incompetence and unlikability. Critically, this same pattern emerged for both female and male targets, for social perceivers of different genders, and in both Eastern, harmony-oriented cultures and Western, more conflict-oriented ones. Highlighting the value of deploying diverse research approaches, six of the 27 designs found favoritism toward men in status conferral, whereas one design pointed to the opposite conclusion. Similarly, in a multiverse with 32 branches, just two specifications supported the original gender-and-anger backlash effect. Had we employed a one-at-a-time approach, we could have accidentally hit upon, or strategically chosen, narrow methods yielding nonrepresentative conclusions (e.g., of pro-female status bias or gender backlash). Overall, the intellectual returns on including many designs, many analyses, and many theories were high. In contrast, and consistent with past crowd initiatives, collecting data across many places revealed minimal cross-site heterogeneity and no interesting cultural differences.
Thus, we envision a diverse scientific ecology consisting of many “small” and “medium” projects and just a few huge international efforts. The one-at-a-time approach is an efficient means of introducing initial evidence for promising new hypotheses. However, as a theoretical space becomes increasingly cluttered, intellectual returns are maximized by sampling stimuli widely and employing many analyses to provide severe tests of competing theories (Mayo, 2018). Although this could involve a crowd of laboratories, a single team could carry out a multiverse analysis (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016) and operationalize key variables in a variety of ways, as sketched below. A small team might sample just one or two participant populations that are easily accessible to them. Finally, a subset of findings of particularly high theoretical and practical importance should be selected for crowdsourced data collection across many nations as a systematic test of cross-cultural generalizability. When numerous sites are not available, researchers might carry out the first generalizability test in the most culturally distant population available (Muthukrishna et al., 2020). If the effect is still observed, this represents initial evidence of universality (Norenzayan & Heine, 2005).
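As an illustration of what a single-team multiverse might look like in practice, here is a minimal Python sketch. The toy data, the choice points, and the `estimate` function are hypothetical stand-ins, not the actual pipeline from Tierney et al. (2023); two binary choices are shown for brevity, whereas five would yield the 32 branches described above.

```python
from itertools import product
from statistics import mean

# Toy data: (condition, outcome, passed_attention_check) triples.
rows = [(0, 4.8, True), (0, 5.1, True), (0, 5.6, False),
        (1, 5.4, True), (1, 5.9, True), (1, 4.2, False)]

# Hypothetical analytic choice points; each added binary choice
# doubles the number of branches in the multiverse.
CHOICES = {
    "exclude_inattentive": [False, True],
    "trim_outcome_below": [None, 4.5],
}

def estimate(data, spec):
    """One branch of the multiverse: filter per the specification, then
    return the mean outcome difference between conditions."""
    if spec["exclude_inattentive"]:
        data = [r for r in data if r[2]]
    if spec["trim_outcome_below"] is not None:
        data = [r for r in data if r[1] >= spec["trim_outcome_below"]]
    treated = [y for cond, y, _ in data if cond == 1]
    control = [y for cond, y, _ in data if cond == 0]
    return mean(treated) - mean(control)

# Enumerate every combination of choices and report each branch's estimate.
for values in product(*CHOICES.values()):
    spec = dict(zip(CHOICES, values))
    print(spec, round(estimate(rows, spec), 2))
```

Reporting every branch, rather than a single specification, is what protects against accidentally or strategically selecting a nonrepresentative analysis.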
In sum, an ironic legacy of the movement to crowdsource behavioral research may be to show that scaling science to such a massive level is neither efficient nor strictly necessary for most research findings. The sorts of integrative tests Almaatouq et al. envision can also be accomplished by a small team that actively ensures a diversity of analyses and stimuli, yet collects data locally or across a few carefully selected cultures rather than globally. In the future, our greatest intellectual returns on investment may come from “medium” science that prioritizes testing many theories in many ways.
Financial support
This research was supported by an R&D grant from INSEAD to Eric Luis Uhlmann.
Competing interest
None.