When scientists lack validity evidence for measures, they lack the necessary information to evaluate the overall validity of a study's conclusions.
Flake and Fried (2020, p. 457)
Questionable measurement practices are widespread in the social and behavioural sciences and raise serious questions about the interpretability of numerous studies (Flake & Fried, 2020; Lilienfeld & Strother, 2020; Vazire, Schiavone, & Bottesini, 2022). Because Almaatouq et al. do not explicitly address measurement, we argue that unresolved measurement issues may threaten the feasibility and utility of their integrative approach. Below, we present three measurement concerns.
First, the interpretability of findings from experiments designed using the integrative approach will rely on the use of valid measurements. Consider the “Moral Machine” experiment (Awad et al., 2018, 2020), which Almaatouq et al. describe as “seminal.” Utilising a modified version of the trolley problem, this experiment evaluated participants' preferences for how autonomous vehicles should weight lives in life-or-death situations based on nine different dimensions. By assessing these dimensions simultaneously and collecting responses from millions of participants, Almaatouq et al. claim that this experiment “offers numerous findings that were neither obvious nor deducible from prior research or traditional experimental designs” (target article, sect. 4.1, para. 2). One of these key findings is that participants are willing to treat people differently based on demographic characteristics when the complexity of a moral decision is increased. However, the validity of this finding has been questioned because it may be an artefact of the forced-choice methodology that was used (Bigman & Gray, 2020). In addition, there is considerable debate in moral psychology about the external validity of the trolley problem and other sacrificial dilemmas (i.e., it is unclear that responses in these tasks predict real-world decisions or ethical judgements; Bauman, McGraw, Bartels, & Warren, 2014; Bostyn, Sevenhant, & Roets, 2018). Thus, to our minds, this example demonstrates that no matter how large and integrative an experiment might be, evaluating the validity of the measurements is essential.
Second, the construction of design spaces and the mapping of experiments onto them rely on valid measurement of design space dimensions. However, the validity of measurements, including those obtained from widely used measures, cannot be assumed. Consider Almaatouq et al.'s identification of social perceptiveness as a relevant dimension of group synergy research. They cite four studies that measured social perceptiveness using the Reading the Mind in the Eyes Test (RMET; Almaatouq, Alsobay, Yin, & Watts, 2021; Engel, Woolley, Jing, Chabris, & Malone, 2014; Kim et al., 2017; Woolley, Chabris, Pentland, Hashmi, & Malone, 2010). However, it is unclear what psychological constructs the RMET measures. While the RMET has been used to measure multiple dimensions of social cognition, including “theory of mind,” “emotion recognition,” “empathy,” “emotional intelligence,” “mindreading,” “mentalising,” and “social perceptiveness,” there is ongoing debate about the relationship between these constructs and which, if any, of them the RMET actually measures (Kittel, Olderbak, & Wilhelm, 2022; Oakley, Brewer, Bird, & Catmur, 2016; Silverman, 2022). Moreover, despite the extensive use of the RMET (cited over 7,000 times according to Google Scholar), serious questions have been raised about the reliability and validity of RMET scores (Higgins, Ross, Langdon, & Polito, 2023; Higgins, Ross, Polito, & Kaplan, 2023; Kittel et al., 2022; Olderbak et al., 2015).
This means that any integrative experiment that uses the RMET to measure social perceptiveness as a dimension of group synergy research will be very difficult to interpret. Given that vast swathes of measures used in psychological and social science research lack good validity evidence (Flake & Fried, 2020), analogous validity concerns are likely to exist for measures of many dimensions of a given design space. Thus, measurement validation is a critical and nontrivial consideration for the construction and implementation of the design spaces at the heart of the integrative approach. Moreover, given that design spaces are likely to include large numbers of dimensions, a coherent strategy to handle these issues must be developed; otherwise, the integrative approach risks becoming unmanageable in magnitude and complexity.
Third, measurement incommensurability poses a substantial challenge to the feasibility and utility of the integrative approach because knowledge integration relies on valid and commensurable measurements. Consider depression, one of the most prevalent mental health conditions worldwide (Herrman et al., 2019). Fried, Flake, and Robinaugh (2022) recently identified over 280 different depression measures. Extensive variability in the symptoms assessed by these measures forced them to conclude that different depression measures “seem to measure different ‘depressions’” (p. 360). Moreover, they found that depression measures frequently fail to show measurement invariance, meaning that they might measure different things when used in different groups or contexts. Fried and colleagues’ examination of depression measures is an unusually thorough demonstration of just how serious measurement incommensurability problems can be. Nonetheless, there are indications that validity and commensurability problems extend to a diverse range of research areas which, troublingly, are also pertinent to human welfare, including child and adolescent psychopathology (Stevanovic et al., 2017); race-related attitudes, beliefs, and motivations (Hester, Axt, Siemers, & Hehman, 2023); and well-being (Alexandrova & Haybron, 2016). While Almaatouq et al. claim that their integrative approach “intrinsically promotes commensurability and continuous integration of knowledge” (target article, abstract), it is unclear how the approach can feasibly address incommensurability arising from the use of disparate measures and violations of measurement invariance.
Left unaddressed, measurement incommensurability might substantially curtail the knowledge integration potential of the proposed approach.
To summarise, although we are sympathetic to Almaatouq et al.'s ambitious attempt to tackle the substantial challenges in the psychological and behavioural sciences, their lack of engagement with the measurement literature raises serious questions about their approach. If it is to deliver its intended benefits of increased commensurability and knowledge integration, then measurement must be addressed explicitly. It is unclear to us whether this can be achieved while maintaining the feasibility of the proposed integrative approach.
Financial support
This work was supported by an Australian Government Research Training Program (RTP) Scholarship (W. C. H.), a Macquarie University Research Excellence Scholarship (W. C. H.), a Discovery Early Career Researcher Award (DECRA) from the Australian Research Council (ARC) (E. D., grant number DE220100087), and the John Templeton Foundation (R. M. R., grant number 62631; A. G., grant number 61924).
Competing interest
None.