When we first executed a metastudy in 2015 (Baribault, 2019; Baribault et al., 2018), the concept of sampling from a method space (what the target article calls a “design space”) was central to its implementation. We had set out to replicate an interesting effect we had found in a published paper. However, we soon realized that we would need to specify so many details of implementation – the kinds of things researchers rarely make explicit in their methods sections – that we felt we could not perform a faithful replication. Of course, we could have reached out to the original authors, but we also felt that the literature should to some extent be able to stand on its own. Eventually, we decided to be good Bayesians and allow for uncertainty in our experimental design. In contrast to a “point experiment,” a metastudy defines a distribution over the method space, from which we can draw samples in a kind of Monte Carlo integration over our uncertainty as to which point experiment best captures the effect of interest.
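As a minimal sketch of what drawing point experiments from a distribution over a method space could look like, consider the Python fragment below. The facet names, levels, and ranges (`mask_symbol`, `stimulus_ms`, `site`) are hypothetical illustrations, not the design of the original metastudy, and independent uniform sampling of each facet is only one of many possible choices of distribution.

```python
import random

# Hypothetical method space: every facet name, level, and range here is an
# illustrative assumption, not the design used in the original metastudy.
METHOD_SPACE = {
    "mask_symbol": ["#", "@", "$", "%", "&", "?"],  # categorical facet
    "stimulus_ms": (20.0, 80.0),                    # continuous facet (low, high)
    "site": ["lab_a", "lab_b", "online"],           # categorical facet
}


def sample_point_experiment(rng):
    """Draw one point experiment, sampling each facet independently."""
    point = {}
    for facet, levels in METHOD_SPACE.items():
        if isinstance(levels, tuple):
            low, high = levels
            point[facet] = rng.uniform(low, high)   # uniform over the range
        else:
            point[facet] = rng.choice(levels)       # uniform over the levels
    return point


def sample_metastudy(k, seed=0):
    """Draw k point experiments; the seed makes the draw reproducible."""
    rng = random.Random(seed)
    return [sample_point_experiment(rng) for _ in range(k)]


designs = sample_metastudy(k=5)
for design in designs:
    print(design)
```

Each sampled dictionary specifies one point experiment. Averaging an estimated effect over many such draws approximates the integral over design uncertainty described above.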
Our intent was to test a particular type of theory: A statement that is broader than a single contrast or effect, but is about regions in the method space where an effect holds. Others have referred to such regions as the universe of generalization (Cronbach, Rajaratnam, & Gleser, 1963), constraints on generality (Simons, Shoda, & Lindsay, 2017), or the boundary of meaning (Kenett & Rubinstein, 2021) – all invoking metaphors that imply the existence of some spatially arranged population of possible experiments.
We were interested in exploring this method space in part to identify moderators of effects but also to establish invariances. Invariances were perhaps of greater interest because they speak to the robustness of effects across sets of exchangeable experiments – experiments that are not identical, but that are minor variations on each other such that a reasonable experimenter could have chosen any one of them to test the theory at hand. In other words, many randomly sampled experiments are identical in theory, if not necessarily so in practice. We focused on randomization specifically because we wanted to determine whether an effect was robust – that is, whether it was insensitive to irrelevant perturbations of the study, such as who the participants were, where the study was conducted, or which #@$%&? masking symbol we chose.
This notion of identity in theory is important, I think. Whether two experiments can be reasonably compared or jointly analyzed (i.e., whether they are commensurate) depends not only on how they relate to one another but also on the theoretical weight given to that relationship. Without the context of germ theory, washing hands between patients may seem like a silly exercise, but in reality handwashing can act as an accidental confounder if it is not properly controlled. Accordingly, there must be a role for the formation of theories prior even to the construction of the method space.
The target article understates the importance of the development of integrative theory relative to the experimentation framework. Without a connecting theory, no two experiments (or, for that matter, observations) are commensurate. With a connecting theory, it does not seem to matter greatly if the method space was conceived ahead of time or even at all. Commensurability engineering – the activity of building experiments such that they are commensurate – is first and foremost a theoretical exercise. But this invites a new question: If indeed disparate experiments can be made commensurate with a properly integrative theory, and method spaces only provide commensurability if there is such a theory, then what justifies the added effort of designing a metastudy? After all, a space of experiments exists whether we define one or not and a research program of consecutive point experiments constitutes a guided walk in some space, so is not any collection of point experiments a metastudy?
An underappreciated strength of metastudies is their statistical efficiency (DeKay, Rubinchik, Li, & De Boeck, 2022; Rubinchik, 2019). In a metastudy, increasing the number of point experiments k reduces the standard error of the mean effect size above and beyond the total number of participants P. To see this, consider the equation for the error variance in a random-effects meta-analysis as a function of the variance in effect sizes across subjects ($\varsigma^2$) and the variance in effect sizes across studies ($\tau^2$): $\sigma_\delta^2 = \varsigma^2/P + \tau^2/k$. For a fixed number of participants, the first term is constant, so increasing the number of point experiments (and reducing the number of participants per study) shrinks only the second term and thereby improves estimation accuracy.
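The trade-off can be made concrete with a short numerical sketch. The function below simply evaluates $\sigma_\delta^2 = \varsigma^2/P + \tau^2/k$; the variance values and the participant budget are arbitrary assumptions chosen for illustration, not estimates from any study.

```python
def error_variance(varsigma2, tau2, P, k):
    """Error variance of the mean effect size: varsigma^2 / P + tau^2 / k.

    varsigma2: variance in effect sizes across subjects
    tau2:      variance in effect sizes across studies
    P:         total number of participants
    k:         number of point experiments
    """
    return varsigma2 / P + tau2 / k


# Assumed, purely illustrative values.
VARSIGMA2 = 1.0
TAU2 = 0.1
P = 600  # fixed total participant budget

for k in (1, 6, 60):
    per_study = P // k  # participants per point experiment
    var = error_variance(VARSIGMA2, TAU2, P, k)
    print(f"k={k:3d}  participants per study={per_study:4d}  error variance={var:.5f}")
```

With P held fixed, the subject-level term does not change across rows; only the study-level term decreases as k grows.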
Looking ahead, I believe there is much relevant work being done in the field of mathematical behavioral science. In order to engineer commensurability at scale, it is critical to develop quantitative integrated theories. Ideally these would take the form of likelihood functions – functions that describe the probability of data patterns under a theory – over the method space. A likelihood framework for theoretical integration has a number of advantages. For example, such a framework would be applicable even with complex theories for complex data. The focus of the target article seems mostly on linear theories – models that are composed mostly of effects (or “dependencies”) that change the mean of some variate in an additive or at most interactive way – but a well-constructed mathematical likelihood can account for patterns of any kind and data of any shape.
Even more importantly, likelihoods are inherently commensurate and can act as a universal language in which theories can be cast for comparison between areas of a method space (whether intentionally designed or not). Regions A and B of the method space are identical in theory T if they come with the same likelihood, p(data | A, T) = p(data | B, T), and not otherwise. The development of an integrative theory then boils down to defining this likelihood for all applicable regions, making all points in the method space commensurate while at the same time avoiding the incoherency problem discussed by Watts (2017). Theories of such scope are currently rare in social science, but we stand to gain much from their development.
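To illustrate the identity criterion, the sketch below defines a toy theory T as a likelihood over points in a method space. Everything in it is a hypothetical assumption made for illustration: the facet names, the logistic form, and the choice that accuracy depends on stimulus duration but not on the masking symbol. Note also that identity in theory requires equal likelihoods for all possible data, so comparing likelihoods on a single data set, as done here, is only a spot check.

```python
import math


def success_prob_under_T(point):
    """Hypothetical theory T: accuracy depends on stimulus duration only.

    The mask symbol is deliberately ignored, which encodes the theoretical
    claim that it is an irrelevant perturbation.
    """
    return 1.0 / (1.0 + math.exp(-(point["stimulus_ms"] - 50.0) / 10.0))


def log_likelihood(data, point):
    """Bernoulli log-likelihood, log p(data | point, T), for 0/1 responses."""
    p = success_prob_under_T(point)
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in data)


# Three illustrative points in the method space.
A = {"mask_symbol": "#", "stimulus_ms": 40.0}
B = {"mask_symbol": "@", "stimulus_ms": 40.0}  # differs from A in mask only
C = {"mask_symbol": "#", "stimulus_ms": 70.0}  # differs from A in duration

data = [1, 0, 1, 1, 0]  # made-up responses, for illustration only

same_AB = math.isclose(log_likelihood(data, A), log_likelihood(data, B))
same_AC = math.isclose(log_likelihood(data, A), log_likelihood(data, C))
print("A and B share a likelihood on this data set:", same_AB)
print("A and C share a likelihood on this data set:", same_AC)
```

Under this toy theory, A and B are commensurate by construction, whereas A and C can be compared only through the explicit dependence on stimulus duration that the likelihood specifies.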
Acknowledgment
Unable to find a native English speaker for proofreading on short notice, I asked ChatGPT to evaluate my writing. It found my grammar and spelling to be “mostly on par” with a native English speaker, which I found comforting.
Financial support
J. V. was supported by NSF grant Nos. 1754205, 1850849, and 2051186.
Competing interest
None.