1. Introduction
Violent crime is endemic to human society. The second-century poet Juvenal described Ancient Rome as having “no shortage of thieves” and “many opportunities to die” (Juvenal, 1769). Court records from thirteenth-century England show that “murderous brawls and violent death … were everyday occurrences” (Gurr, 1981, p. 305). Now, statistics suggest that 2020 was America's “most violent year in decades,” with more than 19,000 people killed in firearm-related incidents (Bates, 2020). Understanding why people act in violent and criminal ways remains a societal imperative.
In a 2013 address, then-President Barack Obama offered one possible avenue for reducing crime: improving early childhood education. “Every dollar we invest in high-quality early childhood education,” said Obama, “can save more than $7 later on by … reducing crime” (Obama, 2013). At its core, this statement communicates the causal hypothesis that high-quality childhood education will reduce crime. This hypothesis about a cause–effect relationship takes the form of a counterfactual statement about what could be. If the availability of high-quality childhood education were different, Obama predicts, then crime rates would also be different.
As social scientists, one of our primary aims is to produce research that verifies or challenges these sorts of causal claims. We examine evidence as to whether a causal relationship exists between two variables, offer theories for interpreting causal associations, and evaluate whether causal knowledge can be effectively applied to improve public health and well-being. As we will explain in this paper, this process of evaluating causes in social science relies heavily on counterfactual thinking, and it often begins by manipulating a variable in a randomly selected group of people. As we will also explain in this paper, this process of evaluating causes in social science is not limited to environmental exposures such as early childhood education. The same process of evaluating causes applies even when the causes in question are variables less commonly considered by social scientists: genes.
1.1. Environmental causes in the social sciences: An empirical example
In the early 1960s, disadvantaged children living in Ypsilanti, Michigan were randomly assigned to an intensive two-year preschool education program (High/Scope Perry Preschool Program [HPPP]) that involved over two hours of daily active learning and weekly home visits from teachers (Schweinhart, Barnes, & Weikart, 1993). Children of the same age and socioeconomic background who were not assigned to this program received no preschool education. All participants were assessed throughout the first 40 years of life to determine the effects of the education program on outcomes such as educational attainment, economic earnings, and criminal behavior (Belfield, Nores, Barnett, & Schweinhart, 2006).
This methodological design, known as a randomized controlled trial (RCT), serves as the gold standard for validating the sorts of causal claims advanced by President Obama. By randomly assigning participants to different levels of a manipulated variable (in this case, preschool education), researchers were able to approximate the counterfactual scenario of what would have happened if conditions had been different. We can observe both the rate of criminal behavior in those who were given the treatment of better education and the rate of criminal behavior among children whose lives proceeded as usual (the control group). Relative to the control group, those children who participated in the high-quality preschool education program received over 50% fewer total arrests and over 80% fewer charges for violent crimes by age 40 (estimates based on data presented in Heckman, Moon, Pinto, Savelyev, & Yavitz, 2010). Because of the experimental design of this study, we can conclude that President Obama was correct: improving early childhood education caused a reduction in adult criminal outcomes.
This conclusion is more meaningful than merely observing a correlation between attending preschool and (not) committing crimes. As social scientists, we privilege inferences about causal relationships over such correlational ones and believe that they reveal something unique about the world. As this paper will show, the conclusion that good preschool education causes an average reduction in adult criminal behavior is distinctive, but how we interpret and apply this knowledge depends on the type of causation implied by experimental designs.
To further illustrate both the power and the limitations of this causal knowledge, consider that of the six individuals from the HPPP study who went on to incur the greatest number of lifetime criminal charges, three of them had actually participated in the preschool education program (Heckman et al., 2010). So, despite being exposed to a program that “causes” a reduction in criminal behavior, these individuals nevertheless received a total of 110 criminal charges between them. At the same time, the specific mechanisms underlying the effect of preschool education on crime are opaque. Indeed, researchers were surprised to observe the effects of preschool on adult outcomes, as the benefits of the intervention had appeared to fade out entirely in middle childhood (Heckman, 2006). Whatever intermediary process linked an educational experience at age four with a behavior committed (or not committed) by age 40 is not known (Schneider & Bradford, 2020). Clearly then, the causation implied by the experimental paradigm does not suggest that preschool education is the sole determinant of a person's lifetime criminal behavior, nor that the criminal behavior of any single individual can be attributed to the preschool education they received, nor does it explain anything about the mechanisms generating individual differences in the relationship between preschool education and criminal behavior.
Nevertheless, knowing that preschool education makes an average difference in adult criminal behavior is useful. Most directly, this knowledge has led to calls for policy changes in the United States to develop and disseminate childhood education programs. That is, the most straightforward application of the observation that changing X produced an average difference in Y is to develop intervention and prevention programs that target X on a large scale. In this paper, we refer to this application as first-generation causal knowledge. Knowing that X caused Y in one group of people implies that one could change Y in future groups of people by changing X.
As many interventionists and policymakers can attest, however, first-generation causal knowledge can be quite limited (Bryan, Tipton, & Yeager, 2021). Treatment effects often fail to sustain over time, to generalize to other samples, or to behave in predicted ways (Bailey, Duncan, Cunha, Foorman, & Yeager, 2020). Further, even when these effects show maintenance and durability across time and place, they often operate through unobserved mechanisms, obfuscating deep understanding of the effect. We may know that improved preschool education caused decreases in crime, but we have limited understanding of for whom this effect will hold, why it holds, for how long it will last, or how portable this effect will be across contexts. Indeed, recent RCTs of preschool programs for children from low-income families have found surprising adverse effects of preschool on children's academic achievement, attendance, and disciplinary infractions (Durkin, Lipsey, Farran, & Wiesen, 2022). Clearly, first-generation causal knowledge is not sufficient to anticipate how an effect will play out in a different environmental and historical context.
Overcoming this challenge requires what we refer to in this paper as second-generation causal knowledge. By revealing sources of heterogeneity and mechanisms supporting the durability of causal effects, we can better understand when, where, why, for whom, and for how long X makes some difference in Y – and this knowledge gives us more avenues for effecting change. Knowing that preschool education made an average difference in adult criminal behavior is useful in the near term, because we identify preschool education as a potential intervention target. But we must go beyond that, examining the causal pathway from early education to adult crime to identify other intervention targets whose manipulation might yield larger, more enduring, or more generalizable changes in criminal behavior.
1.2. Evaluating genetic causes in the social sciences: An impossible or worthless task?
Let us consider another causal hypothesis: certain genetic variants cause violent and criminal behavior. As evidence for this claim, we might point to behavior genetics research on the heritability of antisocial behavior (see Lahey, Waldman, & McBurnett [1999] and Moffitt [2006] for review) and on specific measured DNA variants that can predict antisocial behavior and involvement with the criminal justice system (Karlsson Linnér et al., 2021; Tielbeek et al., 2017). The suggestion that genetic variants cause criminal behavior likely triggers a stronger intuitive response from many of our readers than the suggestion that being deprived of an education in early childhood causes criminal behavior. Genetic effects tend to be viewed as more essential, more natural, and more immutable than other causes (Dar-Nimrod & Heine, 2011; Lynch, Morandini, Dar-Nimrod, & Griffiths, 2019). Accordingly, claims about genetic causes are more controversial, both to our fellow scientists and to the general public, than claims about environmental ones. Research linking genetics with human behavior (along with some neuroscience research, which we do not address here) has been characterized as “subversive science” that has the “power to shake the public's faith” in “cherished ideologies” of responsibility and equality (Fox, 2019, pp. 153, 156). As the biologist Richard Dawkins noted almost four decades ago, genes have acquired a “sinister, juggernaut-like reputation” (Dawkins, 1982/2016, p. 12).
Genetics' sinister reputation has historical roots. In the twentieth century, results from the nascent field of behavioral genetics were used to justify state-sponsored violence against the socioeconomically disadvantaged and people of color, including forcible sterilizations. This history is – and will likely continue to be – a stumbling stone for those asked to consider the idea that genetic differences between people could cause the behavioral outcomes that are the province of social science.
Despite these fears that genetics will be misused to justify racist and classist oppression, the search for genetic correlates of human behavior is accelerating. The past decade has witnessed a rapid expansion in the collection and analysis of genomic data. As of June 2021, more than 38 million individuals had contributed DNA to ancestry-testing companies (Janzen, 2021) and over 5,000 genome-wide association studies (GWASs) had been published (Buniello et al., 2019). This includes GWASs of social and behavioral phenotypes, such as educational attainment (Lee et al., 2018), household income (Hill et al., 2019), and criminal activity (Tielbeek et al., 2017). As ancestry-testing companies and national biobanks continue to accrue DNA samples from millions of individuals, and as genetic variants continue to demonstrate associations with more and more biologically distal life outcomes, scientists have an outstanding responsibility to address the implications of genomic research.
The obvious shadow cast by the history of eugenics can make it difficult to see another stumbling block to considering claims about genetic causation: a widespread confusion about the basics of causal inference, about how genetic research in humans could ever establish causation, and about what such causal knowledge would ever be good for, in the absence of the ability to tinker directly with people's genes. The goal of this paper is to resolve this stumbling block by describing how certain genetic research designs map onto what social scientists already know about establishing causal relationships and applying causal knowledge. By describing a clear perspective on what it does – and does not – mean for genes to be causes, and how that causal knowledge can be ethically applied, we also challenge the genetic determinism and essentialism that have historically characterized the pernicious misapplications of genetics by political extremists.
Let us consider the problem in more detail. Why might the prospect of establishing genetic causes of human behavior seem difficult, perhaps to the point of impossibility? Recall that the first step for testing a cause in an RCT is to manipulate the variable-of-interest in a randomly selected group of people. In many corners of biology, manipulating the genome is not only viable, but widely practiced. Researchers studying rodents, insects, and sea and plant life commonly use gene-modification strategies (e.g., knockout, selective breeding) as a means of gaining experimental control (Nagy, Perrimon, Sandmeyer, & Plasterk, 2003). These techniques allow for a direct assessment of the (counterfactual) causal hypothesis that if the organism's genome had been different, the outcome would have been different too. But when it comes to testing causal hypotheses about the human genome, the very idea of experimental manipulation is provocative at best and contemptible at worst. Gene-editing technologies such as CRISPR have demonstrated that direct alteration of the human genome is possible, but the use of these technologies on any meaningful scale is both scientifically nascent and ethically ambiguous (Gaskell et al., 2017). Regardless of one's moral appraisal of gene modification in humans, the fact remains that at present, manipulating the genomes of a randomly selected group of people is not a practicable option for testing hypotheses about the genetic causes of human behavior.
Moreover, even if we concede that, at a conceptual level, genes could cause average differences in human behavior, at a practical level, it is not readily apparent what we would do with this knowledge. As Evelyn Fox Keller wrote, “[t]he major practical interest driving the search for the relative importance of different causal factors in producing a given phenomenon is to be found in the wish to effect change in that phenomenon” (Keller, 2010, p. 8). But, for the same reasons that we discussed above, we cannot (and should not) readily apply knowledge of genetic causes to change the genomes of large swathes of the population in the hopes of changing their outcomes. Indeed, many of us cannot even engage in that thought experiment without feeling anxiety or revulsion at the prospect. As a consequence, it might be easy to conclude that establishing genetic causes of human behavior, even if it could be accomplished, is not a worthwhile endeavor. The fruit of that causal knowledge, the idea that we could change behavior by changing people's genes, seems poisonous.
1.3. Goals of the current paper
This skepticism about the feasibility and value of establishing genetic causes, however intuitive and well-meaning it might be, is mistaken. As we discuss in this paper, genetic causes are like nearly all environmental causes investigated in social science: they are non-uniform, non-unitary, and non-explanatory. Indeed, most genetic causes, when appropriately identified, can be interpreted along the same lines as average treatment effects (ATEs) estimated from RCTs or other natural experiments. Genetic causes, like environmental ones, are not deterministic, explanatory, or homogeneous across place and time, but they do make an average difference in social and behavioral outcomes.
We also consider not just the feasibility of causal inference about genes but also the utility of that endeavor. We propose that knowledge about genetic effects on important life outcomes can help us change people's lives for the better, and that these changes may be brought about via social science (i.e., environmental) interventions, not by manipulating genomes. Specifically, we call attention to second-generation causal knowledge. Examining the causal pathways from genes to life-course outcomes allows us to improve etiological understanding, uncover sources of heterogeneity in those outcomes, and identify novel targets for intervention.
2. What is a cause and how do we identify them? A brief review of causal inference in the social sciences
For decades, the scholarly community has been polarized by how to interpret findings from behavioral genetics (Fig. 1). For some, the complexity of processes that span genes and behavior and the dynamic interplay between genes and environment preclude researchers from gleaning any sort of meaningful causal knowledge from behavioral genetic research designs (Block, 1995; Lewontin, 1974/2006; Turkheimer, 2011). For others, heritability estimates and correlations with measured genotypes are evidence that genes determine life outcomes (Herrnstein & Murray, 1996; Jensen, 1969; Murray, 2020). [Footnote 1] And for still others, genes are neither non-causal nor supra-causal, but are rather causes of human behavior in a more circumscribed, probabilistic sense (Bourrat, 2020; Dawkins, 1982/2016). How to decide among these competing interpretations?
We think that the seemingly intractable conversation about how to interpret the results of behavioral genetic research can be advanced by first considering a more general, and less controversial question: how do social scientists typically think about (non-genetic) causes and how do they go about finding them?
2.1. “No causes in, no causes out”
Determining that a relationship is causal requires more than plugging data into statistical models. It requires causal concepts (Pearl, 2009). Conceptual definitions of causation have historically been expressed in terms of active behavior – a cause “produces” (Locke, 1690/1997), “forces” (Lakoff, 1993), and “changes” (Charlton, 1983). Empirical tests of causation, therefore, involve detecting such activity, and not all statistical associations are up to the task. The familiar adage “correlation does not equal causation” is founded on precisely this principle, that a statistical association between two variables does not inherently demonstrate that one of those variables produced or changed the other. Identifying statistical causes means grounding statistical models in causal concepts and assumptions. In other words, “no causes in, no causes out” (Cartwright, 1995, p. 154).
The predominant causal concept in scientific thinking is the counterfactual (Pearl, 2018). Counterfactuals refer broadly to any hypothetical situation that describes what would have happened if conditions had been different. In 1973, David Lewis asserted that the counterfactual was the cornerstone of causal reasoning, arguing that X is a cause of Y if (a) when X occurs, Y occurs and (b) in the closest possible alternative world where X did not occur, Y also would not have occurred (Lewis, 1973a). Boiling water causes a tea kettle to whistle because (a) when water boils in a kettle, it whistles and (b) in a close possible world where water was not boiling in a kettle, it would not have whistled. Causation, in this view, is a matter of counterfactual dependence (Lewis, 1973b).
Counterfactual logic marked a departure from thinking about causation in terms of the regular occurrence of two variables. Regularity accounts of causation, which had dominated much of the history of causal reasoning, required that for X to cause Y, Y must invariably follow X (Hume, 1748/1999; Mill, 1843/2002). Relying on the constant conjunction of two variables for causation, however, is problematic. Among the problems of regularity accounts is that they evoke the thorny concepts of necessity (whether X must be present for Y to occur) and sufficiency (whether X alone can bring about Y) (Hulswit, 2002; Mackie, 1965). Counterfactual definitions relieve the need for Y to be necessarily or sufficiently dependent on X. Boiling water causes a tea kettle to whistle, but it is neither necessary (we can create steam in a kettle without boiling water), nor sufficient (if the water is boiling but the spout is open, the kettle will not whistle).
Despite these strengths, the counterfactual dependence account offered by Lewis (1973b) has limitations. [Footnote 2] First, it fares no better than regularity accounts at ruling out third causal variables. Borrowing an example from Woodward (2005), the reading of a barometer and the occurrence of a storm are counterfactually dependent on one another, such that if the barometer reading dropped, a storm would occur and if the barometer reading had not dropped, the storm would not have occurred. Nevertheless, they are not causally related. Both are caused by a third variable, namely, atmospheric pressure (Woodward, 2005). Second, counterfactual dependence does not explain the direction of the causal effect (Brady, 2011). Observing the co-occurring presence and absence of two variables does not reveal which of those variables is causally responsible for the other. Third, and perhaps most critically, the Lewis counterfactual is subject to what Holland (1986) referred to as the fundamental problem of causal inference: it is impossible to simultaneously observe X and not-X. The same kettle of water cannot be boiling and not boiling at the same time.
Manipulationist accounts of causation address some of these limitations. Similar to Lewis' counterfactual, manipulationist thinking relies on hypothetically comparing what would happen to Y under different conditions of X. Where it deviates is in reserving causal efficacy for those counterfactual situations “that describe how the value of one variable would change under interventions that change the value of another” (Woodward, 2005, p. 15). The critical shift here is from an emphasis on counterfactual dependence to counterfactual control (Ross, 2015). Manually changing the reading on a barometer will not cause a storm to occur because the barometer lacks causal control over the weather (Woodward, 2005).
This subtle shift from dependence to control has important advantages. First, it ensures that the detected relationship is not an artifact of a common cause. If intervening on X changes Y (or the probability of Y), then holding everything else constant, this rules out the possibility that X and Y just happen to change together because of Z (Ross, 2018). Second, it allows us to determine the direction of the effect. Designating one variable to be manipulated and one to respond establishes temporal precedence and helps to segregate cause from effect (Hill, 1965/2015). That just leaves the fundamental problem of causal inference – how can we simultaneously observe the changed and unchanged versions of X? For that, we need to create parallel worlds.
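The contrast between counterfactual dependence and counterfactual control can be made concrete with a small simulation of the barometer example. The sketch below is illustrative only: the function names (`simulate_day`, `storm_rate`) and all probabilities are hypothetical assumptions of ours, not values taken from any source. Low atmospheric pressure (the common cause Z) drives both the barometer reading and the storm, and we compare what is observed passively with what happens when the barometer is set by hand.

```python
import random

def simulate_day(rng, set_barometer=None):
    """Simulate one day; low pressure is the common cause of reading and storm."""
    low_pressure = rng.random() < 0.3  # assumed base rate of low pressure
    # The barometer tracks pressure unless we intervene and set it by hand.
    barometer_low = low_pressure if set_barometer is None else set_barometer
    # Storms depend on pressure only, never on the barometer itself.
    storm = rng.random() < (0.8 if low_pressure else 0.05)
    return barometer_low, storm

def storm_rate(days, barometer_value):
    """Share of days with a storm among days showing the given barometer state."""
    hits = [storm for barometer_low, storm in days if barometer_low == barometer_value]
    return sum(hits) / len(hits)

rng = random.Random(0)

# Passive observation: the barometer reading and the storm move together.
observed = [simulate_day(rng) for _ in range(20_000)]
obs_gap = storm_rate(observed, True) - storm_rate(observed, False)

# Intervention: force the reading low or high, leaving pressure untouched.
forced_low = [simulate_day(rng, set_barometer=True) for _ in range(20_000)]
forced_high = [simulate_day(rng, set_barometer=False) for _ in range(20_000)]
int_gap = storm_rate(forced_low, True) - storm_rate(forced_high, False)

print(f"observational gap in storm rate: {obs_gap:.3f}")
print(f"interventional gap in storm rate: {int_gap:.3f}")
```

Under these assumed probabilities, the observational gap should sit near 0.75 while the interventional gap should sit near zero: the barometer and the storm depend on one another counterfactually, yet manipulating the barometer gives no control over the weather.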
2.2. Parallel worlds and potential outcomes
In the United States, more than 256,000 children and adolescents have witnessed or died from school shootings in the past two decades (Cox, Rich, Chiu, Muyskens, & Ulmanu, 2018). The median age of assailants is 16 years old (Cox et al., 2018). While we know that changing preschool education is an effective means of reducing violent crime, if we have already missed the opportunity to improve an individual's preschool experience, we must develop other methods for reducing violent and aggressive behavior during critical developmental windows. Suppose you think that, for gun violence to end, adolescents need to be more compassionate toward one another. Equipped with an understanding of the relevant causal concepts, you know that to demonstrate that compassion causes a reduction in violent behavior, you need to manipulate compassion and see how violent behavior responds. For example, you might design a curriculum for first-year high school students that increases awareness of positive emotions and strengthens empathic communication skills. To test whether this intervention works, you need to create parallel worlds, running with the exact same conditions at the exact same time, save for one single difference: the presence of the compassion intervention. Each world then hosts a range of potential outcomes, in this case, the prevalence of violent behavior. The difference in the observed outcomes across these worlds represents the causal effect of the compassion intervention on violent behavior.
In social science, the simulation of parallel worlds and potential outcomes most often takes the form of a randomized controlled trial (RCT; Fisher, 1925). We create parallel worlds by assigning different, but similar, people to different conditions of an intervention (i.e., treatment groups). We consider the response of each treatment group as a representation of potential outcomes, of what would have happened given the opposite condition. We summarize the causal effect by taking the difference between the average outcomes of the treatment groups (ATE; Rubin, 2005). RCTs license causal inference because they translate those theoretical causal concepts – manipulation, counterfactual control, parallel worlds, potential outcomes – into empirical action. They provide an algebra of the counterfactual (Pearl, 2010).
How well an RCT approximates these causal concepts, however, depends on how well it meets four critical assumptions: independence, sample homogeneity, potential exposability, and SUTVA (stable unit treatment value assumption). Together, these assumptions build confidence that a study truly tests whether X has causal control over Y. Fortunately, most of these are satisfied (at least in expectation) by a single methodological tool: randomization. By randomizing participants to treatment groups, we neutralize any dependency between treatment assignment and outcome (independence; Holland, 1988), and we balance (in expectation) the treatment groups on all variables other than X (sample homogeneity; Rubin, 1974). Randomization thus forms the basis of our parallel worlds, ensuring that the mechanism splitting our sample into respective worlds operates in a way that maximizes the uniformity of these worlds. Any causal effect is therefore attributable to the control of X over Y, and not to any artifactual differences between these worlds.
Randomization also helps confirm that all participants can be potentially assigned to any of the treatment conditions (potential exposability; Jo & Muthén, 2001). This marks the first step toward preserving the comparison of potential outcomes. If certain participants are unable to receive one of the treatment conditions – that is, if X cannot be manipulated for them – then the counterfactual collapses. Holland's (1986) proclamation “No causation without manipulation” is emphasized for exactly this reason (p. 959). If X cannot be changed, then the potential outcome of what would have happened had X been different does not exist, and no causal comparison can be drawn. Importantly, this proclamation can be extended to cover scenarios in which X is only hypothetically manipulatable, but where pragmatic or ethical considerations limit its ability to be manipulated in practice (Holland, 1986; Woodward, 2005).
If randomization sets the counterfactual conditions of a study into motion, SUTVA guarantees that they persist as the study unfolds. SUTVA protects the uniformity of parallel worlds and the openness of potential outcomes by stipulating that (a) participants in each treatment group receive identical forms of the treatment and (b) the outcome for each participant is not influenced by the treatment assignment of another participant (Rubin, 1980). Uniting these tenets is the overarching principle that, once parallel worlds have been set to run, no new worlds are created. Consider, for example, if instead of receiving the same compassion curriculum, some students received education focused on building communication skills, while others learned mindful breathing or expressive writing. We could no longer meaningfully compare the potential outcomes of X and not-X because X would represent several divergent conditions. Likewise, if participants from the treatment group share their discoveries with members from the control group, then our parallel worlds have intersected and opened new counterfactual doors. For the difference between potential outcomes to have causal validity, the parallel worlds initiated by randomization must be preserved throughout the study. In theory, “SUTVA is automatically satisfied under the Fisher (1935) null hypothesis of absolutely no treatment effects of any kind” (Rubin, 1986, p. 961), though in practice, meeting SUTVA involves careful methodological design and statistically testing the magnitude of potential interference (Hudgens & Halloran, 2008; Sobel, 2006).
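To illustrate how a difference in group means can stand in for unobservable parallel worlds, the following sketch simulates a hypothetical compassion trial. Every quantity here is an assumption made for illustration (the outcome scale, the effect sizes, the variable names); none comes from an actual study. Each simulated student carries two potential outcomes, only one of which is ever observed, and the randomized comparison is checked against the true average effect that a real study could never see directly.

```python
import random
from statistics import mean

rng = random.Random(1)
n = 10_000

# Hypothetical potential outcomes per student: a violence score without the
# curriculum (y0) and with it (y1). Individual effects are allowed to vary.
people = []
for _ in range(n):
    y0 = rng.gauss(5.0, 2.0)
    individual_effect = rng.gauss(-1.0, 1.5)
    people.append((y0, y0 + individual_effect))

# The true ATE uses both potential outcomes, which is impossible in practice.
true_ate = mean(y1 - y0 for y0, y1 in people)

# Randomization reveals exactly one potential outcome per student.
treated, control = [], []
for y0, y1 in people:
    if rng.random() < 0.5:
        treated.append(y1)  # y0 stays unobserved for this student
    else:
        control.append(y0)  # y1 stays unobserved for this student

estimated_ate = mean(treated) - mean(control)

print(f"true ATE (unobservable in practice): {true_ate:.3f}")
print(f"difference in group means:           {estimated_ate:.3f}")
```

Because assignment is random, the two groups resemble one another on average, so the difference in group means should land close to the true ATE (about -1.0 under these assumed parameters) even though no individual-level effect is ever observed.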
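The role of randomization in securing independence and balance can be sketched by comparing a self-selected sample with a randomized one. The example below is a minimal illustration under assumptions of our own: a single hypothetical background covariate (`resources`), an assumed true treatment effect of -1.0, and helper functions (`outcome`, `diff_in_means`, `covariate_gap`) invented for this purpose.

```python
import random
from statistics import mean

rng = random.Random(2)
n = 20_000

# Hypothetical background covariate (e.g., family resources), standardized.
resources = [rng.gauss(0.0, 1.0) for _ in range(n)]

def outcome(r, treated):
    """Assumed outcome model: true effect of treatment is -1.0; resources also lower it."""
    return 5.0 - 1.0 * treated - 1.5 * r + rng.gauss(0.0, 1.0)

def diff_in_means(assign):
    """Treated-minus-control difference in mean outcome for an assignment vector."""
    ys = [outcome(r, t) for r, t in zip(resources, assign)]
    treated_mean = mean(y for y, t in zip(ys, assign) if t)
    control_mean = mean(y for y, t in zip(ys, assign) if not t)
    return treated_mean - control_mean

def covariate_gap(assign):
    """Treated-minus-control difference in the background covariate."""
    treated_mean = mean(r for r, t in zip(resources, assign) if t)
    control_mean = mean(r for r, t in zip(resources, assign) if not t)
    return treated_mean - control_mean

# Self-selection: only higher-resource families opt in to treatment.
self_selected = [r > 0 for r in resources]
# Randomization: a coin flip decides, independent of everything else.
randomized = [rng.random() < 0.5 for _ in resources]

self_selected_diff = diff_in_means(self_selected)
randomized_diff = diff_in_means(randomized)

print(f"covariate gap, self-selected: {covariate_gap(self_selected):.3f}")
print(f"covariate gap, randomized:    {covariate_gap(randomized):.3f}")
print(f"outcome difference, self-selected: {self_selected_diff:.3f}")
print(f"outcome difference, randomized:    {randomized_diff:.3f}")
```

In this sketch the self-selected comparison overstates the benefit, because the groups differ on the background covariate as well as on treatment. The randomized groups are balanced on that covariate in expectation, so their outcome difference should fall near the assumed effect of -1.0.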
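A simple way to see why interference matters is to let part of the treatment leak into the control group. The sketch below is a deliberately stylized assumption rather than a model of any real trial: the hypothetical `spillover` parameter is the fraction of an assumed treatment effect of -1.0 that control participants receive indirectly, for instance through conversations with treated classmates.

```python
import random
from statistics import mean

def run_trial(spillover, seed=3, n=20_000):
    """Difference in group means when a fraction of the effect reaches controls."""
    rng = random.Random(seed)
    treated_y, control_y = [], []
    for _ in range(n):
        baseline = rng.gauss(5.0, 1.0)
        if rng.random() < 0.5:
            # Treated participants receive the full assumed effect of -1.0.
            treated_y.append(baseline - 1.0)
        else:
            # Interference: controls receive `spillover` times the effect.
            control_y.append(baseline - 1.0 * spillover)
    return mean(treated_y) - mean(control_y)

no_interference = run_trial(spillover=0.0)
with_interference = run_trial(spillover=0.5)

print(f"estimate with SUTVA intact:     {no_interference:.3f}")
print(f"estimate with 50% spillover:    {with_interference:.3f}")
```

With no spillover, the estimate should sit near the assumed effect of -1.0. With half of the effect leaking to the control group, the same design should recover only about -0.5, because the control world no longer represents what would have happened without the intervention.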
2.3. Conceptualizing causes
We began with causal concepts. Next, we translated those concepts into empirical parameters and assumptions in the form of an RCT. The final step is to export a causal conclusion. Yet drawing an appropriate causal conclusion is not always straightforward. For one, there are many different kinds of causal relationships – some are general rules, others are specific instances; some are direct, whereas others are bridged by a cascade of intermediary forces (Hausman, Reference Hausman2005; Rottman & Hastie, Reference Rottman and Hastie2014). Moreover, a statistical parameter, by itself, provides little insight into the type of observed causal relationship. An ATE reveals only that there is a mean difference between groups. When it comes to interpreting instances of counterfactual control, however, philosophers have established a set of dimensions along which causal relationships can be conceptualized (see Woodward [Reference Woodward2010] on stability, specificity, and levels of explanation). Because RCTs simulate counterfactual conditions, these dimensions can be readily exported and applied to interpreting ATEs (see Deaton & Cartwright, Reference Deaton and Cartwright2018). In most of the social sciences, ATEs are perhaps best understood by describing what they are not: they are not uniform, not unitary, and not explanatory.
Uniform causes produce effects in the same way every time. For example, atmospheric pressure invariably causes a barometer to drop. At least in theory, we often presume that treatment effects will behave uniformly (unit homogeneity; Holland, Reference Holland1986). Despite this expectation, we often observe substantial heterogeneity in treatment effects (Angrist, Reference Angrist2004; Kent, Rothwell, Ioannidis, Altman, & Hayward, Reference Kent, Rothwell, Ioannidis, Altman and Hayward2010). This is an important indication of the type of observed causal relationship – it tells us that the observed relationship is probabilistic rather than deterministic. Heterogeneity indicates that the cause does not affect the outcome in the exact same way across person, place, or time. And indeed, this is what we find in RCTs: “there is no warrant for the convenient assumption that the ATE estimated in a specific RCT is an invariant parameter, nor that the kinds of interventions and outcomes we measure in typical RCTs participate in general causal relations” (Deaton & Cartwright, Reference Deaton and Cartwright2018, pp. 13–14). This limits the idiographic and external validity of ATEs. They do not tell us about singular causes (i.e., that X is the cause of Y in a specific instance for a specific person), nor do they tell us about general claims (i.e., that X will cause Y in all places at all times) (see Cartwright [Reference Cartwright, Skyrms and Harper1988] for a discussion of singular vs. generic causes).
Unitary causes produce effects entirely on their own. Atmospheric pressure, for example, is singularly capable of dropping the reading on a barometer. Heterogeneity in treatment effect provides another important indication here. It tells us that the causal relationship is dependent on the presence of other factors (i.e., moderators). Adolescents with a large emotional vocabulary may show a greater reduction in aggressive behavior after a compassion intervention than those with more limited vocabularies. In this case, compassion is not causally exclusive, but rather, its effect on violent and aggressive behavior is embedded within a system of other causes whose collective functioning brings about the outcome. This renders ATEs local parameters that reflect causes that are inextricably tied to the demographic composition and environmental context of the measured sample.
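As a rough illustration of why an ATE is neither uniform nor unitary, the sketch below uses hypothetical potential outcomes for six adolescents. The aggression scores, the `large_vocab` moderator flag, and the function name are all assumptions made for illustration; in a real RCT only one of the two potential outcomes is ever observed for each person.

```python
from statistics import mean

# Hypothetical potential outcomes: aggression score without (y0) and with (y1)
# the compassion curriculum. All values are made up for illustration.
units = [
    {"y0": 10, "y1": 4, "large_vocab": True},
    {"y0": 9, "y1": 4, "large_vocab": True},
    {"y0": 11, "y1": 6, "large_vocab": True},
    {"y0": 10, "y1": 9, "large_vocab": False},
    {"y0": 8, "y1": 8, "large_vocab": False},
    {"y0": 12, "y1": 10, "large_vocab": False},
]


def average_effect(rows):
    """Mean of the unit-level differences y1 - y0."""
    return mean(r["y1"] - r["y0"] for r in rows)


# The ATE averages over unit-level effects that differ from person to person.
ate = average_effect(units)

# Conditioning on the moderator shows that the effect depends on another factor.
effect_large = average_effect([r for r in units if r["large_vocab"]])
effect_small = average_effect([r for r in units if not r["large_vocab"]])
```

In this made-up sample, no individual's effect equals the ATE, and the average reduction is larger in the subgroup with a large emotional vocabulary, which is the sense in which the ATE is a local summary of the sample's composition.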
Explanatory causes provide a description of how the cause brought about the effect. For example, atmospheric pressure causes a barometer to drop by changing the balance of the weight of mercury and the air pressure inside of the barometer. In contrast, ATEs tell us only that changing one variable will change the other, without explaining how this change comes about (Woodward, Reference Woodward2002). This explanatory, or causally distal, gap divorces causes from mechanisms. Mechanisms can be conceptualized as complex causal systems whose interrelated parts collectively produce an effect (Glennan, Reference Glennan1996). Identifying mechanisms requires (a) decomposing the effect into the component processes extending from cause to effect and (b) articulating how those processes function together to generate an outcome (Craver & Darden, Reference Craver and Darden2013). These are different concepts than those at work in RCTs, so their empirical validation requires a different set of scientific practices.
2.4. First- and second-generation causal knowledge
In 1949, John Cade reported a series of case studies finding that lithium salts helped to pacify “psychotic excitement” (Cade, Reference Cade1949). In his initial report, Cade called for “controlled observation of a sufficient number of treated and untreated patients” to test more conclusively whether differences in lithium administration caused differences in manic symptoms (Cade, Reference Cade1949, p. 518). Seventy years, and dozens of controlled trials later, lithium has been heralded as a “psychiatric success story” (Draaisma, Reference Draaisma2019, p. 584). The well-established knowledge that lithium makes an average difference in manic symptoms has been packaged into the first line of treatment for bipolar disorder in clinical practice (Draaisma, Reference Draaisma2019; Volkmann, Bschor, & Köhler, Reference Volkmann, Bschor and Köhler2020). “I don't believe in God,” wrote Jaime Lowe, “but I believe in Lithium” (Lowe, Reference Lowe2015, para. 35).
The “controlled observations” upon which the efficacy of lithium was established constitute what we refer to as first-generation causal knowledge. This is the knowledge that a variable makes a non-uniform, non-unitary, and non-explanatory (i.e., average) difference in an outcome. As we have demonstrated so far, this is the type of information that is gained from standard counterfactual comparisons under the potential outcomes model. The promise of first-generation causal knowledge has historically been that, despite everything it lacks, it suggests a target that can be manipulated to change the probability of an outcome on a large scale (Gueron & Rolston, Reference Gueron and Rolston2013). Because we know that lithium treatment causes an average difference in manic symptoms, we can prescribe lithium to bipolar patients, in the hopes of reducing the severity of their manic symptoms, even if we lack a clear sense of who is most likely to benefit from this treatment or how this causal relationship comes about.
And yet, for all the difference that lithium has made, not knowing exactly how or for whom this treatment works has limited its utility. Lithium is effective in fewer than one in three patients and, even after 70 years of research, its mechanisms of action remain largely undefined (Alda, Reference Alda2015; Harrison et al., Reference Harrison, Cipriani, Harmer, Nobre, Saunders, Goodwin and Geddes2016). Lithium is far from the only intervention with a positive ATE, high heterogeneity in its effects, and unclear mechanism(s) of action. As a result, scientists have become increasingly vocal about the limitations of first-generation causal knowledge (Bailey et al., Reference Bailey, Duncan, Cunha, Foorman and Yeager2020). In the behavioral and social sciences, as seminal findings have failed to replicate, generalize, and sustain over time, scholars have criticized “the narrow emphasis on discovering main effects and the common practice of drawing inferences about an intervention's likely effect at a population scale based on findings in haphazard convenience samples that cannot support such generalizations” (Bryan et al., Reference Bryan, Tipton and Yeager2021, p. 1). If social science is to advance and reach more people, we need to “revolutionize” our approach to identifying and applying causal knowledge (Bryan et al., Reference Bryan, Tipton and Yeager2021, p. 1).
In many corners of science, this revolution has already started. Once again, we can look to lithium treatment for guidance. Knowing that lithium creates an average difference in manic symptoms is useful not only because it identifies an intervention target, but also because it identifies a causal pathway that can be investigated to better understand the pathophysiology of bipolar disorder and sources of heterogeneity in its treatment. For example, recent research has found that variation in properties of neuronal signaling explains differences in response to lithium (Mertens et al., Reference Mertens, Wang, Kim, Diana, Pham, Yang and Yao2015). In particular, studies of lithium responders versus non-responders found that the former show a reduction in the hyperexcitability of hippocampal dentate gyrus neurons, suggesting that this “might be the mechanism that allows [lithium] to improve symptoms in both mania and depression phases” (Stern et al., Reference Stern, Santos, Marchetto, Mendes, Rouleau, Biesmans and Gage2018, p. 1461). With this knowledge, these researchers have been able to predict more accurately who will respond to lithium, to test whether alternate treatments reduce neuronal hyperexcitability in lithium non-responders, and to discover highly specific electrophysiological processes that serve as candidates for pharmacological intervention (Santos et al., Reference Santos, Linker, Stern, Mendes, Shokhirev, Erikson and Gage2021; Stern et al., Reference Stern, Santos, Marchetto, Mendes, Rouleau, Biesmans and Gage2018).
All of this followed from the first-generation knowledge that lithium makes an average difference in symptoms of mania. What initially appeared to be a critical flaw in the results from an RCT – that the results are not perfectly portable across all people – turned out to be a boon for scientific discovery. By continuing to investigate the causal pathway, and more specifically, heterogeneity in the causal pathway, we have been able to migrate our relatively shallow understanding of this causal effect to a position of greater causal depth. These types of investigations represent a progression toward what we refer to as second-generation causal knowledge. This is knowledge that provides a “clear sense of the mechanisms of change through which effects (intended and unintended) occur, which specific [causal] components and combinations are likely to be most (and least) effective, and in what contexts and with whom such effects will potentially be replicable” (Bonell, Fletcher, Morton, Lorenc, & Moore, Reference Bonell, Fletcher, Morton, Lorenc and Moore2012, p. 10). The promise of second-generation causal knowledge is that, by identifying processes and contexts through which the effect emerges, we will be able to increase uniformity, improve understanding, and isolate steps in the causal path that serve as candidates for intervention.
2.5. Summary
In this section, we discussed one of the primary tools that social scientists use to test causation: RCTs. The counterfactual was introduced as the primary causal concept that gives RCTs causal power, with particular emphasis placed on counterfactual situations that involve manipulation and control. The construction of parallel worlds and the comparison of potential outcomes across these worlds was discussed as the foundation of the ATE. Guidelines for interpreting ATEs in the context of RCTs were advanced by detailing what these causal relationships are not: they are not the same across all people (uniform), they are not isolable causes (unitary), and they are not explanations for how a cause changes an effect (explanatory). Using the example of lithium administration, we highlighted how the understanding that a cause creates an average difference in an outcome (first-generation causal knowledge) is traditionally used to identify and implement large-scale intervention targets. We reviewed the limitations of this application and highlighted how second-generation approaches can improve our understanding of the mechanisms of action generating an effect and sources of heterogeneity in treatment outcomes. In the next section, we carry forward this experimental and interpretational framework to scaffold our definition of what it means for genes to be causes.
3. Causal inference in genetic research designs
3.1. Overview of behavior genetics
Tracing the causes of human behavior has been of scholarly interest since long before social scientists were using RCTs to manipulate measured variables. In every epoch of documented history, heredity has been considered one such source of human action and decision making (see Loehlin [Reference Loehlin and Kim2009] for a complete history of behavior genetics). It was only relatively recently, however, that two major breakthroughs transformed this longstanding endeavor from speculation to quantification. The first came in 1869, when Francis Galton redefined the study of heredity as the study of measurable similarities between relatives (Galton, Reference Galton1869; Kevles, Reference Kevles1995). Then in 2001, researchers successfully sequenced the human genome, making it possible to observe the composition of human DNA (Venter et al., Reference Venter, Adams, Myers, Li, Mural, Sutton and Zhu2001). These empirical milestones have provided critical scientific insight into the etiology of complex human outcomes, and it turns out that the pre-empirical scholars were right: genes do cause human behavior. Arriving at this conclusion, however, requires more than simply obtaining estimates of genetic associations. Once again, “no causes in, no causes out.”
These methodological advances have formed the foundation of the two principal methodologies used in behavior genetics: twin studies and genome-wide association studies (GWASs). In twin studies, pairs of monozygotic twins, sharing 100% of their segregating genetic variance (see Footnote 3), are contrasted with pairs of dizygotic twins, who share only 50%. The total variance of a measured trait can then be decomposed into three latent sources: additive genetic variance (a²), shared environmental variance (c²), and nonshared environmental variance (e²) (Plomin, DeFries, Knopik, & Neiderhiser, Reference Plomin, DeFries, Knopik and Neiderhiser2013). Of primary interest to behavior geneticists is the proportion of phenotypic variance attributable to additive genetic variance, also known as a trait's heritability (h²). Similar to an R² effect size, heritability is useful in that it quantifies the extent to which phenotypic differences are statistically accounted for by genetic differences (writ large), but it fails to specify which genes or, crucially, how those genes are responsible for producing phenotypic differences. Without such mechanistic knowledge, it can be difficult or impossible to predict whether genetic influences will be portable across environmental contexts (Mostafavi et al., Reference Mostafavi, Harpak, Agarwal, Conley, Pritchard and Przeworski2020; Uchiyama, Spicer, & Muthukrishna, Reference Uchiyama, Spicer and Muthukrishna2021).
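The text does not specify how the three variance components are estimated. One classical approximation, sketched here under the usual simplifying assumptions (purely additive genetic effects, equal environments for both twin types, no assortative mating), is Falconer's set of formulas based on the twin correlations. The correlation values and the function name are hypothetical; contemporary twin studies more often fit structural equation models instead.

```python
def falconer_ace(r_mz, r_dz):
    """Falconer's estimates from monozygotic and dizygotic twin correlations."""
    h2 = 2 * (r_mz - r_dz)  # additive genetic share of variance (heritability)
    c2 = 2 * r_dz - r_mz    # shared environmental share
    e2 = 1 - r_mz           # nonshared environmental share (includes error)
    return h2, c2, e2


# Hypothetical twin correlations for some measured trait.
h2, c2, e2 = falconer_ace(r_mz=0.60, r_dz=0.40)
```

By construction the three components sum to one, which mirrors the decomposition of total phenotypic variance described above.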
The breakthrough in genetic sequencing modernized the estimation of genetic associations from a single unobserved variable to millions of observed variables. In GWASs, individual genetic sites (known as single-nucleotide polymorphisms [SNPs]) are each entered as the independent variable in a separate linear regression predicting a measured phenotype. This hypothesis-free approach tests associations with millions of SNPs in order to glean insight about which specific portions of the genome are associated with the occurrence or degree of a trait (Corvin, Craddock, & Sullivan, Reference Corvin, Craddock and Sullivan2010; see Box 1 for a technical primer on GWASs).
Single-nucleotide polymorphisms (SNPs) are the sites of DNA that commonly vary in the population (>1%). Each SNP is composed of a pair of allelic variants, or two of four possible genetic “letters” (adenine, thymine, cytosine, and guanine). SNP genotyping identifies each genotyped individual's pair of allelic variants at each polymorphic site (Perkel, Reference Perkel2008). Because of the correlation structure among SNPs (known as linkage disequilibrium), it is possible to impute values for many SNPs not measured during the genotyping process, allowing for the analysis of millions of genetic variants in relation to an outcome. Prior to conducting a GWAS, the raw allelic structure of each SNP is converted to an ordinal variable reflecting the number of minor alleles (i.e., the less commonly occurring allele in the population) that an individual possesses. The number of minor alleles at each SNP is what is then associated with the outcome to obtain an SNP effect size. SNP effect sizes are used for a growing number of applications, including annotating the biological function of identified SNPs (Watanabe, Taskesen, van Bochoven, & Posthuma, Reference Watanabe, Taskesen, van Bochoven and Posthuma2017), constructing polygenic scores (Sugrue & Desikan, Reference Sugrue and Desikan2019), modeling genetic associations with other traits (Grotzinger et al., Reference Grotzinger, Rhemtulla, de Vlaming, Ritchie, Mallard, Hill and Tucker-Drob2019), and estimating SNP-based heritability (Yang, Zeng, Goddard, Wray, & Visscher, Reference Yang, Zeng, Goddard, Wray and Visscher2017).
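The coding and association steps described above can be sketched for a single SNP as follows. This is a minimal illustration, not a GWAS pipeline: the genotypes, phenotype values, and function names are hypothetical, and real analyses add covariates (such as ancestry components) and repeat the regression across millions of SNPs.

```python
def minor_allele_count(genotype, minor_allele):
    """Number of copies (0, 1, or 2) of the minor allele in a genotype like 'AG'."""
    return sum(1 for allele in genotype if allele == minor_allele)


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical genotypes at one SNP (minor allele "G") and a made-up phenotype.
genotypes = ["AA", "AG", "GG", "AG", "AA", "GG"]
phenotype = [1.0, 1.5, 2.0, 1.5, 1.0, 2.0]

# Step 1: convert raw alleles to an ordinal minor-allele count.
counts = [minor_allele_count(g, "G") for g in genotypes]

# Step 2: regress the phenotype on the count to obtain the SNP effect size.
snp_effect = ols_slope(counts, phenotype)
```

Here the slope is the expected change in the phenotype per additional copy of the minor allele, which is what an SNP effect size represents for a continuous outcome.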
The estimates from these linear regressions, called SNP effect sizes, represent either the probability that cases differ from controls at a particular genetic site or the magnitude of the association between a particular genetic site and a continuous outcome. The entire set of SNP effects (collectively referred to as summary statistics) is in turn used for a wide array of applications. A popular application in the social sciences is using GWAS summary statistics to create a polygenic score, which aggregates information from all SNPs into a single index of each individual's genetic propensity for a trait (Sugrue & Desikan, Reference Sugrue and Desikan2019).
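A common way to aggregate SNP effects into a single index, sketched here as an assumption since the text does not give a formula, is a weighted sum of each person's minor-allele counts with the SNP effect sizes as weights. The SNP identifiers, effect sizes, and individuals below are hypothetical, and applied work typically also adjusts the weights for linkage disequilibrium.

```python
def polygenic_score(allele_counts, effect_sizes):
    """Weighted sum of minor-allele counts, using SNP effect sizes as weights."""
    return sum(effect_sizes[snp] * count for snp, count in allele_counts.items())


# Hypothetical summary statistics (SNP identifiers and effects are made up).
effect_sizes = {"snp_1": 0.5, "snp_2": -0.25, "snp_3": 0.125}

# Hypothetical minor-allele counts (0, 1, or 2) for two individuals.
person_a = {"snp_1": 2, "snp_2": 0, "snp_3": 1}
person_b = {"snp_1": 0, "snp_2": 2, "snp_3": 1}

score_a = polygenic_score(person_a, effect_sizes)
score_b = polygenic_score(person_b, effect_sizes)
```

The two individuals carry the same total number of minor alleles but receive different scores, because the score depends on which alleles they carry and on the sign and size of each SNP effect.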
Exactly what h² estimates and SNP associations tell us about the relationship between genes and behavior has been the source of much discourse and much disagreement (see Fig. 1). At one end of the spectrum are those who claim – quite extraordinarily – that these coefficients prove that traits such as intelligence are genetically determined and that differences in ability between racial and ethnic groups must be the result of hardwired genetic differences (Herrnstein & Murray, Reference Herrnstein and Murray1996; Jensen, Reference Jensen1969; Murray, Reference Murray2020). At the other end are those who claim that, not only do these coefficients fail to represent genetic determinism or innate group differences, they fail to represent anything meaningful about how genes influence behavior (Block, Reference Block1995; Lewontin, Reference Lewontin1974/2006).
Both extremist views are mistaken. Heritability estimates and SNP associations are neither supra-causal nor inherently meaningless. They are simply point estimates from statistical models. In the same way that a statistical association between cannabis use and psychotic symptoms would not imply that cannabis use is the ultimate or fixed source of psychosis, nor that a population with a high incidence of psychosis must therefore be using more cannabis, h² estimates and SNP effects imply neither deterministic associations nor between-group differences. What can be implied, however, is that using cannabis potentially increases risk for developing psychotic symptoms. This is a causal hypothesis that must be evaluated using study designs, such as RCTs, that appropriately instantiate causal concepts. Likewise, most behavior geneticists believe that twin studies and GWASs have utility in identifying genetic factors that potentially predispose for phenotypic differences between people (Visscher, Hill, & Wray, Reference Visscher, Hill and Wray2008).
But if the conclusion that we aim to defend is that genes cause behavioral and psychological outcomes, then clearly, we need something more than genetic associations alone. For genes to be considered causes, h² estimates and SNP effects need to be bolstered by the same causal concepts that privilege some t-statistics as ATEs. We need to know what the trait would have looked like if the genotype had been different. Unlike in RCTs, however, we cannot manipulate the treatment to simulate the counterfactual. We cannot randomly assign people to receive a certain genotype – at least not in any agreeable, ethical, or disseminable way. Fortunately, there is no need. The counterfactual has already been simulated for us.
3.2. Natural experiment of genetic inheritance
Consider this portentous lesson from history. In the fall of 1918, an influenza pandemic hit the United States without warning. By January of 1919, the virus had mostly disappeared. This meant that babies born just a few months apart experienced vastly different prenatal conditions. In effect, the virus had manipulated the prenatal environment, randomizing adjacent birth cohorts into those exposed to pandemic conditions – including either the flu itself or related stressors – and those experiencing relatively normal, or control, conditions. These cohorts represented parallel worlds that could be compared at different developmental stages to examine the causal effect of prenatal conditions on economic outcomes. Leveraging this natural randomization to simulate conventional RCT methodology, researchers concluded that in utero exposure to pandemic conditions caused lower educational attainment, lower income, and lower socioeconomic status in adulthood (Almond, Reference Almond2006). The real-world generation of these counterfactual conditions typifies a natural experiment, in which treatment and control groups are meted out on the basis of a naturally occurring randomization mechanism.
In the case of genes, counterfactual conditions are created through meiosis, an instance of naturally occurring biological randomization. Meiosis is a process of cell division and DNA recombination that results in the production of unique sex cells (i.e., gametes). This process is essentially a natural manipulation of parental DNA. During recombination, segments of DNA from paired (i.e., homologous) chromosomes cross over in novel patterns to create new chromosomes to be inherited by offspring. Recombination is a primary source of intergenerational genetic variation (Nachman, Reference Nachman2002; Spencer et al., Reference Spencer, Deloukas, Hunt, Mullikin, Myers, Silverman and McVean2006), and the amount of variation created is vast. Within a single person, recombination results in the production of over 8 million unique chromosomal combinations (Batmanian, Ridge, & Worrall, Reference Batmanian, Ridge and Worrall2011). When combined with a partner's gametes, there are over 70 trillion genotypes that an offspring could become (Carroll, Reference Carroll2020).
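One way to arrive at figures of this magnitude is the following back-of-the-envelope count. It assumes, as a simplification not stated in the text, that the figures tally only which member of each of the 23 chromosome pairs a gamete receives; crossing over within chromosomes would push the true number of possibilities far higher.

```python
CHROMOSOME_PAIRS = 23  # humans carry 23 pairs of chromosomes

# Each gamete receives one chromosome from each pair: 2 choices per pair.
gametes_per_parent = 2 ** CHROMOSOME_PAIRS

# An offspring combines one gamete from each of two parents.
offspring_combinations = gametes_per_parent ** 2

print(f"{gametes_per_parent:,} chromosomal combinations per parent")
print(f"{offspring_combinations:,} combinations per couple")
```

Under this assumption the per-parent count is 2^23 (about 8.4 million) and the per-couple count is 2^46 (about 70.4 trillion), which matches the orders of magnitude quoted above.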
Manipulation is not synonymous with randomization, however. If the variation produced by meiosis creates different treatment conditions, it must also be the case that the inheritance, or assignment, of these conditions is random. Two principles established by nineteenth-century geneticist Gregor Mendel reassure us of the validity of randomization: (1) the Law of Segregation states that at every point in the genome, offspring randomly inherit one allele from each parent and (2) the Law of Independent Assortment states that alleles will segregate to gametes independently of one another (Davies et al., Reference Davies, Howe, Brumpton, Havdahl, Evans and Davey Smith2019). The astute reader will note that not all alleles are inherited entirely independently of each other because of linkage (National Human Genome Research Institute, 2022), the tendency for DNA segments that are positioned close together on a chromosome to be inherited together. We return to defining linkage and considering its implications for causal inference below.
Thus, at each genetic site, you inherit two alleles (one from your mother and one from your father), but which of the possible parental alleles you inherit is a completely random event. Even though linkage makes it so that we inherit groups of alleles together (i.e., haplotype blocks) (Phillips et al., Reference Phillips, Lawrence, Sachidanandam, Morris, Balding, Donaldson and Cardon2003), Mendel's principles are just as aptly applied to haplotype blocks as to SNPs – haplotype blocks are, at least in part, randomly created (Wang, Akey, Zhang, Chakraborty, & Jin, Reference Wang, Akey, Zhang, Chakraborty and Jin2002), and inherited independently of one another (Browning & Browning, Reference Browning and Browning2011). Crucially, this randomness gives genetic inheritance its experimental infrastructure (Davey Smith & Ebrahim, Reference Davey Smith and Ebrahim2003): just as we can compare outcomes between treatment and control groups in the context of an RCT in order to gain insight about the average causal effect of the treatment, we can compare outcomes between family members who inherited different genes in order to gain insight about the causal effect of genotype. Confidence in these counterfactual conditions, however, depends on how well they meet those four critical assumptions of all randomized experiments – independence, sample homogeneity, potential exposability, and SUTVA. We consider each one in turn.
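The random assignment implied by the Law of Segregation can be made concrete by enumerating the equally likely outcomes at a single genetic site. The sketch below is illustrative only: the allele labels "A" and "a" and the function name are hypothetical, and linkage between sites is ignored.

```python
from collections import Counter
from itertools import product


def offspring_distribution(mother, father):
    """Probability of each offspring genotype at one site under random segregation.

    Each parent is a pair of alleles, e.g. ("A", "a"). Each allele is
    transmitted with probability 1/2, independently across the two parents.
    """
    counts = Counter("".join(sorted(pair)) for pair in product(mother, father))
    total = sum(counts.values())
    return {genotype: n / total for genotype, n in counts.items()}


# Two heterozygous parents: every sibling draws from this same distribution.
dist = offspring_distribution(("A", "a"), ("A", "a"))
```

Because full siblings draw independently from the same distribution, differences in genotype between them arise from chance alone, which is what allows sibling comparisons to function like treatment and control groups.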
3.2.1. Independence
At face value, Mendel's laws satisfy the independence assumption. If genetic variants are randomly and independently assigned, then we should expect no systematic dependency between genotype and outcome (Holland, Reference Holland and Clogg1988). In actuality, there exist several possible violations of independence.
First, because of evolutionary factors and non-random mating patterns, different subpopulations, such as those with different ancestral backgrounds, have different frequencies of certain alleles, known as population stratification (Cardon & Palmer, Reference Cardon and Palmer2003). Discrepancies in allele frequency across different groups of people are often systematically associated with environmental differences (environmental confounding), non-ancestral-related genetic differences (genetic confounding), and mate selection (assortative-mating confounding) (Young, Benonisdottir, Przeworski, & Kong, Reference Young, Benonisdottir, Przeworski and Kong2019). This means that if we estimate a genetic association in a sample of people who are not close biological relatives, we cannot separate the causal effect of the gene from any of these confounding sources. Conventional GWASs do their best to mitigate these problems. For instance, they are conducted in ancestrally homogeneous samples (Mills & Rahal, Reference Mills and Rahal2019), and even within these samples, population stratification is often corrected for by controlling for ancestry-based principal components (Price et al., Reference Price, Patterson, Plenge, Weinblatt, Shadick and Reich2006) or using linear mixed models (Yang, Zaitlen, Goddard, Visscher, & Price, Reference Yang, Zaitlen, Goddard, Visscher and Price2014). But none of these practices guarantees independence (Haworth et al., Reference Haworth, Mitchell, Corbin, Wade, Dudding, Budu-Aggrey and Timpson2019).
The only way to surmount this problem is to examine genetic associations relative to parental genotypes, for example, by directly comparing an offspring to both of its parents or by comparing siblings from the same family (Brumpton et al., Reference Brumpton, Sanderson, Heilbron, Hartwig, Harrison, Vie and Davies2020; Young et al., Reference Young, Frigge, Gudbjartsson, Thorleifsson, Bjornsdottir, Sulem and Kong2018). For any individual, “each of the meiosis and conception events that determined [a person's] DNA is an independent event conditional on the parental genotypes” (Davies et al., Reference Davies, Howe, Brumpton, Havdahl, Evans and Davey Smith2019, p. R174, emphasis added). Here are Mendel's laws in action: the genotype of any individual is a random and independent selection of genes from their parents. Because siblings inherit their genes from the same pool of potential genotypes, the pitfalls of population structure can be avoided if the comparison of siblings is appropriately conditioned on their parental genotypes (Fletcher, Wu, Li, & Lu, Reference Fletcher, Wu, Li and Lu2021; Zaidi & Mathieson, Reference Zaidi and Mathieson2020). Novel designs such as within-sibship GWASs and relatedness disequilibrium regression (RDR) exploit the randomization in meiosis that renders treatment assignment and outcome independent (Howe et al., Reference Howe, Nivard, Morris, Hansen, Rasheed and Cho2021; Young et al., Reference Young, Frigge, Gudbjartsson, Thorleifsson, Bjornsdottir, Sulem and Kong2018).
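The logic of comparing siblings can be illustrated with a small simulation. This is a toy sketch under loudly stated assumptions, not an implementation of within-sibship GWAS or RDR: it uses a single SNP, two hypothetical subpopulations that differ in both allele frequency and environment, a made-up true effect of 0.5, and simple difference scores between two siblings per family.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n_families=4000, true_effect=0.5, seed=1):
    rng = random.Random(seed)
    g1, g2, y1, y2 = [], [], [], []
    for _ in range(n_families):
        # Two hypothetical subpopulations that differ in allele frequency
        # AND in environment (the source of confounding).
        in_pop_b = rng.random() < 0.5
        freq, env = (0.8, 2.0) if in_pop_b else (0.2, 0.0)
        mother = [int(rng.random() < freq) for _ in range(2)]
        father = [int(rng.random() < freq) for _ in range(2)]
        for g_list, y_list in ((g1, y1), (g2, y2)):
            # Meiosis: each sibling gets one randomly chosen allele per parent.
            g = rng.choice(mother) + rng.choice(father)
            g_list.append(g)
            y_list.append(true_effect * g + env + rng.gauss(0, 1))
    # Population design: one person per family, confounded by stratification.
    population_estimate = slope(g1, y1)
    # Within-family design: sibling differences cancel the shared environment.
    sib_estimate = slope([a - b for a, b in zip(g1, g2)],
                         [a - b for a, b in zip(y1, y2)])
    return population_estimate, sib_estimate


population_estimate, sib_estimate = simulate()
```

Under these assumptions the population estimate should be inflated well above the true effect, because allele frequency and environment covary across subpopulations, whereas the sibling-difference estimate should land close to 0.5, because differences between siblings reflect only the random draw from their shared parents.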
Second, alleles are not inherited completely independently from each other. Rather, DNA segments that are positioned closely together on a chromosome are more likely to be inherited together, as there is a lower probability of a recombination event occurring between them. As a loose analogy, if you shuffle a deck of cards and then split the deck, two cards that are right next to each other in the deck before shuffling are more likely to end up in the same half of the deck than cards that are far apart from each other. This co-inheritance results in linkage disequilibrium (LD), that is, a correlation between alleles.
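The correlation between alleles at two sites is commonly summarized by the squared correlation r², computed from haplotype frequencies. The sketch below uses the standard definitions of the disequilibrium coefficient D and r²; the haplotype data and function name are hypothetical, and the function assumes both sites are polymorphic in the sample (otherwise the denominator is zero).

```python
def ld_r_squared(haplotypes):
    """Squared correlation (r^2) between alleles at two sites.

    `haplotypes` is a list of (a, b) pairs, each coded 1 if the haplotype
    carries the reference allele at that site and 0 otherwise.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical haplotypes: alleles that always travel together vs. freely mixed.
tightly_linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]

r2_linked = ld_r_squared(tightly_linked)
r2_unlinked = ld_r_squared(unlinked)
```

When the two alleles are always co-inherited, r² equals 1 and one site is a perfect proxy for the other; when they assort freely, r² equals 0.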
The issue of LD raises a more general issue, which we refer to as the resolution of genetic effects. The highest resolution for genetic causes is to identify an individual genetic variant. When geneticists talk about identifying a “causal variant,” they are using a high resolution for genetic effects: a C1 allele in the cystic fibrosis transmembrane conductance regulator (CFTR) gene, for example, causes cystic fibrosis. The lowest resolution for genetic causes is the entire genome. A method such as RDR can conclude that, if people had inherited different genetic segments from their parents, their phenotypes would be different. This is a causal conclusion but one that is silent regarding which genetic variants are causally relevant.
An intermediate resolution for conceptualizing genetic causes, and the resolution most relevant for understanding the results of GWASs, is neither the individual variant nor the entire genome, but instead a set of alleles that are all in high LD with each other (but not in LD with other alleles). A within-sibling GWAS leverages the natural experiment of meiosis, but it does not measure every possible genetic variant. Thus, a “hit” in a within-sibling GWAS, that is, an SNP that is associated with within-sibling differences in phenotypes, might be the causal variant, or it might be in LD with the causal variant. That is, the SNP is best considered a measure of an underlying genetic cause, while the specific causal variant often remains unknown.
In order to build an intuition about how an SNP can be a measure of a cause, rather than the cause itself, it might be helpful to consider other types of natural experiments. Consider, for example, the Dutch Hunger Winter studies. In 1944, the Nazis retaliated against Dutch resistance to occupation by imposing an embargo on transport to western Holland, causing a severe famine in large cities. By November, food rations were 450 calories per day, and the famine continued until Holland was liberated by the Allied armies in May 1945.
The Dutch famine has become a famous quasi-experiment for studying the effects of prenatal exposure to caloric restriction on adult health and cognition. Because the famine affected cities in a geographically circumscribed area in west Holland, for a circumscribed period of time, exposure to famine can be treated as if random, by comparing individuals conceived during the famine to individuals in the same cities who were conceived before or after the famine, and to individuals conceived at similar times in unaffected cities.
In a landmark 1972 study on the effects of prenatal famine exposure on cognition, males who appeared for military induction at age 18 were asked for their date and place of birth, which researchers used to assign them to “exposed” or “unexposed” groups (Stein, Susser, Saenger, & Marolla, Reference Stein, Susser, Saenger and Marolla1972). That is, we can differentiate between the study's cause of interest (prenatal exposure to famine) and the study's measurement of that cause (participant's self-report of date and place of birth). Obviously, a participant writing down his date of birth is not the cause of his adult health. Rather, his self-report is an indicator used to infer his membership in a group that is as-if randomly exposed to the putative cause. Such a situation, where researchers are relying on potentially imperfect measures of putative causes, is common in natural experiments where researchers are not assigning participants to treatment and control groups, but are rather ascertaining exposures after the fact. Similar to the Dutch Hunger Winter researcher who has not randomly assigned their participants to be exposed to famine or not, a GWAS researcher has not assigned people to genotypes. Nature has randomly assigned offspring to genotypes from their parents, and the GWAS researcher is left trying to ascertain to which genotypes people have been assigned. An SNP array is an imperfect measure of that random assignment.
Putting these lines of reasoning together, the natural experiment of meiosis guarantees that segments of the parental genome are independently and randomly assigned to offspring, but there remains non-independence of specific alleles that are co-inherited and in LD. A within-family GWAS, then, will be able to successfully identify that “genes” have a causal effect on phenotypes, but “genes” are studied at an intermediate level of resolution, encompassing all alleles in LD with the measured SNP. Researchers can then use “fine mapping” techniques to gain higher resolution (LaPierre et al., Reference LaPierre, Taraszka, Huang, He, Hormozdiari and Eskin2021).
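The point that a measured SNP stands in for every allele in LD with it can be illustrated with a small simulation. The sketch below is ours and rests on stated assumptions rather than on any cited study: two tightly linked biallelic sites with no recombination between them, allele frequencies of 0.5, a hypothetical `tag_prob` that sets how often the measured allele matches the causal allele on a haplotype, and an arbitrary causal effect of 1.0. Under these assumptions, a `tag_prob` of 0.9 implies a haplotype-level correlation of 0.8, so we would expect the sibling-difference slope on the measured SNP to be attenuated to roughly 0.8 of the slope on the causal variant.

```python
import numpy as np


def simulate_ld_sibling_gwas(n_families=20_000, beta_causal=1.0, tag_prob=0.9, seed=0):
    """Sibling-difference slopes for a causal variant and a linked, measured SNP."""
    rng = np.random.default_rng(seed)

    # Parental haplotypes, indexed as (family, parent, haplotype copy).
    causal = rng.integers(0, 2, size=(n_families, 2, 2))
    # The measured allele matches the causal allele with probability tag_prob (LD).
    same = rng.random((n_families, 2, 2)) < tag_prob
    measured = np.where(same, causal, 1 - causal)

    def transmit():
        # Each parent passes one whole haplotype at random (no recombination assumed).
        pick = rng.integers(0, 2, size=(n_families, 2, 1))
        c = np.take_along_axis(causal, pick, axis=2)[:, :, 0].sum(axis=1)
        m = np.take_along_axis(measured, pick, axis=2)[:, :, 0].sum(axis=1)
        return c, m

    c1, m1 = transmit()
    c2, m2 = transmit()

    # Only the causal site affects the phenotype; siblings share a family effect.
    family = rng.normal(size=n_families)
    y1 = beta_causal * c1 + family + rng.normal(size=n_families)
    y2 = beta_causal * c2 + family + rng.normal(size=n_families)

    def slope(x, y):
        return float(np.cov(x, y)[0, 1] / np.var(x, ddof=1))

    dy = y1 - y2
    return {"causal": slope(c1 - c2, dy), "measured": slope(m1 - m2, dy)}


if __name__ == "__main__":
    print(simulate_ld_sibling_gwas())
```

In this toy setting the measured SNP shows a nonzero within-family effect even though it has no effect of its own, which is the sense in which “genes” are identified at an intermediate level of resolution.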
3.2.2. Sample homogeneity
Comparing members of the same family should allow randomization to serve another one of its chief functions: preserving sample homogeneity (Rubin, Reference Rubin1974). Randomization “guarantees, by construction,…that the [difference in means for all other causes] is zero in expectation” (Deaton & Cartwright, Reference Deaton and Cartwright2018, p. 4). In practice, sample homogeneity is a function of two factors: (1) the number of participants and (2) the number of trials (Deaton & Cartwright, Reference Deaton and Cartwright2018). GWASs are uniquely suited to address these factors. First, standard GWAS sample sizes tend to tally in the millions (e.g., Evangelou et al., Reference Evangelou, Warren, Mosen-Ansorena, Mifsud, Pazoki, Gao and Caulfield2018; Karlsson Linnér et al., Reference Karlsson Linnér, Biroli, Kong, Meddens, Wedow, Fontana and Beauchamp2019; Nielsen et al., Reference Nielsen, Thorolfsdottir, Fritsche, Zhou, Skov, Graham and Willer2018), orders of magnitude larger than typical RCTs. Although improving the sample sizes of within-family designs remains a critical aim of behavior genetics, recent studies have begun to analyze SNP effects in upward of 40,000 sibling pairs (Karlsson Linnér et al., Reference Karlsson Linnér, Mallard, Barr, Sanchez-Roige, Madole, Driver and Dick2021). Second, meiosis is essentially a series of millions of randomized trials. As the assortment of alleles at each genetic site is a random event, we should have increasing confidence that allele carriers do not differ in systematic ways as we aggregate over the genome. This makes summary indices of genetic effects, such as polygenic scores, particularly powerful tools.
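A rough sketch of this balancing property follows. It is illustrative only: the two subpopulations, their allele frequencies (0.3 and 0.7), the environmental variable, and the unweighted allele count used as a stand-in for a polygenic score are all assumptions of ours. Each locus is transmitted independently, mimicking many separate randomized trials. Between families, the aggregate score should track the environment because of stratification; within families, the sibling difference in the score should be close to uncorrelated with it.

```python
import numpy as np


def balance_check(n_families=4_000, n_loci=100, seed=1):
    """Correlation of an aggregate genotype score with family environment,
    between families and within families (sibling differences)."""
    rng = np.random.default_rng(seed)

    # Two hypothetical subpopulations that differ in allele frequency and environment.
    group = rng.integers(0, 2, size=n_families)
    freq = np.where(group == 1, 0.7, 0.3).reshape(-1, 1, 1, 1)
    env = group + rng.normal(size=n_families)

    # Parental alleles, indexed as (family, parent, allele copy, locus).
    parents = (rng.random((n_families, 2, 2, n_loci)) < freq).astype(int)

    def child():
        # At every locus, each parent transmits one of its two copies at random.
        pick = rng.integers(0, 2, size=(n_families, 2, 1, n_loci))
        return np.take_along_axis(parents, pick, axis=2)[:, :, 0, :].sum(axis=1)

    score1 = child().sum(axis=1)  # unweighted allele count across loci
    score2 = child().sum(axis=1)

    between = float(np.corrcoef(score1, env)[0, 1])
    within = float(np.corrcoef(score1 - score2, env)[0, 1])
    return between, within


if __name__ == "__main__":
    between, within = balance_check()
    print(f"between-family r = {between:.3f}, within-family r = {within:.3f}")
```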
3.2.3. Potential exposability
Potential exposability is directly related to manipulability. If the treatment is something that can be manipulated, or changed, then randomization ensures that every participant is potentially exposable to any condition (Jo & Muthén, Reference Jo, Muthén, Marcoulides and Schumacker2001). In one sense, the conditions of meiosis easily satisfy the requirement of potential exposability. Meiosis manipulates parental DNA, creating trillions of unique genotypes for an offspring to inherit (Carroll, Reference Carroll2001). The fact that meiosis satisfies genotype-level exposability suggests that, as with sample homogeneity, indices that aggregate across the genome may be particularly suited for causal inference.
3.2.4. SUTVA
Consider a family with two adolescent children, Linda and Maggie. Through meiosis, Linda and Maggie were randomly assigned their genotypes, creating parallel worlds that could be compared to examine whether their genes caused different life outcomes. In particular, Linda inherits variants in the ADH1B gene that affect her metabolism of alcohol, contributing to her refraining from alcohol use (Bierut, Reference Bierut2011). Linda's substance-use choices become part of the environment that she shares with Maggie, a factor that often serves to align substance-use habits among siblings (see Samek, McGue, Keyes, & Iacono [Reference Samek, McGue, Keyes and Iacono2015] for a review of shared environmental factors in substance use). If observing Linda decline alcohol, return home promptly before curfew, and engage in substance-free recreational activities influences Maggie's alcohol-related behavior, then SUTVA has been violated. Linda's treatment assignment – her genotype – has interfered with Maggie's potential outcome, obfuscating a causal comparison of counterfactual conditions (Rubin, Reference Rubin1980). For SUTVA to be preserved in the natural experiment of genetic inheritance, there can be no indirect sibling-to-sibling genetic effects (Eaves, Reference Eaves1976).
The surest way to safeguard against the behavioral transmission of genetic effects between siblings is to analyze data from a single offspring controlling for both of the parental genotypes. Alternatively, one could compare the potential outcomes of siblings who were not raised together (e.g., adoption studies; Plomin, DeFries, & Loehlin, Reference Plomin, DeFries and Loehlin1977). This assures that each sibling's genotype has as little influence on the other sibling's phenotype as possible. To be sure, even adoption studies cannot protect against other sources of indirect genetic effects (see Scarr & McCartney [Reference Scarr and McCartney1983] for a review), but these are more a problem of sample homogeneity than SUTVA. As an analogy, consider an RCT on a pharmacological treatment of depression. If some participants happen to read existential philosophy during their treatment, the threat is that a potential imbalance of philosophy readers across treatment groups will confound depression scores. Reading existential philosophy, however, has nothing to do with whether the depression treatment that one participant receives interacts with another participant's depressive symptoms. Non-sibling indirect genetic effects are like reading existential philosophy – they are sure to affect an offspring's outcome (footnote 4), and they might create a systematic difference in (genetically influenced) environments across allele carriers, but they do not violate SUTVA.
Evidence has begun to suggest, however, that when siblings are raised together, their respective genotypes do in fact influence their siblings' phenotypes (Fletcher, Wu, Zhao, & Lu, Reference Fletcher, Wu, Zhao and Lu2020). The presence of sibling interference need not undo causal inference entirely (Rosenbaum, Reference Rosenbaum2007). In these cases, addressing SUTVA involves determining (a) the direction of the interference and (b) the magnitude of the effect. Developmental psychologists differentiate between imitation and contrast effects – those patterns of “behavioral acquisition via social learning” that serve to either fuse or drive apart sibling behavior (Carey, Reference Carey1986, p. 320; see Dolan, de Kort, van Beijsterveldt, Bartels, & Boomsma [Reference Dolan, de Kort, van Beijsterveldt, Bartels and Boomsma2014]; Moscati, Verhulst, McKee, Silberg, & Eaves [Reference Moscati, Verhulst, McKee, Silberg and Eaves2018] for empirical demonstrations). Whether or not Linda refraining from alcohol use causes Maggie to similarly abstain or rebel into greater use depends on factors such as their relative ages (Abramovitch, Corter, & Lando, Reference Abramovitch, Corter and Lando1979) and the stage of their dyadic relationship (Carey, Reference Carey1986).
Empirically examining SUTVA involves quantifying the magnitude of imitation or contrast effects by segregating the direct causal effect from interference effects. This can be achieved through a process of triangulation (Lawlor, Tilling, & Davey Smith, Reference Lawlor, Tilling and Davey Smith2017), a leveraging of multiple data sources and unique methodological approaches to increase confidence in a causal conclusion. In the case of sibling interference, Kong et al. (Reference Kong, Thorleifsson, Frigge, Vilhjalmsson, Young, Thorgeirsson and Stefansson2018) provide a paradigmatic example: using genotype data from both siblings and parents and integrating within-sibship comparison with a traditional trio design (see Connolly & Heron [Reference Connolly and Heron2015] for a review), the researchers were able to triangulate on a direct causal estimate of genotype on outcome. By including the effect of the sibling's genotype and the uninherited portions of the parental genotypes in the model, Kong et al. (Reference Kong, Thorleifsson, Frigge, Vilhjalmsson, Young, Thorgeirsson and Stefansson2018) estimated the magnitude of the interference and effectively ensured that it, and other confounding sources, were controlled for. Adoption studies may ensure protection against SUTVA violations, but innovative methodological approaches can still rescue causal inference in the face of sibling interference.
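The logic of separating a direct effect from interference can be sketched with a toy simulation. This is a deliberately simplified stand-in of our own, not the model used by Kong et al.: it assumes a single biallelic locus with allele frequency 0.5, random mating, and a linear outcome with hypothetical coefficients for the focal child's genotype (`direct`), the sibling's genotype (`sibling`), and the summed parental genotype (`parental`). Controlling for the summed parental genotype carries the same information as controlling for the untransmitted alleles, because the parental sum equals transmitted plus untransmitted alleles.

```python
import numpy as np


def simulate_sibling_interference(n=100_000, direct=0.5, sibling=0.2, parental=0.2, seed=2):
    """Naive versus adjusted estimates of a genotype's effect on the focal sibling."""
    rng = np.random.default_rng(seed)

    # Parental alleles at one locus, indexed as (family, parent, allele copy).
    alleles = rng.integers(0, 2, size=(n, 2, 2))
    g_parents = alleles.sum(axis=(1, 2))

    def child():
        # Each parent transmits one of its two alleles at random.
        pick = rng.integers(0, 2, size=(n, 2, 1))
        return np.take_along_axis(alleles, pick, axis=2)[:, :, 0].sum(axis=1)

    g1, g2 = child(), child()

    # Outcome of sibling 1 depends on own, sibling's, and parents' genotypes.
    y1 = direct * g1 + sibling * g2 + parental * g_parents + rng.normal(size=n)

    def ols(*columns):
        X = np.column_stack([np.ones(n), *columns])
        coef, *_ = np.linalg.lstsq(X, y1, rcond=None)
        return coef[1:]  # drop the intercept

    naive = ols(g1)
    adjusted = ols(g1, g2, g_parents)
    return {
        "naive": float(naive[0]),
        "direct": float(adjusted[0]),
        "sibling": float(adjusted[1]),
        "parental": float(adjusted[2]),
    }


if __name__ == "__main__":
    print(simulate_sibling_interference())
```

In this sketch the naive slope absorbs part of the sibling and parental pathways, whereas the adjusted model returns separate estimates for each, which is one way of quantifying the magnitude of interference instead of assuming it away.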
3.3. Shallow end of genetic causation
Perhaps no outcome has been more magnetic in contemporary behavior genetics than educational attainment (EA; Martin, Reference Martin2018). The most recent GWAS of EA, published in 2018, has already been cited over 900 times (Lee et al., Reference Lee, Wedow, Okbay, Kong, Maghzian, Zacher and Cesarini2018). It has also generated a litany of passionate critiques and rebuttals (see, e.g., the blog post titled “Why We Shouldn't Embrace the Genetics of Education”; Warner, Reference Warner2018). Yet prior to 2013, EA was considered a fairly rudimentary, albeit important, covariate in GWASs (Plomin & von Stumm, Reference Plomin and von Stumm2018). Priority had been given to medical and psychiatric disease states; EA was simply a confound to rule out. As GWAS methodology began to permeate the social sciences, however, the troves of data on EA that had been accrued over the years by large-scale research consortia became invaluable. Suddenly, EA had become the most GWAS-able trait.
The first GWAS of EA detected three SNPs with significant effects in 126,559 individuals, collectively explaining 2% of its variance (Rietveld et al., Reference Rietveld, Medland, Derringer, Yang, Esko, Martin and Koellinger2013). Three years later, 74 SNPs were detected in twice as many people, explaining 4% of the variance (Okbay et al., Reference Okbay, Beauchamp, Fontana, Lee, Pers, Rietveld and Benjamin2016). By 2018, the GWAS of EA included 1.1 million individuals, detected over 1,000 significant SNP effects, and explained over 10% of the variance (Lee et al., Reference Lee, Wedow, Okbay, Kong, Maghzian, Zacher and Cesarini2018). By social science standards, that is a large and stable effect size (Funder & Ozer, Reference Funder and Ozer2019), and one that even outperforms many complex, multivariate approaches to predicting educational outcomes (Salganik et al., Reference Salganik, Lundberg, Kindel, Ahearn, Al-Ghoneim, Almaatouq and McLanahan2020). The incremental successes of the EA GWASs are undeniably impressive, but they have not been accompanied by incremental increases in causal inference. Even if the fourth iteration of the EA GWAS detected 5,000 significant SNP effects and explained 50% of the variance in EA, it alone would not move us closer to the conclusion that genes cause educational outcomes.
To be sure, we are currently in a position to conclude that genes cause EA. But this conclusion is only possible because researchers have applied summary statistics from EA GWASs to datasets that allow for counterfactual comparison. By using “within-family genetic design[s],” differences in associations between polygenic scores and educational outcomes allow for “causal inference and explanation” (Selzam et al., Reference Selzam, Ritchie, Pingault, Reynolds, O'Reilly and Plomin2019, p. 360). So when we find that “children with higher polygenic scores…move up the social ladder in terms of education, occupation, and wealth, even compared with siblings in their own family” (Belsky et al., Reference Belsky, Domingue, Wedow, Arseneault, Boardman, Caspi and Harris2018, p. E7281), the appropriate conclusion is that genes caused these differences in attainment.
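The contrast between a population-level association and a within-family comparison can be made concrete with a hypothetical simulation. All quantities here are assumptions chosen for illustration: a standardized polygenic score whose variance is split evenly between the midparent value and random segregation, a family environment that correlates with the midparent score (`confounding`), and a true direct effect (`direct`) of 0.3. The between-family slope should be inflated by the family environment, while the sibling-difference slope should sit near the direct effect.

```python
import numpy as np


def between_vs_within(n_families=20_000, direct=0.3, confounding=0.8, seed=3):
    """Population-level versus sibling-difference slopes of outcome on polygenic score."""
    rng = np.random.default_rng(seed)

    # Midparent score and a family environment that is correlated with it.
    midparent = rng.normal(scale=np.sqrt(0.5), size=n_families)
    family_env = confounding * midparent + rng.normal(size=n_families)

    def sibling():
        # Random segregation adds variation around the midparent value.
        pgs = midparent + rng.normal(scale=np.sqrt(0.5), size=n_families)
        y = direct * pgs + family_env + rng.normal(size=n_families)
        return pgs, y

    (p1, y1), (p2, y2) = sibling(), sibling()

    between = float(np.polyfit(p1, y1, 1)[0])           # one child per family
    within = float(np.polyfit(p1 - p2, y1 - y2, 1)[0])  # sibling differences
    return between, within


if __name__ == "__main__":
    between, within = between_vs_within()
    print(f"between-family slope = {between:.3f}, within-family slope = {within:.3f}")
```

Differencing siblings removes everything they share, including the family environment, so only the randomly segregating part of the score remains to predict the difference in outcomes.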
For behavior geneticists, this is undoubtedly a triumph. After years of null results and unreplicable false positives, the field can now construct measures of DNA differences that caused important life outcomes. For others, however, this statement rouses ambivalence at best, and outrage at worst. There is a vocal contingent of bloggers, journalists, and scientists who fear that GWAS of social outcomes and its associated applications “will only be fuel for those who think that social inequalities are natural and unchangeable” (Samorodnitsky, Reference Samorodnitsky2020, para. 20). Such a picture of genetic causes is unwarranted, however, when we remember what it means for something to be a cause: genetic causes for human behavioral traits are non-uniform, non-unitary, and non-explanatory.
It can be easy to neglect that genetic causes behave just like ATEs from RCTs. Prominent examples from medicine have shaped expectations that genes are of a different class of causes (Ross, Reference Ross2019). Take cystic fibrosis (CF), for example. CF is an autosomal recessive disorder present in about 70,000 individuals globally. It is caused by two mutated copies of the CFTR gene on the seventh chromosome (Cutting, Reference Cutting2015). Unlike most ATEs, this genetic cause is (a) uniform – it consistently produces the occurrence of CF across individuals, (b) unitary – it alone causes the occurrence of CF, and (c) explanatory – it provides an explanation for how CF occurs (Elborn, Reference Elborn2016; footnote 5). Together, these characteristics make CF an instance of deep genetic causation (footnote 6; see Turkheimer [Reference Turkheimer1998] on strong biologism; Meehl [Reference Meehl1972] on specific genetic etiology). Scientists gravitate toward deep causes. They are salient, simplistic, and they provide a coherent framework for the operation of a complex system such as the genome (Engel, Reference Engel1977; Kendler, Reference Kendler2005). Despite the conceptual attraction to deep causes, almost everything we have learned from GWASs points to genes as shallow causes – many variants from across the genome relate to behavioral outcomes, but when they matter and how they matter differs across people, place, and time (Ross, Reference Ross2019). The appropriate paradigm for genetic causes of human behavior is therefore not the deeply deterministic example of CF, but the local, probabilistic, and distal characteristics of ATEs.
Support for the idea that genes are non-unitary causes of behavior is so robust that it has been consecrated as one of the modern laws of behavior genetics (Chabris, Lee, Cesarini, Benjamin, & Laibson, Reference Chabris, Lee, Cesarini, Benjamin and Laibson2015). Indeed, arguably the greatest takeaway from the GWAS era has been that individual genetic variants do not produce behavioral effects on their own. This is not a trivial statement – decades of research were spent hunting for single polymorphisms (i.e., candidate genes) that would prove to have causal control in the etiology of behavioral and psychological outcomes (see Munafò [Reference Munafò2006] for an overview). Consistent failure of these findings to replicate, however, pushed behavior geneticists to develop more sophisticated models. Most believe now that the genetic architecture of complex traits is polygenic (involving thousands of variants with small effects distributed throughout the genome [Duncan, Ostacher, & Ballon, Reference Duncan, Ostacher and Ballon2019]) or even omnigenic (involving sundry genome-wide variants that affect behavior by disrupting interconnected gene regulatory networks [Boyle, Li, & Pritchard, Reference Boyle, Li and Pritchard2017]). But if single genes are not unitary causes of behavior, neither is the genome writ large. Even a model that considered every gene in the genome and its higher-order function would fail to be causally exclusive because it would fail to account for larger etiological systems such as “history and cohort, the life course, and social structures like gender” through which “genetic influence must be understood” (Herd et al., Reference Herd, Freese, Sicinski, Domingue, Mullan Harris, Wei and Hauser2019, p. 1070). Genes might cause EA, but they are certainly not the only cause of EA.
The nesting of genetic effects within biological, psychological, and social systems is what makes them local parameters. The size and shape of a particular effect will always depend on the size and shape of the other causal factors present in that instance. In theory, this suggests that genetic effects will be non-uniform. If context matters, then genetic effects should change across settings. Nevertheless, there persists “the common assumption…that genetic effects are ‘universal’ across environments” (Tropf et al., Reference Tropf, Lee, Verweij, Stulp, van der Most, de Vlaming and Mills2017, p. 758). This would imply that genetic effects are deterministic, that they will produce the same effect in the same way every time, independent of the context. Two takeaways from modern genomics suggest that this assumption is unfounded: (1) genetic effects are heterogeneous across environments and (2) genetic effects show poor generalizability.
That genetic effects vary across environments is a proposition of longstanding tenacity (gene × environment [G × E] interactions) (see Jaffee & Price [Reference Jaffee and Price2007] for a review; Feldman & Lewontin, Reference Feldman and Lewontin1975; Turkheimer & Gottesman, Reference Turkheimer and Gottesman1996). Reliably identifying such interactions, however, has historically proved difficult (Munafò & Flint, Reference Munafò and Flint2009). This is where the substantial increases in the predictive power of GWASs, similar to those seen in EA, have considerable value. Several studies have been able to provide insight into the environments that facilitate the emergence of genetic effects on EA. Genetic effects appear to increase in size when structural barriers such as gender (Herd et al., Reference Herd, Freese, Sicinski, Domingue, Mullan Harris, Wei and Hauser2019), class (Rimfeld et al., Reference Rimfeld, Krapohl, Trzaskowski, Coleman, Selzam, Dale and Plomin2018), and intergenerational mobility (Engzell & Tropf, Reference Engzell and Tropf2019) are removed. Said differently, the probability that genes matter for EA varies depending on the environmental exposures of the individual. Similar heterogeneity has been observed in genetic effects on reproductive, physical, and psychiatric outcomes (Coleman et al., Reference Coleman, Peyrot, Purves, Davis, Rayner, Choi and Breen2020; Tropf et al., Reference Tropf, Lee, Verweij, Stulp, van der Most, de Vlaming and Mills2017).
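To make the idea of a non-uniform genetic effect concrete, consider the following sketch. It is purely illustrative: the two environments, the effect sizes (0.1 where a barrier is present, 0.4 where it is removed), and the standardized score are assumptions of ours and not estimates from the studies cited above. The pooled slope in such a sample is an average over environments and describes neither setting exactly.

```python
import numpy as np


def slopes_by_environment(n=20_000, beta_restricted=0.1, beta_open=0.4, seed=4):
    """Slope of outcome on a genetic score, estimated per environment and pooled."""
    rng = np.random.default_rng(seed)

    # Hypothetical binary environment: True where the structural barrier is removed.
    open_env = rng.integers(0, 2, size=n).astype(bool)
    pgs = rng.normal(size=n)

    # The effect of the score depends on the environment (a G x E interaction).
    beta = np.where(open_env, beta_open, beta_restricted)
    y = beta * pgs + rng.normal(size=n)

    def slope(mask):
        return float(np.polyfit(pgs[mask], y[mask], 1)[0])

    return {
        "restricted": slope(~open_env),
        "open": slope(open_env),
        "pooled": slope(np.ones(n, dtype=bool)),
    }


if __name__ == "__main__":
    print(slopes_by_environment())
```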
The disparity in genetic effects across environments is further corroborated by the fact that GWAS findings have largely failed to be applicable outside of discovery samples. That means that the genes that predict an outcome in one sample fare poorly in predicting the same outcome in a separate sample. Not only is this the case when testing predictive accuracy in diverse populations (Martin et al., Reference Martin, Gignoux, Walters, Wojcik, Neale, Gravel and Kenny2017), but inconsistent accuracy has also been found when looking within demographic subgroups (i.e., age, gender, socioeconomic status) of the same ancestry (Mostafavi et al., Reference Mostafavi, Harpak, Agarwal, Conley, Pritchard and Przeworski2020). Moreover, this failure of genetic effects to port across samples was found even when phenotypes were measured consistently across samples. Variation in the measurement or applicability of a phenotype across populations, for example when “educational attainment emphasizes rote memorization or formal writing… [rather than] experiential learning,” is likely to be another source of restricted generalizability (Meyer, Turley, & Benjamin, Reference Meyer, Turley and Benjamin2020, para. 5). Collectively, this suggests that while genes cause EA, this is neither a singular nor a generic claim (Cartwright, Reference Cartwright, Skyrms and Harper1988). We know neither that genes are the cause of EA for a specific individual nor that genes are the cause of EA for all people across place and time.
Genes, however, do have generic functions (Dawkins, Reference Dawkins1982/2016). Every gene produces biochemical material for cellular encoding, and the specific set of instructions governed by a particular gene is consistent across person, place, and time (Schaefer & Thompson, Reference Schaefer and Thompson2014). This would indicate that the first step in the causal pathway from genes to behavior is uniform. To ultimately arrive at non-uniform effects on behavior, there must therefore be subsequent points along this causal pathway where people diverge. Indeed, we know already that this divergence begins almost immediately after gene function. Even processes such as gene expression and gene regulation show substantial heterogeneity across environments (Bork et al., Reference Bork, Dandekar, Diaz-Lazcoz, Eisenhaber, Huynen and Yuan1998), and this tends to be more the rule than the exception as the pathway winds through biological (Gough et al., Reference Gough, Stern, Maier, Lezon, Shun, Chennubhotla and Taylor2017), psychological (Molenaar, Reference Molenaar2004), and sociological systems (Scott, Reference Scott1988). Each point of heterogeneity demarcates a garden of forking paths (Borges, Reference Borges1941/2018; Gelman & Loken, Reference Gelman and Loken2014), a splintering of a uniform stream of processes into separable causal pathways. The staggering amount of heterogeneity that exists in the processes that extend from genes to behavior tells us that there are a potentially untraceable number of causal pathways. Identifying a genetic cause provides no insight into which causal pathway ultimately produced the behavior because genetic causes are not mechanisms. Genes might cause EA in the sense that genes made some distal difference in the level of attainment, but not in the sense that they provide an explanation for how this difference was made.
This distinction between causes and mechanisms often gets lost when applied to the relationship between biology and behavior (Thomas & Sharp, Reference Thomas and Sharp2019). As Gregory Miller writes, “[r]elevant science abounds with demonstrations that…imply causal relationships between psychology and biology…yet we often write as if we know the mechanisms” (Miller, Reference Miller2010, p. 717). Despite the implicit assumption that biology reveals something inherently mechanistic, there is nothing that necessitates that biological causes need to be mechanistic nor that mechanisms need to be biological. In plainest terms, mechanisms explain an effect. Even putatively biological associations, such as genetic effects on lung cancer, might be largely explained via social processes, such as access to cigarettes (Kendler et al., Reference Kendler, Chen, Dick, Maes, Gillespie, Neale and Riley2012). Further, whether a biological cause actually provides explanatory insight is a matter of circumstance. Consider the example from Turkheimer (Reference Turkheimer1998) on the origins of vocal muteness for two individuals, one who has suffered a stroke in Broca's area of the brain and the other who has taken a religious vow of silence:
It seems natural to describe the stroke patient's muteness as biological and the monk's as psychological. What do these attributions mean? It is not simply that the aphasia is ‘in’ the brain, because the monk's decision presumably resides there also. Instead, the difference involves the nature of the structural relationship between a neurological representation of the condition and a psychological account of it (p. 783).
In other words, the key difference is that only the aphasia patient's identified biological cause is mechanistic: it describes how a localized lesion compromises the neural areas that support linguistic functioning – an outcome that would also happen to be invariant across time and place. Likewise, biology factors into the monk's muteness (see, e.g., research on the neural networks supporting religiosity; Kapogiannis et al., Reference Kapogiannis, Barbey, Su, Zamboni, Krueger and Grafman2009). While this biology may prove useful in forecasting future spiritually-based muteness in other individuals using predictive modeling (Shmueli, Reference Shmueli2010; Yarkoni & Westfall, Reference Yarkoni and Westfall2017), it would be insufficient to explain this monk's vow of silence. The mechanism behind the monk's silence is psychological – it is his decision that generates and explains his outcome.
3.4. Using genetic causes to advance second-generation causal knowledge
In this paper, we have argued for the idea that certain types of genetic effects (i.e., contingent on parental genotypes) constitute first-generation causal knowledge. Similar to ATEs, genes are causal in the sense that “differences in [genotype]…cause phenotypic differences in particular genetic and environmental contexts” (Waters, Reference Waters2007, p. 558). Unlike with many ATEs, however, this information cannot be used to manipulate the causal variable on a population scale. This would seem to limit the applied value of identifying and conceptualizing genes – or any other immutable cause – as average difference-makers. But, as the case of lithium treatment showed us, there is no reason to restrict first-generation causal knowledge to this singular application. Similar to all first-generation causes, genetic effects contingent on parental genotypes represent causal pathways that can be explored to advance second-generation aims.
If we accept the conclusion that genetic variants make an average difference in psychological and behavioral outcomes, then we can begin to embrace the trove of potential scientific discoveries lying along this causally distal pathway. To start, we can improve phenotypic understanding by exploring mediating processes. Traditionally, researchers have used GWAS results to gain deeper insights into the biology of complex behavioral outcomes (Dick et al., Reference Dick, Barr, Cho, Cooke, Kuo, Lewis and Su2018). Approaches such as bioinformatics annotation make it possible to locate specific cells, tissues, and organs where relevant genetic variants are expressed (Watanabe, Mirkov, de Leeuw, van den Heuvel, & Posthuma, Reference Watanabe, Mirkov, de Leeuw, van den Heuvel and Posthuma2019). In pathway analysis, genetic variants are clustered by functional relatedness and used to assess whether candidate biological functions are implicated in disease etiology (White et al., Reference White, Yaspan, Veatch, Goddard, Risse-Adams and Contreras2019). Collectively, these techniques serve to “increase explanatory power” by specifying the “parameters of the nervous system [that] are aberrant as a result close in the causal chain to the gene or genes” (Khatri, Sirota, & Butte, Reference Khatri, Sirota and Butte2012, p. 1; Meehl, Reference Meehl1972, p. 11).
Inspired by this method of biological discovery, researchers have called for an approach that maps genotypes to multifarious aspects of the social environment (phenotypic annotation; Belsky & Harden, Reference Belsky and Harden2019). By associating polygenic scores for one phenotype with related phenotypes at different stages across the lifespan, we can detail potential behavioral and developmental pathways through which target phenotypes emerge. Already this work has provided considerable insight into how genetic risk for adult outcomes such as body mass index, smoking, educational attainment, and attention-deficit/hyperactivity disorder manifests in childhood and adolescence (Agnew-Blais et al., Reference Agnew-Blais, Belsky, Caspi, Danese, Moffitt, Polanczyk and Arseneault2021; Belsky et al., Reference Belsky, Moffitt, Houts, Bennett, Biddle, Blumenthal and Caspi2012, Reference Belsky, Moffitt, Baker, Biddle, Evans, Harrington and Caspi2013a, Reference Belsky, Moffitt and Caspi2013b, Reference Belsky, Moffitt, Corcoran, Domingue, Harrington, Hogan and Caspi2016). Uncovering more about the biological and behavioral intermediaries bridging genes and behavior improves our ability to develop integrated causal models of complex behavioral phenomena.
As our understanding of the causal structure of psychological and behavioral phenotypes deepens, our discovery of potential prevention and intervention targets improves (Dick, Reference Dick2018). Indeed, each process that we find mediates cause and effect represents a candidate for intervention, even if the original cause itself is immutable. Consider again the example of lithium administration: researchers localized the differential pattern of neuronal signaling in lithium responders and non-responders to the expression of a single gene (LEF1) (Santos et al., Reference Santos, Linker, Stern, Mendes, Shokhirev, Erikson and Gage2021). Rather than structurally alter the gene, these researchers explored its downstream biological consequences (e.g., transcription pathways), thereby identifying “useful phenotypes for drug development” (Santos et al., Reference Santos, Linker, Stern, Mendes, Shokhirev, Erikson and Gage2021, p. 12). In these cases, the relevant question is not just whether the cause itself is manipulatable, but (a) which of the mediating processes are manipulatable and (b) which processes' manipulation will generate a meaningful effect on the outcome. Knowing that LEF1 causes differences in brain signatures characteristic of lithium response allows us to identify mechanistic processes that could be pushed upon to improve treatment responsivity.
The same should be true of behavioral and health-related outcomes. Understanding how genetic factors unfold along biological and behavioral pathways across development allows us to isolate intermediate processes that represent (a) prognostic markers of future outcomes and (b) targets for programmatic manipulation that may serve to close the gap in health disparities (Belsky, Moffitt, & Caspi, Reference Belsky, Moffitt and Caspi2013b). Behavioral genetics is beginning to turn toward these applications, and research on body mass index (BMI) provides a ready example. Large-scale phenotypic annotation efforts have begun to link genetic variants associated with adult BMI to eating behaviors in childhood and adolescence (Abdulkadir et al., Reference Abdulkadir, Herle, De Stavola, Hübel, Santos Ferreira, Loos and Micali2020; Herle et al., Reference Herle, Abdulkadir, Hübel, Ferreira, Bryant-Waugh, Loos and Micali2021a, Reference Herle, Pickles and de Stavola2021b). These studies have found that, by as early as age 2, a child's eating behavior may demarcate genetic risk for adult BMI. This suggests that eating habits, and possibly related health behaviors, may represent malleable outcomes through which we can mitigate the influence of genetic differences on BMI. Preliminary evidence supports this claim. Correlational research has shown that genetic effects on adult BMI are larger in individuals who live sedentary lifestyles and consume more sweetened beverages (Li et al., Reference Li, Zhao, Luan, Ekelund, Luben, Khaw and Loos2010; Qi et al., Reference Qi, Chu, Kang, Jensen, Curhan, Pasquale and Qi2012). Early experimental findings point toward physical activity at age 11 as a modifiable behavior for attenuating the association between genes and BMI (Herle, Pickles, & de Stavola, Reference Herle, Pickles and de Stavola2021b).
Still, we know that not all 11-year-olds will respond equally to a behavioral intervention. This was one of the main takeaways from the HPPP that we reviewed at the beginning of this paper. Simply being exposed to an intervention does not determine how a given person will respond, for how long the effect will last, or whether it will generalize to related behaviors across development (Bailey et al., Reference Bailey, Duncan, Cunha, Foorman and Yeager2020; Bryan et al., Reference Bryan, Tipton and Yeager2021; Green, Reference Green2021). To improve the efficacy and reach of our treatments, we need to understand the sources of individual differences in their outcomes. We need to “be concerned with the otherwise neglected interactions between organismic and treatment variables” (Cronbach, Reference Cronbach1957, p. 681).
Genetic causes can help. By integrating genomic data into longitudinal, experimental research designs, we can begin to answer causal questions about heterogeneity in treatment effects and about the mechanisms generating the fadeout, persistence, and emergence of those effects later in life. A growing body of work in this area has demonstrated that responses to childhood interventions such as HPPP are sensitive to genetic variation (Albert et al., 2015; Brody et al., 2009, 2013; Kuo et al., 2019). Using the framework for genetic causation that we have advanced in this paper, we can develop a more robust and comprehensive understanding of how individual differences in constitutional factors influence treatment outcomes. In particular, we can integrate whole-genome measures from family members into two-shock designs, which yield an estimate of the interaction of two random sources of variation to provide special insight into the (biological and environmental) contexts in which a particular cause operates (Almond, Currie, & Duque, 2018). These designs may critically advance our understanding of why particular individuals are more or less likely to respond to treatments and why particular treatment effects are more or less enduring or generalizable.
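The logic of a two-shock design can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not an analysis from any of the studies cited: the function names, the effect sizes, and the use of a single polygenic score are all hypothetical. It assumes that a child's score equals the parental mean plus a random segregation deviation (the first shock), that treatment is assigned by coin flip (the second shock), and that the family environment is correlated with parental genotype. Conditioning on the parental mean then leaves only the random part of the child's genotype, so the gene-by-treatment coefficient estimates the interaction of the two random sources of variation.

```python
import numpy as np


def simulate_two_shock(n=20000, b_gene=0.3, b_treat=0.5, b_inter=0.2, seed=0):
    """Simulate hypothetical data with two random 'shocks':
    Mendelian segregation and randomized treatment assignment."""
    rng = np.random.default_rng(seed)
    # Parental mean polygenic score (not randomized; may be confounded).
    midparent = rng.normal(size=n)
    # Shock 1: random deviation of the child from the parental mean.
    segregation = rng.normal(scale=np.sqrt(0.5), size=n)
    child_pgs = midparent + segregation
    # Shock 2: treatment assigned by coin flip.
    treated = rng.integers(0, 2, size=n)
    # Assumed confound: family environment tracks parental genotype.
    family_env = 0.4 * midparent + rng.normal(size=n)
    outcome = (
        b_gene * child_pgs
        + b_treat * treated
        + b_inter * child_pgs * treated
        + family_env
        + rng.normal(size=n)
    )
    return midparent, child_pgs, treated, outcome


def fit_two_shock(midparent, child_pgs, treated, outcome):
    """Ordinary least squares, conditioning on the parental mean score."""
    design = np.column_stack(
        [
            np.ones_like(outcome),  # intercept
            child_pgs,              # genetic effect among the untreated
            treated,                # treatment effect at a score of zero
            child_pgs * treated,    # gene-by-treatment interaction
            midparent,              # control for parental genotype
        ]
    )
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    names = ["intercept", "gene", "treatment", "gene_x_treatment", "midparent"]
    return dict(zip(names, coef))


if __name__ == "__main__":
    estimates = fit_two_shock(*simulate_two_shock())
    for name, value in estimates.items():
        print(f"{name}: {value:.3f}")
```

In this simulated setting, dropping the parental-mean column from the regression would let the family environment bias the genetic main effect, which is the reason the within-family control is included. Real applications would also need to address measurement error in the score, sample selection, and the other assumptions discussed earlier in this section.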
3.5. Summary
In this section, we considered interpretations of the prevailing statistical parameters used in behavior genetics: h² and SNP effects. We argued that the randomization of offspring to genotype in meiosis generates a natural experiment, but that genetic effects on behavior can only be considered causal when other counterfactual conditions are met. The experimental assumptions of independence, sample homogeneity, potential exposability, and SUTVA were discussed with respect to genetic causation. The takeaway was that within-family designs that leverage the natural experiment of genetic inheritance are best suited for causal inference. Guidelines for conceptualizing genetic causes were examined with respect to a dimension of causal depth: deep causes, which are unitary, uniform, and explanatory, and shallow causes, which are local, probabilistic, and causally distal. We discussed how the knowledge of genetic causes as advanced in this paper can be applied to advance second-generation aims: genomic data can improve our understanding of the etiology of complex psychological and behavioral outcomes, can facilitate the discovery of intervention and prevention targets for health-related outcomes, and can provide insight into individual differences in treatment responsivity, fadeout, and emergence.
4. Conclusions
Our motive for writing this paper was to grapple with the conceptual issues that have marked the history of behavioral genetics. To guide our discussion, we turned to philosophical and statistical thinking on the parameters for detecting and interpreting counterfactual causes. We compared the infrastructure of genetic inheritance to that of an RCT, and concluded that genetic effects conditional on the parental genotype are causal in the same sense as ATEs. To conclude, we provided some suggestions for how this knowledge can be used to facilitate scientific inquiry and maximize treatment outcomes.
Doubtless, many will take issue with the conclusions that we have drawn and the solutions that we have offered. Such responses are understandable in a field that is so richly complex and so wildly divisive. We welcome all work that earnestly engages with, challenges, questions, or explores the ideas that we have presented in this paper, insofar as it continues to think cautiously and judiciously about the meaning and applications of genomic research. Knowledge of genetic effects on human behavior will only continue to grow over the next several decades. We must chart the course for how we interpret and use this knowledge. As Dov Fox wrote, “[t]here is nothing especially menacing about knowledge on its own…[a]wareness or understanding of some subject can be troubling only when those facts are sought for bad reasons, or when such data are put to bad effects” (Fox, 2019, p. 155).
Financial support
This work was supported by grants from the Jacobs Foundation and the John Templeton Foundation. K.P.H. is a Faculty Research Associate of the Population Research Center at the University of Texas at Austin, which is supported by grant P2CHD042849 from the NICHD.
Competing interest
None.