Dr Nick Martin is one of the most prolific and influential behavioral geneticists in the world, who has also been a key motivator, teacher and role model for his students, including ourselves. Over the years, we have greatly benefitted from Nick’s wonderful teaching, very often demonstrating how theory can be applied in practice to investigate interesting and important scientific questions, and providing a much-needed historical perspective on the latest developments in our fast-moving field. It is therefore our great honor and privilege to review one of Nick’s earliest papers, in celebration of his 70th birthday.
The paper, ‘The power of the classical twin study’ (Martin et al., 1978), was based on work from Nick’s PhD thesis (Martin, 1976), completed in the Department of Genetics at the University of Birmingham. It was in this department that the field of biometrical genetics (Evans et al., 2002; Mather & Jinks, 1982) was established by pioneers who included Kenneth Mather, John Jinks, David Fulker and Lindon Eaves (Nick’s PhD supervisor). The principles of biometrical genetics, as compared to other contemporary approaches to the analysis of family data, were laid down in a seminal paper from that department (Jinks & Fulker, 1970).
While the aim of biometrical genetics was to partition the sources of individual differences in the population according to various genetic and environmental sources of variation, Jinks and Fulker recognized that the ability to untangle different sources of variation from one another requires certain minimal experimental conditions — the ‘minimum data.’ For example, an analysis of variance for monozygotic (MZ) twins reared apart would yield two summary statistics, the between-group and the within-group mean-squares, which, when equated to the theoretical expected mean squares under the classical quantitative genetic model, would provide estimates for the total genetic and the total environmental variances. However, this analysis would not be able to separate out additive genetic effects from those of genetic dominance, nor could it distinguish the familial environment shared by siblings reared together from environmental influences unique to each sibling. A study that included a wider variety of relationships would provide more summary statistics, which would enable more sources of variation to be jointly estimated from the data.
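The MZ-reared-apart example above can be made concrete with a minimal Python sketch. It assumes the standard one-way analysis of variance for pairs, in which the expected between-pair mean square is V_E + 2V_G and the expected within-pair mean square is V_E when separated MZ twins share only their genes. The function names and the toy data are hypothetical and serve only as illustration.

```python
def pair_mean_squares(pairs):
    """Between-pair and within-pair mean squares for a list of (twin1, twin2) scores."""
    n = len(pairs)
    grand_mean = sum(a + b for a, b in pairs) / (2 * n)
    # Between pairs: 2 * squared deviation of each pair mean, on n - 1 df.
    ms_between = sum(2 * ((a + b) / 2 - grand_mean) ** 2 for a, b in pairs) / (n - 1)
    # Within pairs: (a - b)^2 / 2 is the within-pair sum of squares, on n df.
    ms_within = sum((a - b) ** 2 / 2 for a, b in pairs) / n
    return ms_between, ms_within


def mza_variance_components(pairs):
    """Equate observed to expected mean squares (assumed MZ twins reared apart).

    E[MS_between] = V_E + 2 * V_G and E[MS_within] = V_E, so two statistics
    yield exactly two estimates: total genetic and total environmental variance.
    """
    ms_between, ms_within = pair_mean_squares(pairs)
    return {"V_G": (ms_between - ms_within) / 2, "V_E": ms_within}


if __name__ == "__main__":
    toy_pairs = [(1.0, 3.0), (5.0, 7.0), (9.0, 11.0)]  # made-up scores
    print(mza_variance_components(toy_pairs))
```

With only two mean squares available, nothing is left over to split V_G into additive and dominance parts, which is the ‘minimum data’ limitation described above.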
Martin et al. (1978) recognized that even when an experimental design could provide the ‘minimum data’ for resolving certain sources of variation, the probability of achieving this in practice would still depend on having a sufficient sample size. To quote, ‘If the power of a study to detect a given effect is low and in fact we do not find evidence for the effect in our sample then we should be foolish to infer that the effect is not present in the population’ (p. 99). This remark is equivalent to the ever-valid saying ‘Absence of evidence is not evidence of absence.’ They pointed out that theoretical power calculations in the literature at the time dealt with ‘human experimental designs which are seldom (if ever) used’ but not ‘the classical twin design, the most common design in human biometrical genetics’ (p. 99).
The paper then went on to describe an analytical approach to perform a power calculation for the classical twin design. The method involved calculating the expected values of the observed mean squares under the specified parameter values of a true model, and then equating these to the theoretical expected mean squares under a false model to estimate the parameters of the false model (using iterative weighted least squares). By substituting the expected mean squares under the true model as the observed mean squares of a goodness-of-fit chi-square test statistic for the false model, they obtained the noncentrality parameter of the (typically chi-squared) distribution of the test statistic. This enabled them to calculate the approximate power of the test for any desired significance level. Because the noncentrality parameter is proportional to sample size, the results can be easily extrapolated to calculate the power for any sample size, and to calculate the required sample size for any desired power. The accuracy of the power estimates obtained from the noncentral chi-squared distribution was shown to be acceptable by simulation for a range of parameter values and sample sizes. Using this method, it was shown that 600 twin pairs were required to reject most false models and that an optimal proportion of monozygotic (MZ) and dizygotic (DZ) twin pairs under most true models was between $\frac{1}{3}$ and $\frac{1}{2}$. The paper ended with a section on the power of detecting nonadditive and directional effects, with three subsections: (1) G × E interaction, by regressing pair variances on pair means, (2) directional dominance, by testing the phenotypic distribution for skewness and (3) directional allele frequency differences, again by testing the phenotypic distribution for skewness.
The scoring of many behavioral and psychological tests often results in non-normal distributions of sum or factor scores, which can bias all three of these tests, but they still have potential for other variables such as neuroimaging measures, whose distributions accord better with the central limit theorem.
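The logic of the noncentral chi-squared power calculation described above can be sketched in Python. This is not the procedure of the 1978 paper: rather than weighted least squares on mean squares, the sketch uses the maximum likelihood discrepancy between expected covariance matrices, and it considers only one comparison (a true ACE model against a false CE model, a 1-df test). It also assumes unit total variance and ignores any boundary constraint on the additive genetic variance. All function names are hypothetical.

```python
import math
from statistics import NormalDist

_Z = NormalDist()


def ncp_drop_a(a2, c2, n_mz, n_dz):
    """Noncentrality when a false CE model is fitted and ACE is true.

    Assumes unit total variance and a2 + c2 < 1, so the expected twin
    correlations are rho_MZ = a2 + c2 and rho_DZ = a2 / 2 + c2.
    """
    rho = {"mz": a2 + c2, "dz": 0.5 * a2 + c2}
    n = {"mz": n_mz, "dz": n_dz}
    # Under CE both groups share one correlation; its ML fit to the expected
    # matrices is the pair-weighted mean of the two true correlations.
    r = (n_mz * rho["mz"] + n_dz * rho["dz"]) / (n_mz + n_dz)
    ncp = 0.0
    for g in ("mz", "dz"):
        # ML discrepancy: ln|S_false| - ln|S_true| + tr(S_true S_false^-1) - 2
        ncp += n[g] * (math.log(1 - r * r) - math.log(1 - rho[g] ** 2)
                       + 2 * (1 - r * rho[g]) / (1 - r * r) - 2)
    return ncp


def power_1df(ncp, alpha=0.05):
    """Power of a 1-df chi-squared test with the given noncentrality."""
    crit = _Z.inv_cdf(1 - alpha / 2)  # square root of the central cutoff
    shift = math.sqrt(ncp)
    # A noncentral chi-squared(1) variate is distributed as (Z + sqrt(ncp))^2.
    return _Z.cdf(shift - crit) + _Z.cdf(-shift - crit)


def pairs_needed(a2, c2, prop_mz=0.5, power=0.8, alpha=0.05):
    """Total pairs for the target power, using ncp proportional to sample size.

    Requires a2 > 0; otherwise the per-pair noncentrality is zero.
    """
    ncp_per_pair = ncp_drop_a(a2, c2, prop_mz, 1 - prop_mz)
    lo, hi = 0.0, 1000.0
    for _ in range(200):  # bisection for the noncentrality that gives the power
        mid = (lo + hi) / 2
        if power_1df(mid, alpha) < power:
            lo = mid
        else:
            hi = mid
    return math.ceil(hi / ncp_per_pair)


if __name__ == "__main__":
    lam = ncp_drop_a(a2=0.5, c2=0.2, n_mz=300, n_dz=300)
    print("ncp:", lam, "power:", power_1df(lam))
    print("pairs for 80% power:", pairs_needed(a2=0.5, c2=0.2))
```

Because the noncentrality is computed once per pair and then scaled, changing `prop_mz` in `pairs_needed` gives a quick way to explore how the MZ:DZ ratio affects the required sample size under these assumptions.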
Two other papers from Nick and colleagues published at around the same time (Eaves et al., 1978; Martin & Eaves, 1977) were extremely influential in clarifying the properties of existing analytic approaches to family studies that use raw data, correlations or mean squares as the starting point. They also introduced the use of covariance matrices as an alternative and integrated factor analysis methodology into biometrical genetic analysis. These two papers, together with Martin et al. (1978), laid much of the foundation for the later developments in human behavior genetics, including the establishment of large twin registries and the development of modern maximum likelihood approaches for model estimation and testing that enabled the extension of the classical twin model to threshold traits, multiple phenotypes and extended twin-families (Neale & Cardon, 1992).
Power calculation has remained an important issue in human genetics research. Subsequent papers to Martin et al. (1978) have considered the power of new study designs including threshold traits (Neale et al., 1994), multivariate phenotypes (Schmitz et al., 1998) and extended twin designs (Posthuma & Boomsma, 2000). As the field moved to include molecular data for gene mapping, analytic power calculations were developed for quantitative trait linkage and association analyses under the variance components model, also using the noncentral chi-squared distribution (Nance & Neale, 1989; Purcell et al., 2003; Sham et al., 2000). In the genome-wide association studies (GWAS) era, the variance components model has been applied to estimate the heritability attributable to common single-nucleotide polymorphisms (SNPs), and the power of this approach has also been characterized (Visscher et al., 2014).
Where the noncentral chi-squared distribution is a poor approximation of the sampling distribution of the test, simulation-based approaches to power calculation can be used. Of course, all power calculations are effectively simulations, where expected values of statistics such as mean squares or covariances are generated from known values of the parameters of the model in question. Fitting models to summary statistics in this way is very efficient because only two models need to be fitted to the data: the true one and a submodel where one or more of the parameters have been fixed to zero. An extension of this method is to generate raw data and to fit the true and the false models to them. This approach is more flexible because it allows datasets with many patterns of missing observations to be handled with ease. Similarly, models with definition variables can be tested with this approach. A key consideration here is whether to generate data that exactly conform to the covariance matrix and means used to simulate them (e.g., using the argument empirical = TRUE in the mvrnorm() routine in the MASS R library; Venables & Ripley, 2002). Doing so reduces the computational burden to simulating and fitting models to only one raw dataset. At other times, permitting stochastic variation in the generation of datasets can be useful, particularly when the statistics used to evaluate model fit do not conform to, for example, the noncentral chi-squared distribution. The multivariate ACE Cholesky is one example. Here, large numbers of trials of simulating and fitting are needed to characterize the distribution of the trials’ test statistics. Having done so, it is then possible to evaluate the probability of observing effect sizes that exceed a particular threshold on the empirical distribution of the test statistics.
This procedure aligns closely with using the bootstrap likelihood ratio test (BLRT; Boker et al., 2020).
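The two data-generation strategies discussed above can be contrasted in a small Python sketch. It is an illustration under strong simplifying assumptions, not a twin-model fit. Pairs have unit variances. The option `exact=True` imitates the idea behind empirical = TRUE by forcing the sample moments to match their targets. A Fisher z-test of the MZ and DZ correlations stands in for comparing ACE and CE models on raw data. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def _standardize(v):
    """Centre v and scale it to unit variance (divisor n)."""
    m = fmean(v)
    sd = math.sqrt(fmean([(x - m) ** 2 for x in v]))
    return [(x - m) / sd for x in v]


def simulate_pairs(n, rho, rng, exact=False):
    """Draw n twin pairs with correlation rho; exact=True fixes sample moments."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    e = [rng.gauss(0, 1) for _ in range(n)]
    if exact:
        x = _standardize(x)
        e = _standardize(e)
        b = fmean([xi * ei for xi, ei in zip(x, e)])  # slope of e on x
        # Remove the part of e that is correlated with x in this sample.
        e = _standardize([ei - b * xi for xi, ei in zip(x, e)])
    y = [rho * xi + math.sqrt(1 - rho * rho) * ei for xi, ei in zip(x, e)]
    return x, y


def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    sx, sy = _standardize(x), _standardize(y)
    return fmean([a * b for a, b in zip(sx, sy)])


def z_squared(r_mz, n_mz, r_dz, n_dz):
    """Squared Fisher z statistic for r_MZ = r_DZ (about chi-squared, 1 df, under the null)."""
    se = math.sqrt(1 / (n_mz - 3) + 1 / (n_dz - 3))
    return ((math.atanh(r_mz) - math.atanh(r_dz)) / se) ** 2


def empirical_power(n_mz, n_dz, a2, c2, reps=500, alpha=0.05, seed=1):
    """Share of stochastic replicates in which the statistic exceeds the cutoff."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    hits = 0
    for _ in range(reps):
        r_mz = correlation(*simulate_pairs(n_mz, a2 + c2, rng))
        r_dz = correlation(*simulate_pairs(n_dz, 0.5 * a2 + c2, rng))
        hits += z_squared(r_mz, n_mz, r_dz, n_dz) > crit
    return hits / reps


if __name__ == "__main__":
    rng = random.Random(0)
    x, y = simulate_pairs(300, 0.7, rng, exact=True)
    print("exact-moment sample correlation:", correlation(x, y))
    print("empirical power:", empirical_power(300, 300, a2=0.5, c2=0.2))
```

With `exact=True` a single dataset reproduces the expected statistic, so one model fit suffices. The stochastic route needs many replicates but makes no assumption about the sampling distribution of the statistic beyond the chosen critical value.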
The seminal paper of Martin et al. (1978) on the power of the classical twin design was revisited by Visscher (2004), who calculated power via the standard errors of the variance components and the expected values of the maximum likelihood ratio test statistics. His results are largely comparable to those of Martin et al. (1978), with the major difference being that the consideration of likelihood ratio statistics enabled a specific parameter in a model to be tested (e.g., the additive genetic effects within a full model that also contains shared sibship environment and individual-specific environment), rather than the entire model. It should be noted that such calculations are not limited to estimating the power to detect nonzero heritability. Estimates of the power to detect other variance components, particularly that due to the shared or family environment (Visscher et al., 2008), are also of great utility.
An important further consideration in statistical power is whether variance components should be bounded to be greater than zero. If a model’s variance components are estimated directly and without bounds, they may return nonsensical negative values. However, this difficulty in interpretability is counterbalanced by the good statistical properties of the noncentrality parameter, which can be interpreted without transformation. Of note, recent work by Verhulst et al. (2019) shows that models that do impose bounds can deliver very poor estimates of statistical power, unless challenging mathematical transformations are carried out (Wu & Neale, 2013). Some prior discrepancies between the articles emerge as a result of model constraints; for example, estimating the path coefficient from genotype to phenotype versus estimating the variance component VA and allowing this quantity to be negative. This issue becomes much more serious in power calculations for multivariate genetic models. That Nick’s 1978 paper and thesis have led to new studies on the topic over 40 years later is a great tribute to his ability to produce useful science that stands the test of time.
By highlighting statistical power considerations, Nick calls to mind Ronald Fisher (1938), who in his Presidential Address to the First Indian Statistical Congress said ‘to consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of’ (p. 17). The power calculations of Martin and colleagues are exactly the kind of prospective treatments that can prevent horribly underpowered research studies from being carried out. While meta-analysis can overcome some shortcomings of studies that involve too few subjects, most researchers would prefer to have results from adequately powered studies that can contribute substantively either alone or in aggregate with others. Power calculations can take much of the guesswork out of research planning. In some cases, logistical considerations place additional constraints on the maximum sample size that can practically be collected. Power calculations remain useful here, at the very least to avoid proceeding with a study where all the findings will be equivocal and difficult to validate. The International Methodology Workshops taught in Europe and Colorado continue to teach methodology for statistical power calculations for exactly this reason, which is but one reflection of an enduring contribution by Dr Martin.
As a pioneer of the fields of biometrical and behavioral genetics, Nick has shared knowledge, insights and perspectives that have benefitted entire generations of researchers in behavioral genetics who have attended the annual ‘twin workshops,’ often multiple times. We were fortunate to become faculty members of the workshop and so to experience more directly Nick’s enthusiasm and intellectual curiosity, which greatly facilitated the sharing of ideas and lively debates, not only among faculty members but also with the students. These debates and discussions are what have made the workshops so enjoyable, and they have often led to new and fruitful research directions. On the occasion of Nick’s 70th birthday, we express our appreciation and gratitude to him, glance backward to what we have achieved and look forward to working together to extend the frontiers of the field.