Introduction
Human leukocyte antigens
When performing a solid organ transplant, it is important not only to procure a high-quality organ (Massie et al., 2014), but also to ensure the compatibility of the donor and recipient (Doxiadis, 2012). Recent methods for organ allocation match on the eplets that comprise the human leukocyte antigens (HLA), and these require high-resolution typing of both individuals (Kosmoliaptsis et al., 2008). This level of typing is typically only available for transplants that absolutely require a perfect match, but is too slow and expensive for transplants from deceased donors. To bridge the gap between the low- and mid-resolution typing that is available and the high resolution that is required, statistical methods are used to estimate the high-resolution typing from the low using programs like HLA Matchmaker (Duquesnoy and Askar, 2007).
These programs use a set of reference panels to estimate the high-resolution HLA from a patient’s low-resolution HLA and self-reported ethnicity or race. Ethnic and racial groupings are inadequate proxies for genetic information on their own (Bamshad et al., 2004), but do provide some insight in this setting. Estimates are typically derived from a single large reference panel for that ethnicity. Through an application of Bayesian estimation (explained in the Bayesian Estimation subsection), it is also reasonable to combine a small sample from a group very similar to the patient with a less precise, but larger, sample. The seven currently recognized HLA loci (genetic locations with a recognizable function) are given the alphanumeric labels A, B, C, DRB1, DRB3/4/5, DQB1, and DPB1. However, there are situations in which only three or four of these loci are available for clinical or budgetary reasons. In the three-locus setting, the loci used are typically A, B, and DRB1, and the fourth, when added, is typically DQB1 (Bekbolsynov, 2018).
The primary feature of the human genome that makes this estimation possible is linkage disequilibrium (LD), whereby certain alleles (the genetic information at a given locus) co-occur with others more often than would be expected if loci were independent. As such, it seems reasonable to suspect that this LD could be shared across ethnicities despite different distributions of the underlying genetic material. To this end, Bayesian estimation can be applied using the allele frequencies (AF) of the small sample and the copula (the relationship among alleles across loci) of the large sample instead of the full reference distribution.
Given a sufficiently large reference panel, one can directly estimate the distribution of high-resolution haplotypes (alleles at a set of loci) given a low-resolution haplotype by conditioning on the low resolution: $ P\left(\mathrm{high}|\mathrm{low}\right)=P\left(\mathrm{high}\right)/P\left(\mathrm{low}\right) $, which holds because a high-resolution haplotype determines its low-resolution equivalent, so $ P\left(\mathrm{high},\mathrm{low}\right)=P\left(\mathrm{high}\right) $ for any consistent pair. However, there is always a trade-off between statistical bias and variance. A large reference panel reduces the uncertainty about that panel, and thus the variance of its estimates. Unfortunately, no patient is the average member of that panel, so larger panels provide biased estimates for that individual. To rebalance the trade-off, we can prioritize a sub-population that is more similar to the patient. Conditioning on this sub-population lowers the bias of the estimate but, because the sub-population is smaller, increases the variance. Consequently, we need to find the appropriate balance between the two.
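As a minimal illustration of this conditioning step (a sketch, not the authors' implementation; the helper `to_low_res` is a hypothetical mapping from high- to low-resolution typing), one can restrict a panel's high-resolution haplotype counts to those consistent with the observed low-resolution typing and renormalize:

```python
# Illustrative sketch of conditioning a reference panel's high-resolution
# haplotype counts on a low-resolution typing (not the authors' code).
from collections import Counter

def high_given_low(panel_counts: Counter, low, to_low_res):
    """Estimate P(high | low) from panel counts.

    panel_counts: Counter mapping high-resolution haplotypes to counts.
    low: the observed low-resolution haplotype.
    to_low_res: hypothetical helper mapping a high-resolution haplotype
        to its low-resolution equivalent.
    """
    # Keep only haplotypes consistent with the observed low-resolution typing.
    consistent = {h: n for h, n in panel_counts.items() if to_low_res(h) == low}
    total = sum(consistent.values())  # panel count supporting P(low)
    if total == 0:
        return {}  # the low-resolution typing never occurs in the panel
    # P(high | low) = P(high) / P(low) for each consistent haplotype.
    return {h: n / total for h, n in consistent.items()}
```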
Bayesian estimation
To find a balance between bias and variance, we can use Bayesian estimation. Haplotype frequencies follow a multinomial distribution in which the probability of each haplotype occurring is its own parameter of the model. The maximum likelihood estimate of each parameter is simply the proportion of the small panel that carries that haplotype. To incorporate the larger panel, we treat it as a prior belief regarding the nature of the smaller one. The combination of the two acts as a posterior belief in which the prior has been updated by the evidence (the likelihood).
The posterior $ P\left(\theta |X\right) $ is the probability of the true frequencies $ \theta $ given the observed data $ X $, whereas $ \mathrm{\mathcal{L}}\left(X|\theta \right) $ is the likelihood of observing the data given particular true frequencies. That leaves the prior $ P\left(\theta \right) $, which is the probability assigned to a set of frequencies being the true one before the data are observed.
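These three quantities are related by Bayes' theorem, stated here for completeness: $ P\left(\theta |X\right)=\mathrm{\mathcal{L}}\left(X|\theta \right)P\left(\theta \right)/P\left(X\right)\propto \mathrm{\mathcal{L}}\left(X|\theta \right)P\left(\theta \right) $, where $ P\left(X\right) $ is a normalizing constant that does not depend on $ \theta $.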
The prior distribution most appropriate for this application is the Dirichlet distribution. As with the multinomial, each haplotype has its own parameter. However, instead of each being a proportion bounded by $ \left[0,1\right] $, the parameters are unbounded above: $ \left[0,\infty \right) $. This unboundedness yields an easily interpretable quantity, the effective sample size of the prior belief (the sum of all parameters), which can be tuned to balance the influence of the large panel against the small. Conveniently, the Dirichlet prior also has the desirable property that the posterior is itself a Dirichlet distribution, so the maximum a posteriori estimate is easily computed (Carlin and Louis, 2008). However, one limitation of this prior is that the unboundedness can cause the simulation to fail to converge in cases where no improvement is possible. In practice, one must choose an upper limit on the effective sample size that is considered effectively infinite so that a result is always returned.
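For completeness, the conjugate update takes the standard textbook form: if the prior is $ \mathrm{Dirichlet}\left({\alpha}_1,\dots, {\alpha}_K\right) $ with effective sample size $ {\alpha}_0={\sum}_k{\alpha}_k $, and the panel contributes haplotype counts $ {x}_1,\dots, {x}_K $ totalling $ n $, then the posterior is $ \mathrm{Dirichlet}\left({\alpha}_1+{x}_1,\dots, {\alpha}_K+{x}_K\right) $ and the maximum a posteriori estimate of each frequency is $ {\hat{\theta}}_k=\left({\alpha}_k+{x}_k-1\right)/\left({\alpha}_0+n-K\right) $, valid when every $ {\alpha}_k+{x}_k>1 $.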
To incorporate the shared-copula hypothesis, we treat the prior in the same manner. The AF of the small reference panel become the multinomial likelihood for the prior, and the AF of the large panel become a Dirichlet hyperprior (a prior on the prior). The resulting posterior for the allele frequencies is then composed with the copula to recover the prior distribution over haplotypes.
Methods
The data for this study come from the National Marrow Donor Program and contain haplotype frequencies for 21 sub-populations and five standard groupings thereof (Maiers and Gragert, 2020). To simulate having small reference panels with a known distribution, haplotypes are sampled with replacement from the five standard groupings: Black or African American; Asian or Pacific Islander; Hispanic or Latino; Native American; and White (BeTheMatch.org, 2022). These groups have sample sizes of 1,742,191; 1,965,495; 2,714,930; 228,006; and 11,226,174, respectively. They contain 130,972; 149,171; 166,397; 38,244; and 312,928 distinct 7-locus haplotypes, with 642,477 unique haplotypes among them. The number of possible 7-locus haplotypes given the known alleles is of order $ {10}^{18} $, so LD has reduced this count dramatically. The large reference panel is then composed of a combination of the remaining four groupings.
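As a brief illustration (a sketch with hypothetical variable names, not the published pipeline), sampling a simulated small panel with replacement from a grouping's haplotype distribution amounts to a weighted multinomial draw:

```python
# Illustrative resampling of a simulated small reference panel
# (hypothetical names; not the authors' code).
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_small_panel(haplotype_freqs: np.ndarray, panel_size: int) -> np.ndarray:
    """Return haplotype counts for a simulated panel of `panel_size` haplotypes,
    drawn with replacement from the grouping's frequency distribution."""
    return rng.multinomial(panel_size, haplotype_freqs)
```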
This simulation is performed at 30 exponentially spaced sample sizes ranging from order $ {10}^2 $ to $ {10}^7 $, to give adequate accuracy at multiple potential scales, and across the three haplotype sizes: 7-locus, 4-locus, and 3-locus. Given the sampling described above, the squared error of the frequencies estimated by the Bayesian model is computed by taking advantage of the conjugacy of the Dirichlet prior and posterior. The posterior estimate adds the prior's estimated observations of each haplotype (the large reference panel frequencies scaled by the effective sample size) to the actual observations and then normalizes this set of parameters to sum to one so that it is comparable to the true frequencies. The bisection method is then used in the first experiment to choose the effective sample size of the prior that minimizes the error. The second experiment optimizes the effective sample sizes of both the prior and the hyperprior.
The objective function of these optimizations is the Euclidean distance between the maximum a posteriori estimate of the haplotype distribution and the observed full data for each group, summed across all sample sizes. This identifies the effective sample size that provides the greatest benefit across all settings, as well as the largest small-reference-panel size for which Bayesian performance still falls outside the Frequentist 95% confidence interval. Each step of the optimization uses 100 replications of the sampling, and once the optimal effective sample size(s) are found, another 200 replications are performed for additional precision in the presented results.
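The following is a minimal sketch of the update and search just described (illustrative only: variable names are hypothetical, and scipy's bounded scalar minimizer stands in for the bisection search used in the paper):

```python
# Sketch of the posterior update and effective-sample-size search
# (hypothetical names; not the published implementation).
import numpy as np
from scipy.optimize import minimize_scalar

def posterior_estimate(large_panel_freqs, small_panel_counts, ess):
    """Prior pseudo-counts (large panel scaled by ess) plus observed counts,
    normalized to sum to one."""
    params = ess * large_panel_freqs + small_panel_counts
    return params / params.sum()

def objective(ess, replicate_counts, large_panel_freqs, true_freqs):
    """Mean Euclidean distance to the true frequencies over replications."""
    dists = [np.linalg.norm(posterior_estimate(large_panel_freqs, counts, ess) - true_freqs)
             for counts in replicate_counts]
    return float(np.mean(dists))

# Hypothetical usage: `replicates` holds resampled count vectors for one panel size.
# result = minimize_scalar(objective, bounds=(1.0, 1e7), method="bounded",
#                          args=(replicates, large_panel_freqs, true_freqs))
# best_ess = result.x
```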
The second experiment uses two types of copula: multiplicative and additive. A multiplicative copula assumes that the distribution of haplotypes is proportional to a constant multiple of the haplotype distribution without LD, as in (3). The constant is shared across groups, but each group has its own haplotype distribution without LD. An additive copula is similar, except its constant is added rather than multiplied, as in (4).
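Equations (3) and (4) are not reproduced in this section. One plausible reading of the description above, offered only as a sketch, introduces a shared haplotype-specific factor $ {c}_h $ (an assumption of this sketch) and group $ g $'s no-LD haplotype distribution $ {q}_g\left(h\right) $ (the product of that group's allele frequencies at each locus): the multiplicative form would be $ {P}_g\left(h\right)\propto {c}_h\cdot {q}_g\left(h\right) $ and the additive form $ {P}_g\left(h\right)\propto {c}_h+{q}_g\left(h\right) $, with $ {c}_h $ estimated from the large panel and $ {q}_g\left(h\right) $ from the group in question.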
The program is written in Python 3.9.7 and the source code is available to the public (Schellhas, 2022).
Results
The first experiment shows a wide range of optimal effective sample sizes and of reference panel sizes that would benefit from Bayesian estimation, with similar relationships across the different haplotype sizes. The details are available in Table 1, and complete details for all sample sizes are available in Figures 1–3. With such wide variance in such a small sample, no conclusion can be drawn regarding the best effective sample size to use with a new sub-population. It is reasonable to say that Bayesian estimation would be useful for panels of size less than 80,000 with an effective sample size of less than 1,000 when the complete 7-locus haplotype is used. These numbers drop to about 10,000 and 300, respectively, for 4- and 3-locus haplotypes. For less conservative practitioners, the effective sample size could fairly safely be increased to 2,000 (or 600) without increasing the bias much. This experiment also found that the Native American panel (N = 228,006) would benefit from Bayesian estimation with the remaining four sub-populations as a prior, since such a prior was found to be useful up to a panel size of about 400,000 for 7-locus haplotypes.
The second experiment failed to converge to an optimal effective sample size for the hyperprior at any haplotype size. Since the parameters of the Dirichlet distribution are unbounded above, the search algorithm could continue increasing them indefinitely. This implies that the large-panel AF should have an infinite effective sample size, which means that using the small-panel AF with the copula estimated from the large panel is not useful. That makes the result of experiment two identical to that of experiment one. Thus, there is no evidence that the LD is the same across sub-populations relative to their AF.
Discussion
The simulations demonstrated the benefit of tempering the variance of haplotype frequency estimates using Bayesian estimation: including the larger reference panel improves performance despite its being drawn from other ethnicities. It seemed reasonable to expect the LD present in HLA to have similar causes across ethnicities and thus a shared influence when attempting HLA imputation. The lack of any evidence of this using these two copulas implies that either (a) the sources of LD differ between groups or (b) none of the shared structure is captured by either of these copulas. Both possibilities are reasonable directions for future research.
The lack of convergence of the simulations may cast some doubt on the results, but the algorithm was allowed to run until the effective sample size grew to a point where the shared-copula results were statistically indistinguishable from those of the copula-free first experiment.
Conclusion
LD in HLA does not appear to have a simple relationship with the available genetic material in a given population. This implies either that the biological pressures have historically differed across sub-populations or that there is a feature of HLA that is not made immediately apparent by how the antigens are currently encoded.
Funding Information
This research received no grants from any funding agency, commercial, or not-for-profit sectors.
Authorship Contribution
Investigation: D.S.; Methodology: D.S.; Writing—original draft: D.S.; Data curation: R.C.G.; Resources: R.C.G.; Supervision: R.C.G.; Writing—review and editing: R.C.G.
Data Availability Statement
The data that support the findings of this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.4474442. The source code is available on GitLab at https://gitlab.com/dschell/hla-shared-copula/.
Conflict of Interest
The authors have no conflicts of interest to declare.
Comments
Comments to the Author: This computational study is overall very well presented, and the conclusions appear to be justified by the data. However, I recommend that the manuscript be sent to a statistical editor for further validation since I am not a statistics expert.
Minor issues to be fixed:
1. The authors should increase the size of all figure panels. The font size in the figures is too small and the labels are difficult to read.
2. The authors provide little if any discussion of their results and do not highlight the limitations. Thus, a brief Discussion should be added, either as a separate section or as part of the Results.
3. The first sentence in the legend to Table 1 (“...and largest panel size that benefits from additional references vary...”) appears to be grammatically incorrect.