Introduction
Multivariate models are of great importance in quantitative genetics. In animal breeding, the value of a candidate for selection as a prospective parent of the next generation often is a function of several traits (Gianola and Sorensen, Reference Gianola and Sorensen2004). A typical characteristic of the multi-trait mixed models is the increase in the number of parameters with the increase in the number of the traits involved in genetic analysis, which may compromise the feasibility of genetic analysis (Silva et al., Reference Silva, Paiva, Botelho, Carrara, Lopes, Silva, Veroneze, Sterman Ferraz, Eler, Mattos and Gaya2021). Structural equation modelling (SEM) is an advanced statistical technique developed by Wright (Reference Wright1921), which allows for the exploration of causal relationships among phenotypes in a multivariate setting. SEM employs diverse models to depict the relationships among observed variables and enables a quantitative evaluation of a theoretical or measurement model for these relationships (Schumacker and Lomax, Reference Schumacker and Lomax1996). This kind of modelling assesses the functional connections between the traits, allowing inferences to be made about causal relationships and the strength of their associations (Rosa et al., Reference Rosa, Valente, de los Campos, Wu, Gianola and Silva2011). Latent variable modelling is a primary application of SEM in a range of scientific disciplines (Schumacker and Lomax, Reference Schumacker and Lomax1996).
As a dimension reduction technique, latent variable modelling can be applied to phenotypic data for reducing dataset dimensionality and minimizing computational complexity. For example, Silva et al. (Reference Silva, Paiva, Botelho, Carrara, Lopes, Silva, Veroneze, Sterman Ferraz, Eler, Mattos and Gaya2021) demonstrated that a small set of latent variables can effectively reduce the dimensionality of phenotypic data, thereby alleviating the complexity that arises from the over-parameterization of the model. Statistically, latent variables are defined as variables that are not directly measurable but can be characterized by several observed phenotypes. A latent variable is constructed or measured using observed phenotypes (Schumacker and Lomax, Reference Schumacker and Lomax1996).
Latent variable modelling provides the opportunity to investigate biologically complex phenomena by reducing the data dimensionality in such a way that many phenotypes are combined to represent a few underlying concepts of interest (Leal-Gutierrez et al., Reference Leal-Gutiérrez, Rezende, Elzo, Johnson, Penagaricano and Mateescu2018). The latent traits are assumed to be unobserved, but they are believed to explain the covariation among the observed variables. Confirmatory factor analysis (CFA) is one of the methods that can be used in SEM to reduce the dimensionality of phenotypic data in genomic studies (Yu et al., Reference Yu, Campbell, Zhang, Walia and Morota2019). CFA is usually used to test the hypothesis that a set of observed variables are related to a smaller number of latent factors, also known as latent variables or latent traits (Momen et al., Reference Momen, Bhatta, Hussain, Yu and Morota2021).
Genome-wide association study (GWAS) is a well-known method for detecting causal genes or genomic regions which are related to economic traits in livestock populations. This approach uses many single nucleotide polymorphisms (SNP) to screen the whole genome for target genes that are associated with a trait of interest. So, GWAS has become a valuable and effective method to identify candidate genes for important economic traits in livestock. In recent years, GWAS have been included in genetic breeding programmes for livestock species such as pigs and have identified many genes or molecular markers that could regulate important economic traits (Le et al., Reference Le, Christensen, Nielsen and Sahana2017; Liu et al., Reference Liu, Song, Jiang, Jiang, Zhang, Liu, Shi, Ding and Wang2021; Wang et al., Reference Wang, Wang, Li, Sun, Chen, Yan, Dong, Pan and Lu2022).
In production animals, body size (BS) is an important trait that reflects the overall appearance of animals (Liu et al., Reference Liu, Song, Jiang, Jiang, Zhang, Liu, Shi, Ding and Wang2021). Body size is a typical quantitative (or complex) trait and understanding the genetic mechanisms controlling body size differences among individuals can effectively help control the growth and production of animals (Niu et al., Reference Niu, Kim, Choi, Kim, Kim and Kim2013). For example in pigs, body size can objectively reflect the response of pigs to their environment (Ohnishi and Satoh, Reference Ohnishi and Satoh2018). In pig breeding, the body shape character index is often used as the most direct production index of a pig (Liu et al., Reference Liu, Song, Jiang, Jiang, Zhang, Liu, Shi, Ding and Wang2021). Various body measurements and morphological traits such as body length, body height, chest width, rump width and heart girth have been considered as measures of body size in pigs (Song et al., Reference Song, Zhang, Zhang and Ding2019; Liu et al., Reference Liu, Song, Jiang, Jiang, Zhang, Liu, Shi, Ding and Wang2021) species. Leal-Gutierrez et al. (Reference Leal-Gutiérrez, Rezende, Elzo, Johnson, Penagaricano and Mateescu2018) identified latent variables underlying carcass and meat quality traits in beef cattle and performed whole-genome scans for these latent variables to identify associated genomic regions and individual genes, the application of defining a latent trait for body size measurement in genomic analysis has yet to be studied in pigs.
This study aimed to model and identify the latent trait of body size (BS) using five measurable body dimensions, conduct a GWAS for BS, and identify candidate genes and genomic regions associated with BS in Yorkshire pigs.
Materials and methods
Data source and phenotypes
The data used in the present study originated from an elite Chinese pig breeding farm in Beijing that is a descendant of American Yorkshire populations (http://figshare.com/articles/single-step-strategies/7434203). In the present study, the phenotypic records of five linear body measurements of 5573 Yorkshire pigs including body length (BL), body height (BH), chest width (CW), chest girth (CG) and tube girth (TG) were used. Descriptive statistics for the measured body dimension traits used in the present study are shown in Supplementary Table 1. More detailed information about the animals and phenotypes was presented by Song et al. (Reference Song, Zhang, Zhang and Ding2019) and Liu et al. (Reference Liu, Song, Jiang, Jiang, Zhang, Liu, Shi, Ding and Wang2021).
Genotype data, quality control and imputation
In the present study, genotype data on 592 out of 5572 Yorkshire pigs (Song et al., Reference Song, Zhang, Zhang and Ding2019) were used. Animals were genotyped using the PorcineSNP80 Bead Chip (Illumina, San Diego, CA, United States), which includes 68 528 SNP across the whole pig genome. PLINK software version 2.0 (Purcell et al., Reference Purcell, Neale, Todd-Brown, Thomas, Ferreira, Bender, Maller, Sklar, De Bakker, Daly and Sham2007) was used for quality control of genotype data and SNP with a maximum missing rate of 0.10, a maximum individual missing rate of 0.10, a minor allele frequency (MAF) ⩽0.01, and Hardy-Weinberg equilibrium (HWE) with a P value < 10−7 were excluded. Missing genotypes were imputed using Beagle software version 5.4 (Browning and Browning, Reference Browning and Browning2007). Among 68 528 SNPs, we removed 6197 SNPs related to those unmapped and those located on mitochondrial DNA and sex chromosomes; 9621 SNP were discarded by applying the thresholds considered for maximum missing rate, maximum individual missing rate, minor allele frequency and Hardy-Weinberg equilibrium. We hypothesized that BS architecture may be influenced by multiple genetic loci across the genome, and this highlights the critical role of quality control measures in ensuring the reliability of GWAS results for this analysis. After quality control, 583 genotyped individuals remained and 52 710 SNPs were finally used for further analysis.
Latent variable modelling by CFA
The theoretical model hypothesized for constructing the BS latent variable from the considered five linear body dimensions is shown in Fig. 1. The CFA technique is used to extract the underlying latent factors that contribute to variations in body size measurements. To perform a CFA for extracting a latent factor related to body size, it is necessary to establish a measurement model that specifies the connections between the observed variables and the latent factor(s). The following CFA model was considered in this study:
Where, X is the matrix of measured variables (BL, BH, CW, CG and TG), ξ is the vector of latent factors. Λ matrix contains the factor loadings that associate these factors to the measured variables, and δ is the residuals vector. The model was fitted by applying the maximum likelihood estimation method. The lavaan R package (Rosseel, Reference Rosseel2012; R Core Team, 2021) was used to fit the above CFA model.
The overall fit of the hypothesized model for constructing BS was evaluated by using fit indices, including standardized root mean square residual (SRMR) (Bentler, Reference Bentler1990), root mean square error of approximation (RMSEA) (Steiger, Reference Steiger1990), Tucker-Lewis Index (TLI) (Bentler, Reference Bentler1990) and comparative fit index (CFI) (Bentler, Reference Bentler1990).
Univariate genome-wide association analysis
A univariate GWAS model using the SNP Snappy strategy (Meyer and Tier, Reference Meyer and Tier2012) was applied by fitting the following model:
where y is the vector of latent variable BS, s is the additive effect for SNP marker j, b is the vector of fixed effects including overall mean and herd-year-season-sex (the combination of the effects of herd, year, season, and sex of pigs) effect, g is the vector of genomic value effects and e is the vector of residual effects. Furthermore, the body weight of animals was considered a covariate. W is the matrix of genotype codes of SNP marker j (AA = 0, AB = 1 and BB = 2). X and Z are the incidence matrices associating b and g to y, g following a normal distribution of N (0, G$\sigma _g^2$), in which $\sigma _g^2$ is the variance of additive genetic effects. Here, G is the marker-based genomic relationship matrix (VanRaden, Reference VanRaden2008) to account for the effects of population structure and reduce the likelihood of false positives and increase the power to detect true associations between genetic variants and traits of interest. e following a normal distribution of N (0, I$\sigma _e^2$), in which $\sigma _e^2 \;$is the residual variance. The WOMBAT program (Meyer, Reference Meyer2007) was used for model fitting and estimating SNP solutions.
The genomic inflation factor was estimated as the median of the χ 2 test statistics of the nominal P values, divided by the expected median of the χ 2 distribution. If the data follow the standard χ 2 distribution, the expected genomic inflation factor value would be 1. If the genomic inflation factor value is greater than 1, it provides evidence for some systematic bias (Iterson et al., Reference Iterson, van Zwet and Heijmans2017). In the present study, the method proposed by Iterson et al. (Reference Iterson, van Zwet and Heijmans2017) was used to correct P values for the genomic inflation factor. In this method, test statistics (the ratio of SNP solution to the corresponding standard error) are divided by the genomic inflation factor. In the present study, the genomic inflation factor was obtained at 1.01.
Significance testing of SNP effects was done by applying a two-sided t test and the P value of each SNP effect was calculated accordingly. The Manhattan plot of –log10 (P values) of SNP effects was obtained by using the ‘qqman’ R package (Turner, Reference Turner2018).
To account for multiple testing that resulted in false positives, the false discovery rate (FDR) (Weller et al., Reference Weller, Song, Heyen, Lewi and Ron1998) was controlled at 5% genome-wide levels to identify significant associations. Significant thresholds for genome-wide (P < 0.05) and suggestive (that is, one false positive per genome scan) levels were 9.48 × 10−7 and 1.90 × 10−5, corresponding to the –log 10 (P value) of 6.02 and 4.72, respectively.
Identification of candidate genes and gene ontology enrichment analysis
In the present study, Ensembl Variant Effect Predictor (VEP) web interface (http://www.ensembl.org/ vep.) is used for the analysis, annotation and prioritization of genomic variants in the coding and non-coding regions (McLaren et al., Reference McLaren, Gil, Hunt, Riat, Ritchie, Thormann, Flicek and Cunningham2016). After identifying significant SNPs by GWAS, the genes located in the 50 Kb downstream and 50 Kb upstream region of the significant SNPs were annotated with reference to the Sus scrofa 11.1 genome assembly (http://www.ensembl.org/Sus_scrofa/Info/Index/). The gene ontology (GO) enrichment analysis is performed via the web-based Gene Ontology Resources interface (http://geneontology.org/) by considering the biological process of the identified genes.
Results
Extracting BS latent trait through factor analysis
Figure 1 provides a comprehensive representation of the hypothesized model considered for BS by underlying relationships within the system under investigation and incorporates BS as a latent variable in the centre of the five measured variables. The goodness of fit measures including CFI, TLI, RMSEA and SRMR were 0.96, 0.93, 0.09 and 0.03, respectively. The factor loadings for the observed variables of BL, CG, TG, BH and CW were 0.62, 0.84, 0.57, 0.54 and 0.63, respectively and statistically significant (P < 0.01). Positive factor loadings indicate that an increase in the observed variable is associated with an increase in the latent variable. In this study, the CG variable had the strongest relationship with the BS latent variable based on the estimated factors loading.
Genome-wide association analysis and selecting significant SNP
Figure 2 represents the Manhattan plot of the GWAS result for BS. By considering a threshold of 6.02 for genome-wide significant level (P < 0.05), a total of 53 significant SNPs was identified. As shown in Supplementary Table 2, 10 out of the 53 significant SNPs were located in chromosome 11.
GO enrichment analysis
The results obtained from GO revealed that 24 genes were in the flanking regions of the top SNP (Table 1). Some information about the characteristics of SNP related to the biological functions of the genes is presented in Supplementary Table 3. Among them, nine genes including Rho GTPase activating protein 12 (ARHGAP12), transmembrane protein 108 (TMEM108), T-cell lymphoma invasion and metastasis inducing factor 1 (TIAM1), ras homologue gene family member B (RHOB), POU class 4 homeobox 1 (POU4F1), follistatin-related protein 4 (FSTL4), cellular communication network factor 2 (CCN2), beaded filament structural protein 2 (BFSP2) and attractin-like protein 1 (ATRNL1) were found to be related to the BS. A summary of the biological processes related to these candidate genes is shown in Table 2.
SCA, Sus scrofa autosome chromosomes.
a False discovery rate (FDR) adjusted P values.
b Genes in ±50-kb flanking regions of significant SNP.
Discussion
The goodness of fit measures appropriately indicates the adequacy of the confirmatory factor model proposed for the latent variable of body size (BS) in the present study. Leal-Gutierrez et al. (Reference Leal-Gutiérrez, Rezende, Elzo, Johnson, Penagaricano and Mateescu2018) applied a similar methodology for defining latent variable of carcass quality by applying observed variables of quality grade, fat over ribeye and marbling in beef cattle.
All loading factors were positive and implying that an increase in the observed variable is associated with an increase in the latent variable of BS. In this study, the CG variable had the strongest relationship with the BS latent variable based on the estimated loading factors. The directions of the loading factors with BS were positive and agree with the theoretical model where the considered observed variables are positively related to BS. The positive direction was expected given that the BS construct reflects mainly body measurement traits (Kominakis et al., Reference Kominakis, Hager-Theodorides, Zoidis, Saridaki, Antonakos and Tsiamis2017). The results suggest that the unobserved variable of BS plays a significant role in explaining the interrelationships among the observed variables. These findings underscore the necessity of incorporating latent variables when modelling intricate systems. Identifying genomic regions affecting BS can provide a suitable opportunity for improving this latent trait and correspondingly the associated observed and measurable traits in pigs. Liu et al. (Reference Liu, Song, Jiang, Jiang, Zhang, Liu, Shi, Ding and Wang2021) pointed out that SNPs controlling body size characteristics are widely distributed on the genome, implying that fitting the infinitesimal model is appropriate. Boyle et al. (Reference Boyle, Li and Pritchard2017) remembered that for complex traits such as body height, action sites are widely distributed across the entire genome, denoting that approximately all genes are involved in the regulation of this trait.
In the present study, nine candidate genes were detected which were potentially associated with BS in pigs. The relevant biological processes included anatomical structure development, homoeostasis, morphogenesis and maturation; regulation of anatomical structure size, regulation of anatomical morphogenesis, response to growth factor, developmental cell growth, developmental growth involved in morphogenesis, regulation of cell growth, regulation of cell size, regulation of anatomical structure size, cell morphogenesis, skeletal system morphogenesis, animal organ morphogenesis, anatomical structure formation involved in morphogenesis and tissue morphogenesis.
Cellular Communication Network-2 (CCN2), formerly termed Connective Tissue Growth Factor (CTGF), is a matricellular protein with four distinct structural modules that can exert a dual function as a matricellular protein and as a growth factor (Rodrigues-Diez et al., Reference Rodrigues-Diez, Tejera-Munoz, Esteban, Steffensen, Rodrigues-Díez, Orejudo, Rayego-Mateos, Falke, Cannata-Ortiz, Ortiz, Egido, Mallat, Briones, Auxiliadora Bajo, Goldschmeding and Ruiz-Ortega2022). CCN2 is a protein belonging to the Cellular Communication Network (CCN) family of secreted extracellular matrix-associated proteins (Leguit et al., Reference Leguit, Raymakers, Hebeda and Goldschmeding2021). Besides, CCN2 and its fragments have been involved in the regulation of several biological processes, including cell proliferation, differentiation, adhesion, migration, cell survival, apoptosis and the making of extracellular matrix products, as well as in more complex processes such as embryonic development, angiogenesis, chondrogenesis, osteogenesis, fibrosis, mechanotransduction and inflammation (Leguit et al., Reference Leguit, Raymakers, Hebeda and Goldschmeding2021). Rebolledo et al. (Reference Rebolledo, Acuna and Brandan2021) pointed out that in the skeletal muscle of mammals, most of the CCN family members are expressed during embryonic development or in adulthood. Their roles during the adult stage are related to the regulation of muscle mass and regeneration, maintaining vascularization and the modulation of skeletal muscle fibrosis. POU4F1, also known as Brn3a, plays a crucial role in normal cardiac development during embryonic stages (Maskell et al., Reference Maskell, Qamar, Babakr and Hawkin2017). In addition, studies in mice have shown that POU4F1 is associated with the development of spiral ganglion neurons (Maskell et al., Reference Maskell, Qamar, Babakr and Hawkin2017). RHOB, Ras Homologue family member B, is a GTP-binding protein involved in vesicular trafficking in anterior pituitary cells (Cussac et al., Reference Cussac, Leblanc, LHeritier, Bertoglio, Lang, Kordon, Enjalbert and Saltarelli1996) and influences endothelial cell morphogenesis through regulating growth factor receptor trafficking and signalling (Howe and Addison, Reference Howe and Addison2012). RHOB has proven to have specific and pleiotropic functions in organisms from mammalian development to DNA damage survival responses (Vega and Ridley, Reference Vega and Ridley2018).
TMEM108 or Transmembrane protein 108, also named Retrolinkin plays an important role in the central nervous system (Xu et al., Reference Xu, Ji, Zhang, Zhang, Nie, Xu, Zhang and Zhang2016). A GWAS found that TMEM108 is a susceptibility gene for psychiatric disorders, including schizophrenia and bipolar disorder (Neale et al., Reference Neale, Medland, Ripke, Asherson, Franke, Lesch, Faraone, Nguyen, Schafer, Holmans, Daly, Steinhausen, Freitag, Reif, Renner, Romanos, Romanos, Walitza, Warnke, Meyer, Palmason, Buitelaar, Vasquez, Lambregts-Rommelse, Gill, Anney, Langely, O'Donovan, Williams, Owen, Thapar, Kent, Sergeant, Roeyers, Mick, Biederman, Doyle, Smalley, Loo, Hakonarson, Elia, Todorov, Miranda, Mulas, Ebstein, Rothenberger, Banaschewski, Oades, Sonuga-Barke, McGough, Nisenbaum, Middleton, Hu and Nelson2010). TIAM1 regulates signalling pathways that affect the control of neuronal morphogenesis and neurite outgrowth by modulating the actin cytoskeletal network (Lu et al., Reference Lu, Hernan, Marcogliese, Huang, Gertler, Akcaboy, Liu, Chung, Pan, Sun, Oguz, Oztoprak, de Baaij, Ivanisevic, McGinnis, Guillen Sacoto, Chung and Bellen2022). FSTL4 regulates cell size and is considered as a candidate gene that might affect rump angle in Chinese Holstein cows (Lu et al., Reference Lu, Mohamed Abdalla, Nazar, Fan, Zhang, Wu, Xu and Yang2021). Intermediate filaments, which integrate individual cells into their respective tissues, are key components of the cytoskeleton in all vertebrate cells. BFSP2 is one of them (Song et al., Reference Song, Landsbury, Dahm, Liu, Zhang and Quinlan2009). ATRNL1 or Attractin-Like protein 1 is a 1379 amino acid single-pass type I membrane protein that may play a role in melanocortin signalling pathways that regulate energy homoeostasis. The ATRNL1 gene is conserved in canine, bovine, mouse, rat, chicken and zebrafish (Stark et al., Reference Stark, Bruno, Mountford, Lockhart and Amor2010). ARHGAP12 is specifically expressed in the CA1 region of the hippocampus, where it localizes to the post-synaptic compartment of excitatory synapses. ARHGAP12 negatively controls spine size. ARGHAP12 knockdown results in precocious maturation of excitatory synapses, as indicated by a reduction in the proportion of silent synapses. ARHGAP12 regulates excitatory synaptic structure and function during development (Ba et al., Reference Ba, Selten, van der Raadt, van Veen, Li, Benevento, Oudak er, Lasabuda, Letteboer, Roepman, van Wezel, Courtney, van Bokhoven and Nadif Kasri2016).
The results imply that these genes could be promising candidates for further exploration of the underlying mechanisms of body size variation. Furthermore, the findings have significant practical implications for enhancing the efficiency and profitability of pig farming through genetic selection and may advance our understanding of body size regulation in other organisms.
Conclusion
In the present study, we implemented appropriate latent variable modelling by applying the CFA technique for constructing the latent variable of BS from five body measurement traits in Yorkshire pigs. Genomic best linear unbiased prediction (GBLUP)-GWAS was used for identifying the SNPs and candidate genes associated with latent trait BS. In total, 53 significant SNPs were identified; widely distributed on all chromosomes except for chromosomes 9 and 18. Nine genes were related to BS in Yorkshire pigs which involved several biological processes like regulation of anatomical structure, morphogenesis and development, regulation of cell size, growth and morphogenesis.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0021859623000424
Acknowledgments
We thank Mahmoud Amiri Roudbar for helping with genotype imputation.
Authors’ contributions
Ali Esmailizadeh, Mehdi Momen and Morteza Mokhtari conceptualized the study. Elahe Sanjari Banestani and Morteza Mokhtari performed statistical analyses and wrote the preliminary version of the manuscript. Ali Esmailizadeh, Ahmad Ayatollahi Mehrgardi and Mehdi Momen edited the initial version of the manuscript.
Funding statement
This research received no specific grant from any funding agency, commercial or not-for-profit sectors.
Competing interest
None.
Ethical standards
Not applicable.