Introduction
The linear amino acid chains of most proteins fold into a specific three-dimensional (3D) structure to become stable and functional. Protein folding is determined by the biophysical and biochemical constraints on the amino acid sequences (Anfinsen, 1973). Misfolded proteins usually malfunction and can often be lethal to the cell if not degraded. The Protein Data Bank (PDB), established in 1971 with a handful of protein structures determined by X-ray crystallography, is one of the first open-access data repositories (Bank, 1971). The PDB now hosts more than 180,000 structures of proteins, nucleic acids, and assemblies of supramolecular complexes. The PDB has transformed many life science disciplines by enriching our understanding of the physical and chemical basis of the fundamental biological processes. In celebrating 50 years of the PDB, recent article collections recount how the PDB changed biology (Berman and Gierasch, 2021; Gierasch and Berman, 2021; Zardecki et al., 2021). On the heels of these celebrations, the latest computational tools for ab initio structure prediction joined the celebratory bandwagon.
Considered to be a once-in-a-generation advance, the latest computational tools such as AlphaFold (Tunyasuvunakool et al., 2021) and RoseTTAFold (Baek et al., 2021) are a major leap in ab initio protein structure prediction.
Ab initio structure prediction from protein sequences that have no representative structures is a notoriously hard problem. The new algorithms extract information from protein sequences that have been ‘trained’ over eons by evolution to fold spontaneously into specific and complex 3D shapes. The significance of these new algorithms compared to their predecessors lies in (a) the speed of determining the best folded conformation of a given linear amino acid sequence among the numerous possible 3D conformations and (b) the unmatched accuracy of the predicted structure, which is comparable to structures determined by X-ray crystallography and other experimental methods (Baek et al., 2021; Tunyasuvunakool et al., 2021).
Tools such as AlphaFold and RoseTTAFold are a shot in the arm for the efforts to map the ‘protein universe’. The protein universe is the assortment of all proteins from all organisms that have evolved on Earth over ≈3.8 billion years (Levitt, 2009). Whether such computational de novo structure predictions can replace the many experimental methods is an intriguing prospect for the future. At any rate, the success of the algorithmic predictions and the boost in their predictive power rely on the thousands of experimentally vetted structures available in the PDB. The training data from which the rules and physicochemical constraints of protein folding are learnt underscore the invaluable insights gained from these high-resolution structure data.
The availability of the large number of protein structures has significantly improved efforts to resolve another remarkably hard problem: reconstructing evolution itself, specifically, looking back into the earliest stages of cellular evolution through ‘evolutionary telescopes’. The tremendous advantages of a protein structure-based evolutionary telescope (i.e., a phylogeny) over its predecessor, the more commonly used sequence-based telescope, are underappreciated. Current ‘sequence vs structure’ debates (Kurland and Harish, 2015a; Harish, 2018; Harish and Morrison, 2020; Williams et al., 2020) are reminiscent of the ‘morphology vs molecules’ disputes (Simpson, 1964) that arose about 50 years ago over which data type is better suited to investigating evolutionary problems. In celebrating the transformative influence of the PDB and structure-based insights on resolving a myriad of biological problems, this essay puts the spotlight on the impact of structural biology in bringing longstanding debates in evolutionary biology to empirical resolution. Given the decades-long development, the historical context of the current thinking and how it can be re-evaluated in light of abundant new structural data are discussed. In so doing, the analyses and new evidence presented here decisively show that structure-based data are much superior to the widely used sequence data for reconstructing the earliest stages of the evolution of life.
Evolution of the protein universe recapitulates the evolution of the cellular universe
The vast majority of cellular life is microbial (Locey and Lennon, 2016); single-celled species, in particular, populate the bulk of the universe of cellular organisms. Proteins are components of the molecular machinery involved in all cellular functions (Figure 1a), from the birth through death of cells. Proteins are not only the workhorses of cells that drive the molecular machinery, but they also make up the infrastructure that maintains the morphology and internal organization of cells (Figure 1b). Enzymatic proteins that catalyze biochemical reactions are an example of the former and cytoskeletal proteins of the latter. Cells can be seen as membranous ensembles studded with proteins inside and out (Figure 1). Based on the extent of membranous organization observed in ultrastructures of cells, two basic types of cellular organisms are known (Figure 1c):
• Eukaryotes (Greek; eu, ‘well’ and karyon, ‘kernel’): Organisms with a well-defined membrane-bound nucleus and other membrane-bound intracellular compartments.
• Akaryotes (Greek prefix ‘a-’ meaning ‘without’): Organisms without a nucleus or other membrane-bound compartments.
The terms eukaryote and akaryote are comparative descriptions of cell ultrastructure, though the term ‘prokaryote’ is commonly used to represent organisms with akaryotic cell organization. However, the term ‘prokaryote’ is misleading (Pace, 2006) as it is based on a misconception that prokaryotes are ancestors of eukaryotes, which runs counter to Darwin’s proposal that all species share a common ancestor (Darwin, 1859).
The concept of a protein universe was put forward to organize proteins in a natural hierarchical system using tools of protein taxonomy (Ladunga, 1992). The number of distinct and stable 3D structures possible is limited by the physical and chemical constraints on protein folding, and thus, the number of unique 3D conformations (or folds) possible was estimated to be finite. Furthermore, based on the relationship between sequence and structure divergence in proteins (Chothia and Lesk, 1986), it was predicted that a vast majority of proteins belong to no more than a thousand structural families (Chothia, 1992). At the time of this prediction almost 30 years ago, 866 structures were available in the PDB. In spite of the exponential growth of the number of structures available in the PDB, the prediction has turned out to be largely true. The structure-based protein taxonomies developed by the SCOP (Murzin et al., 1995) and CATH (Orengo et al., 1997) classification systems identify ≈1,500 and ≈1,400 folds, respectively. Despite being finite, and in spite of the remarkable advances in experimental 3D structure determination technologies, the protein universe is yet to be fully mapped (Levitt, 2009; Waman et al., 2020). Due to the relative ease of DNA sequencing, mapping genomes has far outpaced protein structure determination. Tools such as AlphaFold and RoseTTAFold could significantly speed up the efforts to map the protein universe.
At any rate, up to 70% of proteins of many species can be mapped to known structures (Kurland and Harish, 2015a; Waman et al., 2020). This is already providing a substantial view of the distribution of proteomes in the cellular universe (Buchan et al., 2002; Chothia et al., 2003). In addition, the availability of such large numbers of protein structures has proved to be a new source of data as well as a novel type of phylogenetic marker for (a) developing a new class of empirical models, namely nonstationary and nonreversible evolution models for statistical phylogenetics (Harish and Kurland, 2017a, 2017b), (b) empirical testing of competing hypotheses for the evolution of cellular life, including eukaryogenesis (Harish and Kurland, 2017a, 2017c), and (c) genome/proteome scale comparative analyses for reconstructing the major patterns of diversification of cellular life (Yang et al., 2005; Fang et al., 2013; Harish et al., 2013). Importantly, the first two were previously not possible with commonly used nucleic acid and protein sequence data (Kurland and Harish, 2015b; Harish, 2018). In the following sections, I will discuss the pros and cons of both sequence and structure data and show why structure-based features are superior for a reliable reconstruction of the evolution of the cellular universe, or life as we know it.
The search for a perfect evolutionary character
Biologists utilize a variety of features or characters to describe and study organisms in a comparative framework. A character is any recognizable and heritable trait, feature, or property of an organism (Figure 2a) that can be employed for comparative analysis of character variances as a measure of evolutionary divergence of species (Figure 2b). Thus, ‘characters’ are fundamental data for evolutionary analyses and the character concept is central to evolutionary biology (Wagner, 2000). Initially, comparisons of morphological characters were used in taxonomic classification (Linnaeus, 1758), to study evolutionary processes (Darwin, 1859), and to determine phylogenetic relationships (Hennig, 1965). In principle, a single distinctive character is sufficient to distinguish species and groups of species from one another. For instance, the vertebral column is the defining feature of vertebrates (animals with a bony or cartilaginous backbone), which include more than 60,000 species (Galbusera and Bassani, 2019). Likewise, the ascus, a microscopic structure that produces ascospores, is the defining feature of sac fungi or ascomycetes, with about 65,000 species (Schoch et al., 2009). However, in practice, such defining features are not readily available for all species; hence, multiple characters are used for grouping related organisms into a clade or a monophyletic group, which is composed of a common ancestor and all its lineal descendants in a phylogenetic tree.
Distinctive characters that define clades are called synapomorphies or shared-derived characters, which indicate a monophyletic origin of the character; that is, synapomorphic characters represent a historically unique origin of an evolutionary novelty in the common ancestor of a clade (Hennig, 1965). Hennig reasoned that only synapomorphies should be used to diagnose common descent by tracing character evolution along a phylogeny. Complex developmental pathways and multiple genes are involved in the development of a morphological character. Complex characters like morphological features most likely arose only once and hence are deemed to be homologous. In contrast, characters that evolve independently in multiple lineages are deemed to be homoplasious. The evidence that protein domains are superior characters, while the sequence-based characters (nucleotides and amino acids) are poor-quality data for resolving the deeper divergences of the tree of life, is manifold. Here, I discuss the qualitative and quantitative evidence from recent studies that encourage the use of protein domains (Harish et al., 2013; Harish and Kurland, 2017a; Harish, 2018). In addition, I present new quantitative evidence to show that sequence characters by themselves are unsuitable for confidently resolving the deeper branches of the universal phylogeny.
In mapping the protein universe, a protein domain is the basic unit of structure, function, and evolution (Figure 2a). Domains are independently folding sectors of a polypeptide chain, with a unique 3D shape that is associated with a distinct amino acid sequence profile and a characteristic function (Murzin et al., 1995; Orengo et al., 1997). For these reasons, domains are excellent ‘characters’ to study many aspects of biological evolution. Therefore, tracing the history of the variation of domain composition in species is valuable for determining the evolutionary relationships and patterns of diversification among species groups. The species-specific composition of unique domains was termed ‘intrinsic proteomic complexity’ (Harish and Kurland, 2017a).
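As a rough illustration of how domain composition can be coded as characters, the sketch below turns per-species sets of domain superfamilies into a presence/absence matrix. The species names, the domain identifiers, and the helper names `character_matrix` and `shared_domains` are all invented for this example; the sketch does not reproduce the data or pipeline of the cited studies.

```python
# Hypothetical proteomes: species -> set of domain superfamily identifiers.
# Species names and domain IDs are invented for illustration only.
proteomes = {
    "species_A": {"d1", "d2", "d3", "d4"},
    "species_B": {"d1", "d2", "d3"},
    "species_C": {"d1", "d5"},
}


def character_matrix(proteomes):
    """Return (sorted domain list, {species: [0/1 per domain]})."""
    domains = sorted(set().union(*proteomes.values()))
    matrix = {
        species: [1 if domain in present else 0 for domain in domains]
        for species, present in proteomes.items()
    }
    return domains, matrix


def shared_domains(proteomes, a, b):
    """Domains present in both species a and b."""
    return proteomes[a] & proteomes[b]


domains, matrix = character_matrix(proteomes)
print("domains:", domains)
for species, row in matrix.items():
    print(species, row)
print("shared by A and B:", sorted(shared_domains(proteomes, "species_A", "species_B")))
```

Each column of such a matrix is one character (a domain superfamily) with two states, present or absent, which is the kind of input a phylogenetic analysis of domain gain and loss would start from.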
The idiosyncratic assortment of domains in organisms corresponds to species groups (Figure 2c) and other levels of the taxonomic hierarchy of organismal classification (Harish et al., 2013; Harish and Kurland, 2017b). Advantages of protein domains for assessing both qualitative and quantitative empirical evidence are summarized below. For details and incisive analyses, see the studies by Harish and Kurland (2017a, 2017b), Harish (2018), and Harish and Morrison (2020).
• Protein domains, unlike their component amino acids, provide for a large number of ‘unique’ characters. Latest updates of SCOP and CATH classification of PDB entries identify ≈2,750 and ≈5,500 homologous superfamilies, respectively. This translates to anywhere between 2,750 and 5,500 unique structure characters as opposed to only 20 distinct sequence characters (amino acids).
• Since each homologous domain has a distinct 3D structure, a unique sequence profile, and a characteristic function, substitution between domains is not known. In contrast, repeated substitution of amino acids at the same site is frequent, resulting in a rapid decay of historical signal.
• Independent evolution of complex structural domains in diversified species and ab initio evolution of new proteins by random mutations are both extremely rare. However, domains are lost relatively easily via multiple mechanisms, for example, a mutation causing a premature stop codon or the loss of a genomic locus during genetic recombination. This naturally skewed propensity for loss (death) over gain (birth) of a new domain can be exploited to implement time nonreversible or directional evolution models, which are better suited to reconstruct the universal tree.
• Finally, the relatively lower variation in (a) compositional heterogeneity and (b) rate heterogeneity of birth/death of unique protein domains, compared to point mutations in sequences, supports statistically robust phylogenetic inferences.
Thus, structure-defined characters provide for a robust and reliable resolution of the deeper nodes of the universal tree. A caveat is that some of the recent divergences in certain clades are not as well supported as the deeper ones when only structure-based characters are used (Harish and Kurland, 2017b; Harish, 2018). In contrast to the underappreciated advantages of structure-based data, the deficiencies of sequence data have been hashed out in multiple studies over the past three decades. Here, I will describe key qualitative and quantitative evidence. Although the ribosome as a whole and the small subunit ribosomal RNA (SSU rRNA) in particular were thought to be the ‘universal chronometer’ of evolutionary analyses initially (Woese, 1987), it is now abundantly clear that focusing on the ribosome alone is a reductionist approach (Harish, 2018). The deficiencies of sequence characters, in general, and the resulting error-prone analyses, specifically of the rRNA and r-proteins, are rather pronounced, as shown in many studies during the past two decades (Tourasse and Gouy, 1999; Rokas and Carroll, 2008; Philippe et al., 2011; Gouy et al., 2015). For instance, inclusion (or exclusion) of different sets of ribosomal genes/proteins produces different relationships between Archaea, Bacteria, and Eukarya (Da Cunha et al., 2017). In addition, application of different models of sequence evolution to the same dataset produces different results (Tourasse and Gouy, 1999; Harish, 2018).
These incongruences are due to many well-known deficiencies of sequence data, including, to mention a few:
• higher incidence of homoplasy (or lack of homology) in sequence characters,
• large variation in rates of evolution among different genes/proteins and/or within different sections of the same gene/protein (e.g., in different domains of multi-domain proteins),
• frequent restriction to time-reversible models of evolution.
Chief among these deficiencies is the inapplicability of time nonreversible models of character evolution for rRNA and r-protein sequences, and for sequence-based analyses in general. This serious deficiency of sequence data is demonstrated here with quantitative evidence obtained from model selection tests, as shown in Table 1.
Note: Best-fitting models for the sequence data were determined, in the present study, using ModelFinder (Kalyaanamoorthy et al., 2017) implemented in IQ-TREE (v 2.1.3) (Nguyen et al., 2015). The top ranked time reversible and nonreversible models are shown here; the complete list of models tested is in the Supplementary Material. Sequence alignments used to estimate a global ToL in an earlier study (Williams et al., 2012) were employed. BF scores were estimated from BIC scores in bayestestR (v 0.13.1.7) (Makowski et al., 2019). Model selection tests for the structure-based data are from a previous study (Harish, 2018) using tests implemented in MrBayes (Klopfstein et al., 2015). Time nonreversible models of evolution are highlighted in bold and italicized. lnL, log-likelihood scores; BIC, Bayesian Information Criterion scores; BF, Bayes Factor scores.
Bayes factor (BF) scores are useful to assess the relative merits of competing models, as the BF is regarded as the weight of the evidence coming from the data. A difference in BF scores in the range of 20–150 is typically treated as strong evidence in favor of the better model (and the resulting tree), while a BF difference above 150 is considered very strong empirical evidence (Kass and Raftery, 1995; Bergsten et al., 2013). Thus, the quantitative evidence in Table 1 shows that time reversible models are better fitting models and time nonreversible models are worse fitting models for sequence data, by a large margin. In contrast, time nonreversible models are better fitting models for structure data, by a huge margin. Time reversible models can only produce unrooted trees, which have no evolutionary direction, for example, from ancestor to descendant (Figure 3c). However, in practice a rooted tree, which has an evolutionary direction (Figure 3b), is necessary for almost all evolutionarily relevant answers sought using phylogenetic analyses, such as (i) ancestor–descendant polarity, (ii) branching order in evolutionary history, and (iii) evolutionary groups that are clades (Morrison, 2006). Thus, the limited applicability of nonreversible models is a severe intrinsic deficiency of sequence data, which in and of itself is the source of ambiguity, and often discord, among proponents of competing hypotheses for the relationships and origin of the major clades of life, including the origin of animals and eukaryotes (Kurland and Harish, 2015a; Harish, 2018; Harish and Morrison, 2020).
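To make the link between BIC and BF scores concrete, the sketch below uses the standard approximation ln BF ≈ (BIC_reference − BIC_model) / 2 and then applies the Kass and Raftery (1995) evidence categories. The BIC values are invented and are not taken from Table 1; the function names are hypothetical, and reading the 20–150 and above-150 thresholds as values of the Bayes factor itself is an assumption of this example. The comparison is done on the log scale because BIC differences of realistic size would overflow a direct exponentiation.

```python
import math


def log_bf_from_bic(bic_model, bic_reference):
    """Approximate ln BF of `model` against `reference` from their BIC scores.

    Uses BF ~ exp((BIC_reference - BIC_model) / 2); a lower BIC is a better fit.
    """
    return (bic_reference - bic_model) / 2.0


def evidence_category(log_bf):
    """Kass and Raftery (1995) categories, compared on the natural-log scale."""
    if log_bf < 0:
        return "favors the reference model"
    if log_bf < math.log(3):
        return "weak"
    if log_bf < math.log(20):
        return "positive"
    if log_bf < math.log(150):
        return "strong"
    return "very strong"


# Invented BIC scores for two competing models fitted to the same alignment.
bic_reversible = 100000.0
bic_nonreversible = 100012.0

log_bf = log_bf_from_bic(bic_reversible, bic_nonreversible)
print(f"ln BF (reversible vs nonreversible) = {log_bf:.1f}")
print("evidence for the reversible model:", evidence_category(log_bf))
```

With these made-up numbers, a BIC gap of only 12 units already lands in the highest evidence category, which illustrates why BIC gaps described as large or huge margins correspond to decisive support for one model class.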
Many deficiencies of sequence-based analyses have been addressed over the past two decades, primarily through improved statistical modeling (Philippe et al., 2011). However, since models are only as good as the data on which they are based, data quality is the most important aspect of building and testing statistical models. Thus, high-quality data (characters) are essential for the success of data-driven resolution of competing hypotheses. Here, data quality refers to the quality or the strength of the phylogenetic signal of homology that can be recovered from the data. The strength of the phylogenetic signal is proportional to the confidence with which unique state transitions can be determined for a given set of characters on a given tree (Harish, 2018). Ideally, historically unique character transitions that entail rare evolutionary innovations are desirable to identify patterns of uniquely shared innovations (synapomorphies) among lineages. Although improved modeling can correct errors of estimation and improve the fit of the data to the tree, it is not a solution to improve phylogenetic signal, especially when the historical signal is exceedingly limited or absent in the source data. Thus, structure-based characters like protein domains are probably the closest to a perfect character that is currently available for evolutionary biologists.
Farsightedness and nearsightedness
Contemporary species are the evolutionary successors of long-gone ancestors. About 99% of species that evolved on Earth have gone extinct (Barnosky et al., 2011) with little trace left as fossils, especially of microbial species. Therefore, reconstructing a detailed picture of the common ancestor of all extant life, the universal common ancestor (UCA), is a daunting task (Harish and Morrison, 2020). The UCA was most likely a single-celled species estimated to have lived between 3.8 and 3.5 billion years ago (BYa) (Figure 3). Otherwise, the nature of UCA is still fuzzy and rife with speculation. When it comes to peering back into the distant past, astrophysicists and evolutionary biologists are faced with similar problems in collecting reliable data and building tools to analyze and interpret the data (Ade et al., 2014; Krauss, 2014; Kurland and Harish, 2015a).
Studies to reconstruct the cosmological past initially relied on refracting telescopes. Telescopes are the fundamental tools to study the observable universe. Refracting telescopes are one of the two main types of optical telescopes, which operate by collecting light through a large lens and focusing the light on an eyepiece/camera (STScI, 2021). However, refracting telescopes suffer from a phenomenon called ‘chromatic aberration’, a common optical problem resulting from the inability of a lens to bring all wavelengths of light to a sharp focus. As a result, the reconstructed images of distant galaxies, the galactic ancestors of the contemporary universe, are fuzzy. In a telescope, the function of a good lens is to minimize such optical aberrations as much as possible to produce an unblurred and high-fidelity view. The deficiencies of refracting telescopes were overcome by reflecting telescopes, wherein the lens was replaced with a mirror to collect light and focus better for a clearer picture. The Hubble and Webb space telescopes are prominent reflecting telescopes that can collect high-quality data of the most distant ‘galactic fossils’ of the cosmological universe.
The observable cosmic universe converges into a singularity known as the cosmological light horizon, which represents the beginnings of the universe and the boundary between the observable and unobservable universe (Figure 3a). In reconstructing the biological past, the UCA represents a singularity, a phylogenetic event horizon (Figure 3b), which is the root node of the universal tree of life (Harish et al., 2013). Among the tools used to peer back into the galaxies of the cellular universe, protein structure–based evolutionary telescopes are like the reflecting telescopes with minimal distortions (Harish and Kurland, 2017a; Harish, 2018). Therefore, structure-based telescopes produce a well-resolved and high-fidelity picture of the distant biological past (Figure 3b). In contrast, their predecessors, the sequence-based telescopes, produce a poorly resolved and low-fidelity picture of the past, which makes the identification of the UCA and its immediate descendants unreliable (Figure 3c). It is worth noting that the UCA is the ‘most recent’ UCA or the ‘last’ UCA of a lineage of cellular species since the origins of cells; evolutionary telescopes cannot peer into pre-UCA or pre-cellular epochs of evolution (Kurland and Harish, 2015a). However, plausible models of pre-UCA and pre-cellular evolution can be developed based on our knowledge of the biophysical and biochemical constraints that govern protein folding and protein evolution, independently of the use of evolutionary telescopes (Abroi and Gough, 2011; Norden, 2021; Kocher and Dill, 2023).
Resolving the deeper nodes of the universal tree of life (hereafter universal tree) in general, and the root node in particular, using sequence telescopes is fraught with distortions similar to the chromatic aberrations of refracting telescopes (Harish, 2018). This is because (1) the rates of substitution mutations in sequences show extreme variations and (2) the historical signal decays significantly with time due to repeated substitutions that overwrite the evolutionary record (Harish and Kurland, 2017a; Harish and Morrison, 2020). The decay of historical signal increases spurious signals and decreases the reliability of analyses. While distortions due to rate variations can be corrected with mathematical models, decay and loss of signal cannot be compensated for (Harish, 2018). Thus, distortions of evolutionary signal and high uncertainty in identifying the UCA are common with comparative analysis of primary sequence data (Harish, 2018; Harish and Morrison, 2020; Williams et al., 2020; Liu et al., 2021).
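The effect of repeated substitutions overwriting the record can be illustrated with a deliberately simple equal-rates substitution model (Jukes-Cantor for four nucleotides, and its 20-state analogue for amino acids). This model is an assumption chosen for illustration and is not the model used in the analyses discussed here. Under it, the expected fraction of sites that differ between two sequences separated by d substitutions per site is p(d) = (1 − 1/k)(1 − exp(−d / (1 − 1/k))), where k is the number of states.

```python
import math


def expected_p_distance(d, n_states):
    """Expected fraction of differing sites after d substitutions per site.

    Assumes an equal-rates model over n_states (Jukes-Cantor when n_states=4).
    """
    ceiling = 1.0 - 1.0 / n_states  # value approached at full saturation
    return ceiling * (1.0 - math.exp(-d / ceiling))


for d in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    nucleotides = expected_p_distance(d, 4)
    amino_acids = expected_p_distance(d, 20)
    print(f"d={d:5.1f}  nucleotides={nucleotides:.4f}  amino acids={amino_acids:.4f}")
```

The observed difference flattens toward a ceiling (0.75 for nucleotides, 0.95 for amino acids) as d grows, so very different amounts of divergence become nearly indistinguishable in the data. A model can rescale the distances it observes, but it cannot restore the information lost once sites are saturated.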
Speculative descriptions and theoretical predictions about the nature and origin of the UCA are abundant, including the ‘panspermia’ hypothesis, which presumes that terrestrial life originated in outer space (Crick and Orgel, 1973). Regardless of an extraterrestrial or terrestrial origin, and whether life arose only once or multiple times, identifying the UCA using a data-driven and rigorous phylogenetic analysis boils down to determining the root of the universal phylogenetic tree (Theobald, 2010; Harish and Morrison, 2020). Determining the root node, which is the deepest node of the universal tree, is one of the hardest problems in phylogenetic analysis, and thus far rooting using sequence characters has either (a) generally not been possible (Pace, 1997; Woese, 1987) or (b) been ambiguous at best (Philippe and Forterre, 1999; Morrison, 2006; Harish et al., 2013, 2016; Gouy et al., 2015; Harish and Morrison, 2020). Hence, in practice, rooting the universal tree relies on assuming a false root (Woese, 1987; Pace, 1997; Spang et al., 2015; Liu et al., 2021) that is based on unverified suppositions that align with the traditional and falsifiable hypothesis that prokaryotes evolved before eukaryotes.
Nothing in the universal tree makes sense except in the light of the universal ancestor
During the last two decades, the universal tree has routinely been constructed as a composite tree of 30–50 phylogenetic ‘marker’ proteins (Ciccarelli et al., 2006; Spang et al., 2015; Liu et al., 2021) rather than from a single marker gene: the SSU rRNA gene (Woese, 1987). Marker protein datasets are either solely or predominantly composed of ribosomal proteins (Harish, 2018). Standard sequence-based methods trace the history of substitution mutations using time-reversible substitution models, which are devised for computational convenience rather than to represent biological reality (Harish and Kurland, 2017b; Harish, 2018). Time-reversible models can only produce ‘unrooted trees’ and hence lack the ability to identify the root node, which in the case of the universal tree represents the UCA. This inherent deficiency of the sequence telescopes was noted early on (Woese, 1987). Thus, an unrooted universal tree (Figure 3c) is not only poorly resolved but is also an incomplete depiction of evolution. In addition, since unrooted trees do not present a clear evolutionary interpretation, they are prone to misreading and thus potentially misleading (Morrison, 2006; Harish, 2018; Harish and Morrison, 2020). Because of the total absence of a root node, the deficiency of standard sequence telescopes is, in fact, worse than the chromatic aberration of refracting telescopes.
In general, a ‘rooted tree’ is a straightforward depiction of the principle of common ancestry with a clear branching order along a time axis of ancestor–descendant polarity (Figure 3b). In contrast, an unrooted tree is undirected with no particular direction for evolutionary time and thus with undefinable branching order (Figure 3c). This distinction between an unrooted and a rooted tree is of prime importance as most conclusions from phylogenetic analyses strictly depend on a rooted tree. The primary conclusions of significance include determining (a) ancestors and descendants, (b) branching order (i.e., tree topology), (c) evolutionary groups (clades), (d) degree of relatedness among clades, and (e) ancestral states of characters under study. Hence, an unrooted tree is not an evolutionary tree (phylogeny) in its true sense, even though it depicts relatedness among the organisms (Harish, Reference Harish2018; Sánchez-Pacheco et al., Reference Sánchez-Pacheco, Kong, Pulido-Santacruz, Murphy and Kubatko2020).
Evidently, the importance of identifying the UCA cannot be emphasized enough (Gouy et al., Reference Gouy, Baurain and Philippe2015; Harish and Kurland, Reference Harish and Kurland2017b; Harish, Reference Harish2018; Harish and Morrison, Reference Harish and Morrison2020). Yet, rooting is relegated to a secondary task and is often trivialized. As commonly used phylogenetic routines cannot identify the root, ‘pseudo rooting’ (see Box 1) based on external information and/or unverified conjectures that are independent of the data used to produce unrooted trees becomes necessary.
Rooting trees in general is a difficult problem, conceptually and technically (Harish and Kurland, Reference Harish and Kurland2017b). The distinction between unrooted and rooted trees is nontrivial as shown in Box–Figure 1. The unrooted tree (Box–Figure 1; center) shows four groups of animal species A-H, of which A-F are vertebrates (species with a vertebral column) and G-H are invertebrates (species without a vertebral column). Species A-D are terrestrial (with lungs), while species E-H are marine (without lungs). The internal nodes represent common ancestors: the common ancestors of contemporary species, as well as their common ancestors (gray circles). However, the overall common ancestor of all species (black circle) is unknown and unidentifiable in the unrooted tree.
Standard time-reversible models of evolution produce ‘unrooted trees’, in which (a) the position of the overall common ancestor cannot be identified and (b) no particular direction for evolutionary time is implied. Unrooted trees are not only incomplete depictions of the hierarchy of descent but can potentially misrepresent the evolutionary kinships (Box–Figure 1; center). To complete the picture, an outgroup is usually chosen to assign a ‘pseudo root’. The addition of the root node introduces a branching order by rearranging the tree around the root node (Box–Figure 1; left and right). Choice of an outgroup is based on assumptions about features (i.e., characters). For example, presence or absence of lungs/vertebrae is assumed to be the ancestral state. Since an artificial root node representing the overall common ancestor is introduced after the fact, it is designated as a ‘pseudo-root’. Hence unrooted/undirected trees are not true evolutionary trees.
In the example shown above, choosing an invertebrate outgroup implies that the absence of vertebral column is the ancestral state. Choosing the invertebrate outgroup results in a clade wherein all the vertebrates (A-F) are grouped together. Similarly, a marine outgroup implies that the absence of lungs is the ancestral state. The choice of the marine outgroup results in a clade in which some of the vertebrates (E, F) as well as invertebrates (G, H) are grouped together. However, fossil evidence confirms that the invertebrate outgroup assumption is correct as far as the Animal tree is concerned (Donoghue and Purnell, Reference Donoghue and Purnell2009). Regardless of the fossil evidence, grouping together vertebrates and invertebrates in one clade is rather odd. Likewise, grouping some akaryotes (archaea) and all eukaryotes together is odd too. Thus, this rooting exercise shows that if assumptions about outgroups are wrong, as with the ‘marine outgroup’, the results can be blatantly wrong. Besides, rooting with an outgroup is merely a tree drawing option; the mathematically estimated tree remains unrooted as a mathematical entity. Therefore, though widely used, the outgroup rooting method is able to introduce only a false root, because a non-existent root node is artificially introduced to the unrooted tree.
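The effect of the outgroup choice described in this Box can be reproduced with a few lines of code. The sketch below is illustrative only: the edge list is a hypothetical topology chosen to be consistent with the description of species A-H (the internal node labels n1 to n6 and the function names are assumptions, not part of Box–Figure 1). Placing the root on two different branches of the same unrooted tree yields two different sets of clades.

```python
# Hypothetical unrooted tree for species A-H; n1..n6 are unnamed internal nodes.
EDGES = [
    ("A", "n1"), ("B", "n1"), ("C", "n2"), ("D", "n2"),
    ("n1", "n3"), ("n2", "n3"), ("n3", "n4"),
    ("n4", "n5"), ("n4", "n6"),
    ("E", "n5"), ("F", "n5"), ("G", "n6"), ("H", "n6"),
]

def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def leaves_below(adj, node, parent, clades):
    """Return the set of leaves reached from `node` when moving away from
    `parent`, recording the leaf set of every internal node as a clade."""
    children = adj[node] - {parent}
    if not children:  # a leaf (species)
        return frozenset([node])
    leafset = frozenset().union(
        *(leaves_below(adj, child, node, clades) for child in children))
    clades.add(leafset)
    return leafset

def clades_rooted_on(edges, root_edge):
    """Clades implied by placing the root on the branch `root_edge`."""
    adj = adjacency(edges)
    u, v = root_edge
    clades = set()
    leaves_below(adj, u, v, clades)
    leaves_below(adj, v, u, clades)
    return clades

if __name__ == "__main__":
    vertebrates = frozenset("ABCDEF")
    marine = frozenset("EFGH")
    invertebrate_rooting = clades_rooted_on(EDGES, ("n4", "n6"))
    marine_rooting = clades_rooted_on(EDGES, ("n3", "n4"))
    print("invertebrate outgroup: vertebrates form a clade:",
          vertebrates in invertebrate_rooting)
    print("marine outgroup: vertebrates form a clade:",
          vertebrates in marine_rooting)
    print("marine outgroup: E-H form a clade:", marine in marine_rooting)
```

The unrooted topology is identical in both cases; only the assumed root position changes which groups count as clades.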
As for the Universal tree, in the absence of both organismal outgroups and reliable geological fossils, protein domains are perhaps the best characters available at present (Harish, Reference Harish2018). Locating the root node (UCA) with a directional character evolution model is the most straightforward approach to infer evolutionary trees (Harish and Kurland, Reference Harish and Kurland2017a, Reference Harish and Kurland2017b). Accordingly, empirical evidence in favor of the descent of eukaryotes and akaryotes from UCA supports Eukarya and Akarya as the primary clades of life (see Figure 4 and associated discussion).
Pseudo-rooting converts undirected trees to directed trees so that evolutionarily meaningful conclusions can be drawn (Woese, Reference Woese1987; Harish, Reference Harish2018). Pseudo-rooting of the universal tree, based on external information (e.g., outgroups) or conjectures about the UCA, is the common practice (Woese et al., Reference Woese, Kandler and Wheelis1990; Spang et al., Reference Spang, Saw, Jørgensen, Zaremba-Niedzwiedzka, Martijn, Lind, van Eijk, Schleper, Guy and Ettema2015; Imachi et al., Reference Imachi, Nobu, Nakahara, Morono, Ogawara, Takaki, Takano, Uematsu, Ikuta, Ito, Matsui, Miyazaki, Murata, Saito, Sakai, Song, Tasumi, Yamanaka, Yamaguchi, Kamagata, Tamaki and Takai2020; Williams et al., Reference Williams, Cox, Foster, Szöllősi and Embley2020; Liu et al., Reference Liu, Makarova, Huang, Wolf, Nikolskaya, Zhang, Cai, Zhang, Xu and Luo2021). Though widely accepted, the assumptions underpinning pseudo-rootings of the universal tree were rarely tested until recently (Harish and Kurland, Reference Harish and Kurland2017a; Harish, Reference Harish2018). These studies provided, to my knowledge, the first formal test of the widely accepted conjectures about UCA and eukaryogenesis, as well as the first empirical evidence that supports the independent evolution of eukaryotes and akaryotes (archaea and bacteria) and rejects both the popular endosymbiotic origin of eukaryotes and the origin of key eukaryotic features within archaea and bacteria.
The difficulty of locating the root node in sequence-based analysis and the importance of a statistically robust root inference were recently highlighted in efforts to trace the origin of SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) and the coronavirus disease 2019 (COVID-19) pandemic (Morel et al., Reference Morel, Barbera, Czech, Bettisworth, Hübner, Lutteropp, Serdari, Kostaki, Mamais and Kozlov2021; Pipes et al., Reference Pipes, Wang, Huelsenbeck and Nielsen2021). One study claimed to have identified the ancestors of the human SARS-CoV-2 lineages (Forster et al., Reference Forster, Forster, Renfrew and Forster2020) by including the bat coronavirus as an outgroup to root a median joining network (MJN). Rather than tracing the history of substitution mutations, MJNs estimate genetic distances, an approach that is unsuitable for tracing the history of mutations and for reconstructing ancestral sequence states (Sánchez-Pacheco et al., Reference Sánchez-Pacheco, Kong, Pulido-Santacruz, Murphy and Kubatko2020).
In contrast, several other studies employed suitable substitution models and more rigorous statistical phylogenetic methods to evaluate multiple rootings (Morel et al., Reference Morel, Barbera, Czech, Bettisworth, Hübner, Lutteropp, Serdari, Kostaki, Mamais and Kozlov2021; Pipes et al., Reference Pipes, Wang, Huelsenbeck and Nielsen2021). The SARS-CoV-2 tree was rooted using (1) nonreversible substitution models, (2) molecular clock models, and (3) (pseudo)rooting by the outgroup criterion with multiple outgroups. Yet, an unambiguous and statistically robust rooting was not possible using the best available methods of primary sequence analysis, in spite of the availability of massive whole-genome datasets. The difficulty and unreliability of rooting the SARS-CoV-2 tree were due to a rapid loss of evolutionary signal (Morel et al., Reference Morel, Barbera, Czech, Bettisworth, Hübner, Lutteropp, Serdari, Kostaki, Mamais and Kozlov2021; Pipes et al., Reference Pipes, Wang, Huelsenbeck and Nielsen2021). The shortcomings of unrooted MJNs and potential misinterpretations were pointed out in a sharp response as ‘Median-joining network analysis of SARS-CoV-2 genomes is neither phylogenetic nor evolutionary’ (Sánchez-Pacheco et al., Reference Sánchez-Pacheco, Kong, Pulido-Santacruz, Murphy and Kubatko2020).
Poor resolution due to a lack of historical signal of homology along with misinterpretation of unrooted trees can lead to profoundly misleading conclusions in many other evolutionary studies as well (Baum et al., Reference Baum, Smith and Donovan2005; Harish, Reference Harish2018). Recently, it was shown that significant loss of historical signal in standard sequence data is the basis of the problems and persistent ambiguities in resolving the deeper nodes of the universal tree (Harish, Reference Harish2018). Thanks to structure telescopes, the deficiencies of sequence telescopes can now be overcome so that a well-supported and well-resolved universal tree can be reconstructed. The advantages of embracing structure-based characters for studying evolution are manifold (see next section). However, the routine use of sequence characters with weak evolutionary signal goes hand-in-hand with the standard practice of pseudo-rooting (Lake, Reference Lake1986; Woese et al., Reference Woese, Kandler and Wheelis1990; Liu et al., Reference Liu, Makarova, Huang, Wolf, Nikolskaya, Zhang, Cai, Zhang, Xu and Luo2021).
Pseudo-rootings are routinely used to assert that (a) archaea are the closest relatives of eukaryotes (Woese et al., Reference Woese, Kandler and Wheelis1990), (b) eukaryotes evolved from a specific lineage of archaea (Spang et al., Reference Spang, Saw, Jørgensen, Zaremba-Niedzwiedzka, Martijn, Lind, van Eijk, Schleper, Guy and Ettema2015; Imachi et al., Reference Imachi, Nobu, Nakahara, Morono, Ogawara, Takaki, Takano, Uematsu, Ikuta, Ito, Matsui, Miyazaki, Murata, Saito, Sakai, Song, Tasumi, Yamanaka, Yamaguchi, Kamagata, Tamaki and Takai2020), (c) archaea are intermediates on the evolutionary path to eukaryotes from bacteria (Imachi et al., Reference Imachi, Nobu, Nakahara, Morono, Ogawara, Takaki, Takano, Uematsu, Ikuta, Ito, Matsui, Miyazaki, Murata, Saito, Sakai, Song, Tasumi, Yamanaka, Yamaguchi, Kamagata, Tamaki and Takai2020), (d) eukaryote protein homologs are of archaeal origin (Cotton and McInerney, Reference Cotton and McInerney2010), and (e) eukaryote protein homologs are of bacterial origin (Karlberg et al., Reference Karlberg, Canbäck, Kurland and Andersson2000; Martijn et al., Reference Martijn, Vosseberg, Guy, Offre and Ettema2018). However, pseudo-rootings based on pseudo-outgroups or unverified assumptions are error-prone and unreliable (Gouy et al., Reference Gouy, Baurain and Philippe2015; Harish, Reference Harish2018). Thus, the standard practice of using time-reversible evolution models and pseudo-rootings, solely for interpreting the results of an unrooted universal tree, is prone to faulty conclusions and can be misleading. Indeed, unsupported false-rootings along with poorly resolved trees tend to foster common misconceptions about evolution (Harish et al., Reference Harish, Abroi, Gough and Kurland2016; Harish and Kurland, Reference Harish and Kurland2017b; Sánchez-Pacheco et al., Reference Sánchez-Pacheco, Kong, Pulido-Santacruz, Murphy and Kubatko2020).
Misreading of even well-resolved, rooted trees is surprisingly common (Baum et al., Reference Baum, Smith and Donovan2005). For instance, the universal tree in which eukaryotes and akaryotes descend and diverge from the UCA (Figure 3b) is often misinterpreted as a ‘eukaryotes first’ scenario or an ‘upside down’ tree of life, since it contradicts the common false rootings. Such rootings (a) conflate UCA with other common ancestors (Figure 3c) and (b) are usually based on the notion that archaea and bacteria are primitive and thus assumed to be ancestors of eukaryotes in a prokaryote-to-eukaryote progression (Harish and Kurland, Reference Harish and Kurland2017a; Harish and Morrison, Reference Harish and Morrison2020). Rather, the straightforward conclusion is that eukaryotes and akaryotes are sister clades and that the closest relative of the eukaryote common ancestor as well as the akaryote common ancestor is UCA (Figure 3b).
Roots of stability: The diminishing relevance of the three domains classification system
The Linnean system of organizing species into nested hierarchies, Systema Naturae (Linnaeus, Reference Linnaeus1758), first published in 1735, was developed a century before Darwin’s oft-cited vision ‘Our classifications will come to be, as far as they can be so made, genealogies’ (Darwin, Reference Darwin1859). The term ‘phylogeny’ was coined when one of the first genealogical trees of life was depicted (Haeckel, Reference Haeckel1866), inspired by the principle that a ‘natural system’ and a true classification should be represented as an evolutionary tree (Darwin, Reference Darwin1859). Yet, some of the prominent genealogical trees (Figure 4a–c) are not Darwinian trees. In Darwinian phylogenetic trees, contemporary species are at the leaves (terminal nodes) and extinct ancestors at the internal nodes. Many notable hypotheses of phylogenetic progression assume, explicitly or implicitly, that some extant species groups are primitive (Figure 4a–c), much like the popular depictions of evolution as a linear progression from simple to complex forms. For instance, unicellular species with akaryotic cell organization were assumed to be primitive and placed near a ‘virtual root’ of the tree (Figure 4a; Haeckel, Reference Haeckel1866), (Figure 4b; Whittaker, Reference Whittaker1969), and (Figure 4c; Woese, Reference Woese1987). However, the venerable ancestor, UCA in this case, was neither a distinct entity nor an empirically derived node on the phylogeny. Hence, neither the three-kingdom system (Figure 4c; Woese, Reference Woese1987) nor its predecessor, the five-kingdom system (Figure 4b; Whittaker, Reference Whittaker1969), is truly phylogenetic.
The poor resolution of archaea due to unreliable phylogenetic signal in routinely used ‘marker sequences’ is often seen as non-monophyly of archaea (Figure 4d,e). Regardless of rooting, poor resolution of archaea further confounds phylogenetic classification (Figure 4i–k). If we are to accept a poorly resolved universal tree, then some of the possibilities depending on different rootings are:
• Eukarya ceases to exist as an exclusive clade and a taxonomic domain (Figure 4i).
• Bacteria ceases to exist as an exclusive clade and a taxonomic domain (Figure 4j).
• All of eukarya, bacteria and archaea cease to exist as exclusive clades or domains of life (Figure 4k).
Arguably, resolving the root node of the universal tree is one of the hardest problems in evolutionary biology given the time depth (Harish, Reference Harish2018). Fortunately, structural characters cut the Gordian knot: they facilitate an empirical resolution of the rooting problem, as well as diagnosis of the monophyly of the major species groups, by allowing the assessment of empirical evidence in favor of the different suppositions and tentative hypotheses (Figure 4l). BFs provide a means to evaluate the strength of evidence in favor of the best hypothesis among a set of competing proposals (Harish and Kurland, Reference Harish and Kurland2017a; Harish, Reference Harish2018). BF is the ratio of the likelihoods of the different hypotheses being compared. A BF of 5 or greater is considered very strong empirical evidence in favor of the hypothesis with the better likelihood (Harish, Reference Harish2018). Therefore, a BF of 335 means extremely strong empirical evidence for the two-domain universal tree (Figure 4h) and for primary clades Eukarya and Akarya (Bacteria and Archaea are sister clades), compared to other competing hypotheses (Harish, Reference Harish2018). Likelihood scores are the log odds of the hypotheses; thus, the Eukarya-Akarya two-domain phylogeny is at least 10^145 times more probable than the closest competing phylogeny (Figure 4f), which is the three-domain phylogeny (Eukarya and Archaea are sister clades). The alternative two-domain proposal is improbable and least supported. Put simply, the Eukarya-Akarya two-domain phylogeny is most likely to be correct.
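The step from a BF of 335 to odds of the order of 10^145 is a change of logarithm base. The short sketch below makes that arithmetic explicit; it assumes that the reported BF is expressed in natural-log units (a difference of log likelihoods), which is how the statement that likelihood scores are log odds is read here, and the function name is illustrative.

```python
import math

def log_bf_to_log10_odds(log_bf):
    """Convert a Bayes factor given in natural-log units into the
    exponent of the equivalent odds expressed as a power of ten."""
    return log_bf / math.log(10)

if __name__ == "__main__":
    exponent = log_bf_to_log10_odds(335)  # reported value, assumed to be ln(BF)
    print(f"odds are about 10^{exponent:.1f}")
    # The same conversion for the 'very strong' threshold of 5 log units
    print(f"threshold of 5 corresponds to odds of about {math.exp(5):.0f} to 1")
```

Under this reading, the exponent falls between 145 and 146, which agrees with the statement that the favored phylogeny is at least 10^145 times more probable than its closest competitor.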
The universal tree is primarily a phylogenetic classification. The basic requirements of phylogenetic classification are:
• Monophyly: Only monophyletic groups (or clades) are true evolutionary groups. That is, groups comprising all the descendants of a given common ancestor should be identified.
• Homology: Delineating clades is based on diagnosing patterns of descent of characters that evolved in the common ancestors and were inherited by the descendants (i.e., homologous characters). Nested configurations of sharing of such homologous characters in different species are used to group them in an order of common descent (i.e., branching order), which then diagnose the degree of relatedness among the clades.
This seemingly straightforward procedure of character analysis and the algorithmic logic to diagnose clades was developed so that phylogenetic classification is an objective exercise determined by the branching order (Hennig, Reference Hennig1965).
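The monophyly requirement can be stated as a simple test on a rooted tree: a group is monophyletic only if some node of the tree has exactly that group as its set of descendant leaves. The sketch below is a minimal illustration under stated assumptions; the nested-tuple encoding and the function names are hypothetical, and the two example trees are reduced, three-leaf caricatures of the competing hypotheses discussed above rather than estimated phylogenies.

```python
def clades(tree):
    """Return the leaf sets of all subtrees of a rooted tree given as
    nested tuples, with species or group names as strings at the leaves."""
    found = set()

    def walk(node):
        if isinstance(node, str):  # a leaf
            leaves = frozenset([node])
        else:  # an internal node: union of the leaves of its children
            leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found

def is_monophyletic(tree, group):
    """True if `group` is exactly the set of leaves descending from one node."""
    return frozenset(group) in clades(tree)

if __name__ == "__main__":
    # Reduced caricatures of two competing rooted hypotheses (assumed encoding)
    eukarya_akarya = (("Archaea", "Bacteria"), "Eukarya")
    three_domain = (("Archaea", "Eukarya"), "Bacteria")
    akarya = {"Archaea", "Bacteria"}
    print("Akarya monophyletic, Eukarya-Akarya tree:",
          is_monophyletic(eukarya_akarya, akarya))
    print("Akarya monophyletic, three-domain tree:",
          is_monophyletic(three_domain, akarya))
```

The test is only meaningful on a rooted tree, which is why the choice of root determines which groups qualify as clades.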
However, the common practice of pseudo-rooting is essentially based on unverified conjectures and unsupported assumptions, which not only encourages a subjective interpretation of the universal tree but also fosters the continued overlooking of evidence against popular conjectures (Kurland and Harish, Reference Kurland and Harish2015b; Harish and Kurland, Reference Harish and Kurland2017c). Though appealing and widely accepted, these assumptions were rarely tested until recently (Harish and Kurland, Reference Harish and Kurland2017a, Reference Harish and Kurland2017c; Harish, Reference Harish2018), largely because it is not feasible to test the veracity of such assumptions with sequence characters and standard time-reversible models (Harish and Kurland, Reference Harish and Kurland2017a). Another likely reason is that pseudo-rooting aligns with a common traditional assumption: that simple is primitive (Harish et al., Reference Harish, Abroi, Gough and Kurland2016; Harish and Kurland, Reference Harish and Kurland2017b). This assumption has been pervasive since the time of the earliest efforts to reconstruct a genealogical tree of life (Figure 4a–c). The current practice of pseudo-rooting the universal tree (Zaremba-Niedzwiedzka et al., Reference Zaremba-Niedzwiedzka, Caceres, Saw, Bäckström, Juzokaite, Vancaester, Seitz, Anantharaman, Starnawski, Kjeldsen, Stott, Nunoura, Banfield, Schramm, Baker, Spang and Ettema2017; Liu et al., Reference Liu, Makarova, Huang, Wolf, Nikolskaya, Zhang, Cai, Zhang, Xu and Luo2021) goes back to the initial efforts of classification of life using molecular characters (Woese et al., Reference Woese, Kandler and Wheelis1990). Thus, conclusions based on widely assumed pseudo-rooting are compromised both by a lack of empirical evidence and by their reliance on poor quality data (characters) (Harish and Kurland, Reference Harish and Kurland2017c; Harish, Reference Harish2018; Harish and Morrison, Reference Harish and Morrison2020).
Notably, the commonly acknowledged but rarely tested notions that eukaryotes evolved from within archaea (Spang et al., Reference Spang, Saw, Jørgensen, Zaremba-Niedzwiedzka, Martijn, Lind, van Eijk, Schleper, Guy and Ettema2015; Zaremba-Niedzwiedzka et al., Reference Zaremba-Niedzwiedzka, Caceres, Saw, Bäckström, Juzokaite, Vancaester, Seitz, Anantharaman, Starnawski, Kjeldsen, Stott, Nunoura, Banfield, Schramm, Baker, Spang and Ettema2017; Imachi et al., Reference Imachi, Nobu, Nakahara, Morono, Ogawara, Takaki, Takano, Uematsu, Ikuta, Ito, Matsui, Miyazaki, Murata, Saito, Sakai, Song, Tasumi, Yamanaka, Yamaguchi, Kamagata, Tamaki and Takai2020; Williams et al., Reference Williams, Cox, Foster, Szöllősi and Embley2020) and that eukaryotes evolved from a merger of archaea and bacteria (Karlberg et al., Reference Karlberg, Canbäck, Kurland and Andersson2000; Cotton and McInerney, Reference Cotton and McInerney2010; Martijn et al., Reference Martijn, Vosseberg, Guy, Offre and Ettema2018) are most seriously compromised because (a) there is strong evidence for the monophyly of archaea and akaryotes and (b) the widely accepted false-rootings and their underlying assumptions lack support (Harish and Kurland, Reference Harish and Kurland2017c; Harish, Reference Harish2018; Harish and Morrison, Reference Harish and Morrison2020).
That said, it is worth emphasizing that given the hierarchy of descent, contextualizing the recent divergences is conditional on the resolution of the deeper divergences. In addition, character evolution models for protein domains and their component amino acids describe mutually exclusive processes that account for hierarchically different evolutionary timescales (Harish, Reference Harish2018). Nevertheless, support for recent divergences can be improved in several ways: (1) employing an expanded character set defined by the updated domain descriptions, since the number of domain descriptions has tripled to 5,400 at present from the ≈1,800 domains on which previous studies were based (Harish and Kurland, Reference Harish and Kurland2017b; Harish, Reference Harish2018); (2) combining sequence and structural characters; and (3) using a multi-phasic approach that resolves different parts of the universal tree independently with either structure-based or sequence-based approaches, depending on the questions addressed (Harish, Reference Harish2018).
Summary and implications
Perhaps, portraying the pros and cons of the sequence-based and structure-based reconstruction of the universal tree as a ‘battle of characters’ would make for an entertaining tale. However, both types of molecular features are complementary and are valuable for resolving different parts of the universal tree. That is, by melding together structure telescopes and sequence telescopes, both farsightedness and nearsightedness of evolutionary telescopes can be corrected. During the last two decades, neither increasing the sophistication of the substitution models nor aggregating more sequences has been productive in (a) reliably resolving contentious evolutionary relationships, (b) accurately determining the temporal order of key evolutionary innovations, and (c) describing the exceptional sister group differences of the major species groups. Embracing the well-defined structure-based characters will certainly prove to be beneficial.
Structure telescopes provide a straightforward and objective means of identifying the UCA and determining the major clades and key evolutionary transitions across the tree of life. Protein structural domains define unique molecular phenotypes, which are robust evolutionary characters that improve the level of confidence in resolving the deepest divergences of the universal tree. It is abundantly evident that, in addition to a richer representation of cellular and molecular phenotypes, protein domains offer a deeper perspective of the evolutionary history. Thus, they provide a better means of (a) describing the key innovations and evolutionary transitions across the tree of life, (b) objectively evaluating competing hypotheses for the evolution of cellular life, and (c) constructing a phylogenetic classification that is an accurate representation of the two basic types of cell organization.
The universal phylogeny wherein eukaryotes and akaryotes are sister clades is by far the best empirically supported universal tree of life by any measure, qualitative and quantitative. Qualitative measures include both the tree-independent assessment of character homology and the tree-based assessment of common ancestry, that is, character homology and homology of clades. Quantitative measures include robust statistical support for (a) fit of character evolution model to a rooted tree, (b) the branching order, starting with the rooting, and (c) higher measures of confidence to reject alternative universal tree proposals. Hence, structure telescopes are better suited than sequence telescopes to look further back into the biological past. The serious limitation of sequence characters, especially with regard to the assessment of qualitative evidence, can be overcome with structure characters.
Outlook
The durability of Linnean classification rests on the choice of excellent ‘diagnostic features’ or characters used to group species into genera, families through Kingdoms. Had high-quality molecular characters such as protein domains been available early on, perhaps there would never have been a third domain of life. Hindsight is 20/20, after all. The Linnean hierarchical classification implicitly reflected common descent of the species thus classified and ultimately converged into an Empire, Imperium Naturae. Although the taxonomic grade Domain/Empire is warranted, in light of the new, stable rooting and well-supported branching order of the universal tree, grades for Archaea and Bacteria should be revised to Kingdoms, whether or not their respective initial nomenclatures Archaebacteria and Eubacteria should take precedence. Eukarya and Akarya are the primary domains of life as well as the principal taxonomic ranks, both terms being descriptive of the two basic cell types.
Looking forward, as genome sequences and protein structures continue to accumulate, future efforts for a better resolved universal tree could employ a variety of new molecular features. In addition to primary sequences and known protein domains, many newer evolutionary characters can be identified by determining (a) novel protein domains for which structures are unknown and (b) new types of homologous features from quaternary assemblies of the protein complexes, among others. Tools like AlphaFold and RoseTTaFold seem to be primed for such undertakings. In this way, the evolution of morphological phenotypes as well as physiological phenotypes at the cellular level can be reconstructed in greater detail than what is possible at present.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/qrd.2024.4.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/qrd.2024.4.
Acknowledgements
I am grateful to Måns Ehrenberg and David Morrison for discussions and comments on an earlier version of the manuscript and for the continued support. I also thank the editor and reviewers for their comments, which improved the presentation. An earlier version of this manuscript, written on the occasion of the Golden Jubilee of PDB, can be found on OSF Preprints (https://doi.org/10.31219/osf.io/hyknj).
Image credits: Images available under public licenses for adaptation and sharing are acknowledged below. All other images were produced by the author for this article.
Figure 1: (a) Molecule of the month, illustration by David S. Goodsell, RCSB Protein Data Bank. (b, LHS) Eukaryotic cellular landscape, illustration by Evan Ingersoll and Gaël McGill, PhD (Digizyme Inc.) using Molecular Maya software. Created for Cell Signaling Technology, Inc., and inspired by the stunning art of David Goodsell, this 3D rendering of a eukaryotic cell is modeled using X-ray, nuclear magnetic resonance (NMR), and cryoelectron microscopy datasets for all of its molecular actors. (b, RHS) Escherichia coli bacterium, 2021, illustration by David S. Goodsell, RCSB Protein Data Bank. doi: 10.2210/rcsb_pdb/goodsell-gallery-028. (c) Created with BioRender.com.
Figure 2: (a, RHS) Volkov Vladislav Petrovich, Wikimedia Commons, CC BY-SA 4.0. (b, RHS) Danny Cicchetti, Wikimedia Commons, CC BY-SA 4.0.
Figure 3: (a) First Stars: Timeline of the Universe, Space Telescope Science Institute (STScI).
Figure 4: (a) Tree of life by Ernst Haeckel, Wikipedia.org.
Box–Figure 1: Silhouette images of animals, by PhyloPic database (http://phylopic.org/).
Financial support
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.