A computational reconstruction of Papio phylogeny using Alu insertion polymorphisms
Since the completion of the human genome project, the diversity of genome sequencing data produced for non-human primates has increased exponentially. Papio baboons are well-established biological models for studying human biology and evolution. Despite substantial interest in the evolution of Papio, the systematics of these species has been widely debated, and the evolutionary history of Papio diversity is not fully understood. Alu elements are primate-specific transposable elements with a well-documented mutation/insertion mechanism and the capacity for resolving controversial phylogenetic relationships. In this study, we conducted a whole genome analysis of Alu insertion polymorphisms unique to the Papio lineage. To complete these analyses, we created a computational algorithm to identify novel Alu insertions in next-generation sequencing data.
We identified 187,379 Alu insertions present in the Papio lineage, yet absent from M. mulatta [Mmul8.0.1]. These elements were characterized using genomic data sequenced from a panel of twelve Papio baboons: two from each of the six extant Papio species. These data were used to construct a whole genome Alu-based phylogeny of Papio baboons. The resulting cladogram fully resolved relationships within Papio.
These data represent the most comprehensive Alu-based phylogenetic reconstruction reported to date. In addition, this study produces the first fully resolved Alu-based phylogeny of Papio baboons.
Keywords: Alu, Retrotransposon, Phylogeny, Primates, Taxonomy, Evolutionary genetics, Papio, Hybridization
LINE: Long interspersed element
LTR: Long terminal repeat
SINE: Short interspersed element
TPRT: Target-primed reverse transcription
WGS: Whole genome sequencing
The burgeoning diversity and availability of whole genome sequencing (WGS) data offers intriguing possibilities for the field of comparative primate genomics. Currently, WGS data are publicly available for over 100 primate species (NCBI Resource Coordinators 2016). Traditionally, significant interest in the genetics of non-human primates stems from their sustained role as popular research models for studying human biology and evolution [1, 2, 3, 4, 5]. One such primate—well established as a model for human genetics and disease susceptibility—is the Papio baboon [6, 7, 8, 9, 10, 11]. In addition to close genetic relatedness, the temporal and ecological landscape of early Papio evolution bears striking resemblance to that of early hominins [2, 12, 13, 14]. Both include ancient episodes of admixture, as well as migration out of Africa into the Arabian Peninsula during the Pleistocene [2, 15, 16, 17, 18, 19, 20]. Appropriately, Papio baboons represent an intriguing model for human evolution.
Papio baboons occupy the largest geographical distribution of any non-human primate genus on the African continent [21, 22, 23]. These ground-dwelling Old World monkeys inhabit most of sub-Saharan Africa, to the exclusion of the tropical rainforests of West Africa and the Congo Basin, and also extend into the south-western region of the Arabian Peninsula [24, 25]. Papio systematics have been extensively studied over the past 60 years, with much debate as to which forms warrant species status. The disagreement is in essence philosophical, centered on the question of what constitutes a species. However, recent studies employ a phylogenetic species concept [26, 27, 28, 29], positing that consistent differences in physical appearance, ecology and social behavior justify the recognition of six extant species: P. anubis, P. hamadryas, P. papio, P. cynocephalus, P. ursinus and P. kindae. In this study, we recognize all six as species.
Despite considerable interest in Papio systematics, a fully resolved consensus phylogeny remains undetermined [20, 26, 30]. Interfertility has been documented between all neighboring species, with persisting natural hybrid zones in several regions where distinct morphotypes (species) come into contact [27, 31, 32, 33, 34, 35, 36]. Thus, discordance between mitochondrial, morphological, and nuclear phylogenetic reconstructions could in part stem from a dense history of admixture and reticulation persisting throughout the course of Papio evolution. Mitochondrial-based phylogenies support the divergence of Papio into northern and southern lineages [26, 30]. Individuals belonging to P. anubis, P. papio and P. hamadryas are consistently placed within the northern clade, with individuals belonging to P. kindae and P. ursinus comprising the southern clade. In these analyses, however, the placement of P. cynocephalus remains unclear, with individuals found in both clades. In addition, such reconstructions have proven unsuccessful at resolving phylogenetic relationships within each clade. Thus, additional analyses employing novel methodologies could further serve to elucidate evolutionary relationships within Papio.
Alu elements are well-established DNA markers for the study of systematic and population genetic relationships [37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]. In part, they are effective evolutionary characters because of their high copy number in primate genomes and sustained mobilization throughout the course of primate evolution (~ 65 MY) [48, 49, 50]. Over 1.2 million copies have been identified in the human genome, with similar numbers reported for all other haplorrhine genomes sequenced to date [52, 53, 54, 55]. Alu elements are discrete primate-specific DNA sequences (~ 300 bp) belonging to a class of non-LTR (long terminal repeat) retrotransposons termed short interspersed elements (SINEs). Following the transcription of a SINE, the mRNA sequence can be reverse transcribed into DNA, producing a new copy at a novel position in the host genome [56, 57, 58]. Over time, this process, known as target-primed reverse transcription (TPRT), can exponentially increase the retrotransposon content of a host genome. Alu elements, as well as all other SINEs, lack the requisite enzymatic machinery for TPRT; thus they require proteins encoded by larger retrotransposons known as LINEs (long interspersed elements) [48, 59, 60].
SINEs are valuable evolutionary characters because they can be assumed to be identical by descent, meaning that insertions shared between individuals were inherited from a common ancestor, rather than acquired by independent events. Additionally, retrotransposons have known directionality [62, 63], with the ancestral state being the absence of the insertion. Alu elements are popular retrotransposon markers because their short length makes them particularly easy to assay using standard PCR. Alu elements are considered nearly homoplasy-free [48, 49], and most potential sources of homoplasy involving them can be resolved through Sanger sequencing [41, 42, 61]. Recent studies demonstrate the utility of Alu elements for Papio species identification, as well as for retrieving population structure within distinct Papio species [28, 29]. Furthermore, Alu elements have been successfully used to resolve controversial relationships between primates [38, 39, 42, 64]. However, little is known about the efficacy of Alu elements for resolving phylogenetic relationships involving high levels of admixture.
Although a high-quality reference assembly currently exists for only one Papio species (P. anubis), WGS data have been generated for individuals representing all six Papio species through the Baboon Genome Analysis Consortium. Thus, it is possible to conduct a comprehensive whole genome analysis of Papio phylogeny using Alu polymorphisms between species of the genus. For the present study, we created a computational pipeline to identify and characterize recently integrated Alu elements polymorphic within the genus Papio. These Alu insertion polymorphisms were used to reconstruct phylogenetic relationships within Papio. By utilizing M. mulatta as our reference, our approach placed equal evolutionary distance between each Papio diversity sample and the reference assembly [Mmul8.0.1]. The computational analyses performed in this study generated a well-supported phylogeny of Papio baboons and represent the most comprehensive Alu-based phylogenetic analysis reported to date. In addition, we report a novel approach to admixture and reticulation analysis using Alu insertions.
Whole-genome sequencing was performed by the Baylor College of Medicine Human Genome Sequencing Center on a panel of fifteen Papio baboons: four P. anubis, two P. papio, two P. hamadryas, three P. kindae, two P. cynocephalus, and two P. ursinus. In order to sample an equal number of individuals from each species, we used two individuals from each of the six extant Papio species (we randomly selected two individuals from P. anubis and P. kindae) to conduct our computational analysis. Lastly, our panel included WGS data from the macaque sample used to build the latest M. mulatta assembly [Mmul8.0.1] (Additional file 1).
WGS data were accessed from the NCBI-SRA database. The SRA-toolkit (fastq-dump utility) was used to download paired-end next generation sequencing reads and convert them from .sra files to interleaved fastq files. We then used nesoni (https://github.com/Victorian-Bioinformatics-Consortium/nesoni; last accessed March 2018) to prune all known adapters, cleave bases with a phred quality score of 10 or lower, and exclude reads shorter than 24 base pairs in length. Two output fastq files were produced: one containing clean paired-end reads (both reads passed the nesoni filter), and a second containing unpaired orphan reads (one of the paired-end reads was excised).
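The filtering rule described above can be illustrated with a minimal Python sketch. This is not nesoni's implementation: it only clips low-quality bases from the read ends, skips adapter pruning, and assumes Phred+33 quality encoding. The function names `clip_read` and `sort_pair` are hypothetical.

```python
MIN_QUAL = 10      # bases with a Phred score at or below this value are clipped
MIN_LEN = 24       # reads shorter than this after clipping are discarded
PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def clip_read(seq, qual):
    """Clip low-quality bases from both ends of one read."""
    scores = [ord(ch) - PHRED_OFFSET for ch in qual]
    start, end = 0, len(seq)
    while start < end and scores[start] <= MIN_QUAL:
        start += 1
    while end > start and scores[end - 1] <= MIN_QUAL:
        end -= 1
    return seq[start:end], qual[start:end]


def sort_pair(read1, read2):
    """Clip both mates and route them to the paired or orphan output.

    Each read is a (sequence, quality) tuple. Returns a label and the
    list of clipped reads that are still long enough to keep.
    """
    clipped = [clip_read(*read1), clip_read(*read2)]
    kept = [r for r in clipped if len(r[0]) >= MIN_LEN]
    label = {2: "paired", 1: "orphan", 0: "dropped"}[len(kept)]
    return label, kept
```

A pair in which both mates survive goes to the paired file, while a pair with one surviving mate contributes a single orphan read, mirroring the two output files described above.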
Polymorphic Alu insertion detection
The AluY subfamily has been identified as the youngest and most active Alu subfamily in Simiiformes [48, 67, 68, 69]. Thus, in the alignment phase, we used BWA mem to map paired-end NGS reads to a consensus AluY sequence obtained from Repbase. Individual reads were required to map to either the head (5′) or tail (3′) of the AluY consensus sequence. In addition, reads mapping to the head of an Alu insertion were required to contain at least 15 bp of unmapped/non-Alu sequence directly upstream of the (5′) start of the Alu sequence. Likewise, reads mapping to the tail of the consensus Alu sequence were required to contain no less than 15 bases of unmapped sequence directly flanking the (3′) end of the sequence. Reads were mapped to the AluY consensus twice: once using the standard BWA mem parameters, and a second time using more liberal parameters (described in Additional file 2). Split-reads identified using standard parameters were later used to predict the location of an Alu integration site, while those identified during the liberal run were used simply to provide additional support for the insertion event. The Alu portion of each candidate split-read was then cleaved, and the remaining sequence was aligned to Mmul8.0.1 using bowtie2. Split-reads were categorized as sequences that mapped uniquely to the AluY consensus and the Mmul8.0.1 assembly.
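The head/tail requirement and the 15 bp flank rule can be sketched as follows. This is an illustration only, under simplifying assumptions: the read is already oriented like the consensus, the alignment must reach the exact first or last consensus base, and the `AluHit` record with its fields is a hypothetical stand-in for values that would be parsed from the aligner output.

```python
from dataclasses import dataclass
from typing import Optional

MIN_FLANK = 15  # minimum non-Alu bases required next to the Alu portion


@dataclass
class AluHit:
    """Hypothetical summary of one read aligned to the AluY consensus."""
    read_len: int    # total read length
    read_start: int  # first read base aligned to the consensus (0-based)
    read_end: int    # one past the last aligned read base
    cons_start: int  # first consensus base covered (0-based)
    cons_end: int    # one past the last consensus base covered


def classify_split_read(hit: AluHit, cons_len: int) -> Optional[str]:
    """Return 'head', 'tail', or None if the read is not a usable split-read."""
    upstream = hit.read_start                # unmapped bases before the Alu part
    downstream = hit.read_len - hit.read_end  # unmapped bases after the Alu part
    if hit.cons_start == 0 and upstream >= MIN_FLANK:
        return "head"
    if hit.cons_end == cons_len and downstream >= MIN_FLANK:
        return "tail"
    return None


def flank_sequence(read: str, hit: AluHit, kind: str) -> str:
    """Cleave the Alu portion and return the flank to realign to the reference."""
    return read[:hit.read_start] if kind == "head" else read[hit.read_end:]
```

The flank returned by `flank_sequence` corresponds to the non-Alu sequence that would then be aligned to Mmul8.0.1.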
The approximate genomic position of each candidate insertion was calculated directly from the mapping positions of split-reads to Mmul8.0.1 and the AluY consensus. Alu insertion orientation was inferred from the alignment orientation of the supporting reads when mapped to the AluY consensus and Mmul8.0.1 assembly. During this phase the integration orientation of each candidate insertion was predicted in the forward orientation if positioned 5′ to 3′ on the sense strand, and the reverse orientation if positioned 5′ to 3′ on the anti-sense strand. If a split-read mapped in the same orientation to the consensus AluY and the Mmul8.0.1 assembly, it was predicted in the forward orientation. If the alignment orientations were discordant, the insertion was predicted in the reverse orientation.
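The orientation rule in this paragraph reduces to a comparison of two alignment strands. A minimal sketch, in which the function names and the `+`/`-` strand encoding are assumptions made for illustration:

```python
from collections import Counter

STRANDS = {"+", "-"}


def insertion_orientation(alu_strand, ref_strand):
    """Concordant strands give 'forward'; discordant strands give 'reverse'."""
    if alu_strand not in STRANDS or ref_strand not in STRANDS:
        raise ValueError("strands must be '+' or '-'")
    return "forward" if alu_strand == ref_strand else "reverse"


def tally_orientations(split_reads):
    """Count the orientation calls from all split-reads supporting one locus.

    split_reads is an iterable of (alu_strand, ref_strand) pairs.
    """
    return Counter(insertion_orientation(a, r) for a, r in split_reads)
```

How disagreeing reads at a single locus were reconciled is not stated in the text, so the tally is returned as-is rather than collapsed into one call.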
Approximate genomic positions for non-reference (absent in Mmul8.0.1) Alu insertions, predicted in any of the 12 Papio individuals, were concatenated into a comprehensive list with the goal of identifying phylogenetically informative markers. All of these insertions were predicted from split-reads obtained during the standard Alu alignment run. In principle, phylogenetically informative Alu elements would have integrated into the Papio lineage following its divergence from Macaca. Thus, insertions shared between Papio and the Macaca mulatta sample were excluded. Likewise, Alu elements identified in only one Papio sample were phylogenetically uninformative, and thus were also excluded from this portion of the study. The remaining loci were genotyped in every individual on the panel. The three possible genotypes (homozygous present, homozygous absent, and heterozygous) were determined by analyzing sequences spanning the insertion locus. It was initially assumed that an individual was homozygous present for every insertion predicted in that sample. Likewise, it was initially assumed that an individual was homozygous absent for every locus not predicted in that individual. Insertions initially determined to be homozygous present were then re-evaluated to determine whether they were in fact heterozygous. Heterozygosity was determined by evaluating reads that mapped uniquely to the Mmul8.0.1 assembly. An insertion was reclassified as heterozygous if we identified reads in that individual that mapped continuously (without interruption) through the homologous empty site in the Mmul8.0.1 assembly. This empty site was defined as a sequence containing at least 15 bp of flanking sequence both upstream and downstream of the predicted insertion locus. Additionally, if a homozygous absent genotype was predicted in a region with a local read depth less than two standard deviations from the global mean, the genotype was instead considered unknown.
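The genotyping logic above can be summarized in a short decision function. The sketch below rests on one loudly stated assumption: the read-depth rule is read as "local depth more than two standard deviations below the global mean", which is an interpretation rather than something the text spells out. All function and parameter names are hypothetical.

```python
MIN_FLANK = 15  # bases required on each side of the predicted insertion site


def spans_empty_site(read_start, read_end, site, flank=MIN_FLANK):
    """True if a reference-aligned read covers the site plus both flanks.

    Coordinates are 0-based and half-open on the reference assembly.
    """
    return read_start <= site - flank and read_end >= site + flank


def call_genotype(predicted, empty_site_reads, local_depth, mean_depth, sd_depth):
    """Return one of the three genotypes, or 'unknown'.

    predicted        -- insertion was predicted from split-reads in this sample
    empty_site_reads -- number of reads spanning the uninterrupted empty site
    """
    if predicted:
        # Start from homozygous present; empty-site reads imply heterozygosity.
        return "heterozygous" if empty_site_reads > 0 else "homozygous present"
    # Assumption: "unknown" applies where coverage is unusually low.
    if local_depth < mean_depth - 2 * sd_depth:
        return "unknown"
    return "homozygous absent"
```

For example, a sample with no predicted insertion at a locus covered at 5x, against a global mean of 30x and a standard deviation of 5x, would be scored as unknown rather than homozygous absent under this reading.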
The performance of the algorithm used in this study was assessed by comparing its predictions with PCR validations performed for 494 loci in a panel of six Papio baboons: one from each extant Papio species. From this dataset, our algorithm correctly predicted 98% of the PCR-validated events for presence/absence. In addition, the correct genotype (homozygous present, homozygous absent, or heterozygous) was computationally predicted for 93% of all events.
Basal divergence analysis
Previous phylogenetic analyses support the ancestral divergence of Papio into two clades: northern and southern lineages [26, 30]. To evaluate this hypothesis, we created a computational method to identify the basal divergence model best supported by our Papio dataset. A genus comprising six species, with three different possible phylogenetic topologies, generates 31 unique models for estimating the basal divergence (Additional file 3). For each model, we determined the total number of insertions that supported or conflicted with that basal divergence. We calculated the standard deviation and z-score for each model. The model with the highest z-score represents the basal divergence model best supported by the dataset.
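The count of 31 models matches the number of ways to split six species into two non-empty groups, (2^6 − 2) / 2 = 31. One reading of the "three topologies" is the three possible group sizes: 1 vs 5 (6 splits), 2 vs 4 (15 splits), and 3 vs 3 (10 splits). The Python sketch below enumerates these splits and ranks them. The scoring rule (an insertion supports a split when all of its carriers fall on one side) and the z-score taken across the 31 support counts are assumptions made for illustration; the exact statistics used in the study are described in Additional file 3.

```python
from itertools import combinations
from statistics import mean, pstdev

SPECIES = ("anubis", "hamadryas", "papio", "cynocephalus", "ursinus", "kindae")


def basal_models(species=SPECIES):
    """Enumerate every split of the species into two non-empty groups."""
    first, rest = species[0], species[1:]
    models = []
    # Fixing the first species on side A avoids counting mirror images twice.
    for k in range(len(rest)):
        for others in combinations(rest, k):
            side_a = frozenset((first,) + others)
            models.append((side_a, frozenset(species) - side_a))
    return models


def count_support(model, insertions):
    """Count insertions confined to one side (support) or spanning both."""
    side_a, side_b = model
    everyone = side_a | side_b
    support = conflict = 0
    for carriers in insertions:
        if carriers == everyone:
            continue  # present in all species: uninformative for any split
        if carriers <= side_a or carriers <= side_b:
            support += 1
        else:
            conflict += 1
    return support, conflict


def rank_models(insertions, species=SPECIES):
    """Return (z, support, model) tuples sorted from best to worst."""
    models = basal_models(species)
    supports = [count_support(m, insertions)[0] for m in models]
    mu, sd = mean(supports), pstdev(supports)
    z = [(s - mu) / sd if sd else 0.0 for s in supports]
    return sorted(zip(z, supports, models), key=lambda t: t[0], reverse=True)
```

Each insertion is passed in as a frozenset of the species that carry it. Note that this simple rule favors lopsided splits when few insertions involve the isolated species, which is one reason a normalized statistic is needed to compare models.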
We used the model representing the basal divergence with the highest z-score (described in the previous section) as a pre-condition for our phylogenetic analysis. A comprehensive list of Alu insertions supporting this model (consistent with the north-south split hypothesis) was used to further resolve phylogenetic relationships within Papio. A heuristic search was performed using PAUP* 4.0b10. Since it is assumed that the absence of an Alu insertion is the ancestral state of each locus, Dollo’s law of irreversibility was used in the analysis. Thirteen individuals were evaluated in this analysis: 12 Papio baboons, two representing each of the six extant Papio species, along with the M. mulatta sample used to build the Mmul8.0.1 assembly. Each individual received a score for each locus based on its computationally derived genotype. The presence of an insertion was scored as “1” for a filled site and “0” for an empty site; unknown genotypes were scored as “?”. Using PAUP*, we conducted a heuristic search using genotype data from Alu polymorphisms concordant with the north-south split, with M. mulatta set as the outgroup. All loci were classified as individual insertions and set to Dollo.up for parsimony analysis as described previously. Ten thousand bootstrap replicates were performed with the maximum tree space set to all possible trees.
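The "1", "0", "?" scoring can be turned into a character matrix in NEXUS format, the input format PAUP* reads. The sketch below covers only the data block; the PAUP* commands for the outgroup, Dollo character type, and bootstrap search are deliberately left out. The sample labels and genotype strings are hypothetical, and heterozygous calls are scored as present, which is an assumption consistent with the presence/absence scoring described above.

```python
def score(genotype):
    """Map a genotype call to a presence/absence character state."""
    if genotype in ("homozygous present", "heterozygous"):
        return "1"  # filled site
    if genotype == "homozygous absent":
        return "0"  # empty site
    return "?"      # unknown genotype


def to_nexus(genotypes):
    """Build a NEXUS DATA block from {taxon: [genotype per locus]}.

    Taxon labels should avoid spaces (use underscores) so they need no quoting.
    """
    taxa = list(genotypes)
    lengths = {len(calls) for calls in genotypes.values()}
    if len(lengths) != 1:
        raise ValueError("every sample needs a call at every locus")
    nchar = lengths.pop()
    width = max(len(t) for t in taxa)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(taxa)} NCHAR={nchar};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;',
        "  MATRIX",
    ]
    for taxon in taxa:
        row = "".join(score(g) for g in genotypes[taxon])
        lines.append(f"    {taxon.ljust(width)}  {row}")
    lines += ["  ;", "END;"]
    return "\n".join(lines)
```

With 13 individuals and one column per concordant locus, the M. mulatta row would consist entirely of "0" states, since every locus in the dataset is absent from the reference sample by construction.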
We wrote a series of Python scripts to sort Alu insertions into clusters based on which baboons shared the insertion. This allowed us to determine the total number of Alu insertions shared between different sets/combinations of baboons. Each cluster contained Alu insertions shared among a distinct combination of baboons, yet absent from all other samples. For example, one cluster contained all Alu insertions shared between the P. cynocephalus samples and the P. kindae samples, yet absent from all remaining samples. Another cluster comprised Alu insertions shared between all six northern baboons, yet absent from all six southern baboons. Each cluster represents the total number of insertions shared uniquely between a particular “combination/set” of baboons. The resulting clusters were then analyzed to identify patterns of shared Alu polymorphisms. Using these scripts, we quantified the total number of Papio-indicative Alu insertions: markers present in all six extant Papio species, yet absent from the M. mulatta sample. Clade-indicative Alu polymorphisms were defined as insertions present in every species belonging to one clade, yet absent from all individuals in the other clade. In addition, we evaluated patterns of shared Alu polymorphism exhibited within each clade. In this analysis, we identified Alu polymorphisms exclusive to either the northern or southern clade, yet not present in all species within that clade. Lastly, we quantified the total number of species-indicative Alu elements, defined as Alu polymorphisms present in both individuals belonging to a species, yet absent from all other Papio individuals in our panel.
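The clustering step amounts to grouping loci by the exact set of samples that carry them. A minimal sketch of that idea, not the authors' scripts; the locus identifiers and sample labels in the usage note are hypothetical:

```python
from collections import Counter


def cluster_counts(carriers_by_locus):
    """Count loci per exact combination of carrier samples.

    carriers_by_locus maps a locus identifier to the set of samples in
    which the insertion is present.
    """
    return Counter(frozenset(c) for c in carriers_by_locus.values())


def species_indicative(clusters, samples_by_species):
    """Count insertions found in both individuals of a species and no others."""
    return {
        species: clusters.get(frozenset(samples), 0)
        for species, samples in samples_by_species.items()
    }
```

For instance, with `{"locus1": {"kindae_1", "kindae_2"}, "locus2": {"kindae_1", "kindae_2"}, "locus3": {"kindae_1", "ursinus_1"}}`, the cluster keyed by both P. kindae samples has a count of 2, which is also the species-indicative count for P. kindae. Because the key is the exact carrier set, an insertion is never counted in more than one cluster.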
Polymorphic Alu identification
WGS data for multiple Papio baboons were generated through the Baboon Genome Analysis Consortium and made available on NCBI. From this dataset we selected a diversity panel consisting of 12 Papio baboons: two from each of the six extant species. We then used our computational pipeline to process these WGS samples, targeting Alu insertions present in multiple diversity samples, yet absent from the latest M. mulatta reference assembly [Mmul8.0.1]. In total, we identified 187,379 Alu insertions fitting this criterion.
Basal divergence modeling
We evaluated 31 distinct basal divergence models (see Methods), to determine the one best supported by our computational genotype data (Additional file 3: Figure S1). The model with the largest z-score divided the Papio genus into two lineages: a northern clade containing P. papio, P. anubis, and P. hamadryas; and a southern clade consisting of P. cynocephalus, P. ursinus, and P. kindae (Additional file 3: Table S1). Of the 187,379 non-reference insertions (not present in Mmul8.0.1) reported in the previous section, 123,120 were concordant with this north-south basal divergence model (~ 66%) and 64,259 (~ 34%) were discordant.
Even with the increasing availability of WGS data, admixture remains a fundamental challenge for evolutionary biologists. Nevertheless, the abundance of genomic data provides scientists the opportunity to use novel methodologies to re-examine complex evolutionary relationships. Well-documented extant hybrid zones coupled with a dense history of reticulation complicate the task of neatly organizing Papio baboons into a phylogenetic tree. Baboons are popular, well-established research models for studying human disease and evolution, and therefore understanding the pattern of genetic variation within and between baboon species is important. As a result, an accurate and detailed understanding of Papio genomic evolution is quite valuable.
Despite the increasing availability of WGS data, high quality assemblies are not commonly constructed for multiple species belonging to the same genus. Instead, one individual is often used to build an assembly representative of an entire genus. However, WGS data are often generated from individuals belonging to different species within that genus. For Papio baboons, a high quality (chromosome-level resolution) reference assembly exists only for Papio anubis, yet WGS data have been generated for multiple individuals from each extant Papio species. A traditional method for identifying Alu elements polymorphic within a genus involves identifying markers present in an assembly of interest, yet absent from the closest primate relative with a draft assembly. For Papio baboons, a lineage-specific Alu polymorphism would be defined as an element present in P. anubis, yet absent in rhesus macaques [as represented by the assembly Mmul8.0.1]. Since all of the resulting markers would be identified in a P. anubis individual, this would introduce sampling bias towards markers present in P. anubis. However, our computational approach allowed us to align all of our representative Papio samples against the outgroup rhesus macaque [Mmul8.0.1], placing equal evolutionary distance between each Papio individual and the reference assembly. As a result, we were able to identify polymorphic Alu elements with minimal directional bias.
Analyses conducted using mitochondrial DNA support the most basal divergence of Papio into northern and southern clades. However, these analyses were unable to produce a phylogeny that fully resolved evolutionary relationships between Papio species. Our findings provide support for this basal north-south split hypothesis. Furthermore, this study produces the first whole genome computational analysis of Alu polymorphisms within Papio. By designing a computational method to detect and characterize Alu polymorphisms from multiple Papio individuals representing all known extant species and evaluating various basal divergence models, we were able to produce a fully resolved phylogeny of Papio baboons with 100% bootstrap support at each node.
In addition, our analysis of elements discordant with this phylogenetic model may offer insights into a complex history of admixture and reticulation within the Papio lineage. In the southern lineage, P. kindae shows the highest incidence of Alu insertions shared with the northern clade, yet absent from the other southern clade samples (11,286 elements). In total, we identified 64,259 elements discordant with the topology of the phylogenetic tree (Fig. 2) that could be due to incomplete lineage sorting (ILS) or hybridization/admixture. Continued analyses involving a greater number of individuals would be necessary to accurately explain the taxonomic distribution of these insertions. Such analyses could potentially elucidate insertions indicative of speciation, the north-south split, hybridization, and many other evolutionary events. Thus, the data presented in this paper may be utilized to further evaluate Papio evolution. Such studies are likely necessary given the rich diversity that exists within the genus Papio. Furthermore, this approach has outstanding potential to inform analyses of other primate genera with complex evolutionary histories (e.g. Cercopithecus, Macaca, Chlorocebus, Aotus, Microcebus, Saimiri and others).
Contemporary arguments in favor of applying a phylogenetic species concept to the Papio genus rely heavily on the rich species diversity exhibited between morphotypes. Our findings provide support for the genetic diversity that exists within the genus Papio. In each extant species, we found an average of over 8000 elements shared exclusively between members belonging to that species. Despite previous debate as to whether P. kindae warrants species level classification, the largest number of species-specific elements characterized in this study was identified in P. kindae (12,891).
One limitation of this study is that it is based on only 12 Papio individuals: two representing each species. It is very likely that the genetic diversity observed in each individual does not comprehensively represent diversity existing within the species as a whole. Each wild Papio species occupies a large range across the African continent; thus proximity to hybrid zones may contribute to interspecies diversity that is not captured in this analysis. Several species occupy ranges that contact other Papio species (Fig. 3b). Little is known about within-species diversity. Only through further large-scale sampling and analyses can this be evaluated.
In conclusion, this study demonstrates the utility and efficacy of a whole genome analysis of Alu polymorphisms for resolving controversial phylogenetic relationships. In addition, it demonstrates the importance of employing diverse methodologies. Knowledge of the initial divergence of Papio into northern and southern clades, produced by previous studies and supported in this study, was instrumental in our analysis of Papio evolution. Despite a high incidence of hybridization and sustained hybrid zones, we were able to produce a highly supported cladogram, resolving relationships within both the northern and southern clades. These data represent the most comprehensive Alu-based phylogenetic reconstruction reported to date. In addition, this study produces the first fully resolved Alu-based phylogeny of Papio baboons. Our approach may offer useful applications for investigating other unresolved branches of the primate evolutionary tree.
The authors would like to thank all the members of the Batzer Lab and the Baboon Genome Analysis Consortium for all of their hard work and thoughtful insights.
Membership in the Baboon Genome Analysis Consortium is listed as Additional file 5.
This research was funded by the National Institutes of Health R01 GM59290 (M.A.B.).
Availability of data and materials
The algorithms used in this study are available on GitHub (https://github.com/papioPhlo/polyDetect; last accessed March 2018). The Additional Information files are available in the online version of this paper and through the Batzer Lab website under publications, https://biosci-batzerlab.biology.lsu.edu; last accessed March 2018. Additional file 1, A1, is an Excel document (.docx) containing a WGS samples list. This file provides information for all of the WGS samples used in this analysis, including a link to the WGS data publicly available on the NCBI-SRA database. Additional file 2, A2, is a Word document (.docx) providing a more detailed description of the computational pipeline. Additional file 3, A3, is a Word document (.docx) providing all of the possible basal divergence models with the corresponding statistics determined for each. All datasets generated or analyzed in this study will be available from the corresponding author upon reasonable request. Additional file 4, A4, is an extension of Fig. 3. It is an Excel file (.docx) containing the complete cluster list: all 57 clusters identified in this analysis. Additional file 5, A5, is a Word document (.docx) containing Baboon Genome Analysis Consortium Member Information. This file lists the members of the Baboon Genome Analysis Consortium as well as their contact information.
VEJ, JAW, KCW, CJJ, JR, MKK, and MAB designed the research. VEJ and TOB designed the computational algorithm and performed computational analyses. VEJ, JAW, CJS, TOB, CLM, and CPS-R conducted the PCR validation experiments. VEJ wrote the manuscript; JAW, KCW, JP-C, CJJ, JR, MKK, and MAB revised the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
12. Strum SC, Mitchell W. Baboon models and muddles. In: Kinzey WG, editor. The evolution of human behavior: primate models. New York: New York State University Press; 1987. p. 87–104.
22. Caldecott JO, Miles L, United Nations Environment Programme, World Conservation Monitoring Centre. World atlas of great apes and their conservation. University of California Press; 2005.
25. Groves CP. Primate taxonomy. Washington DC: Smithsonian Institution Press; 2001.
33. Maples WR, McKern TW. A preliminary report on classification of the Kenya baboon. In: Vagtborg H, editor. The baboon in medical research, vol. 2. San Antonio: University of Texas Press; 1967. p. 13–22.
65. NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2016;44(Database issue):D7–D19.
70. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, vol. 1303; 2013.
73. Swofford DL. PAUP*: phylogenetic analysis using parsimony, version 4.0b10. 2011.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.