Background

One of the most exciting outcomes of the genomic era and the plunging cost of DNA sequencing is the increased efficiency with which genetic variants associated with disease risk and progression can be identified. This leap forward portends the rapidly approaching era of predictive genetic medicine and personalized therapy. Translating this insight into the clinic will require robust genomic models to test and advance gene-based therapeutic approaches. Non-human primates (NHPs), especially macaques, are well positioned to serve as a critical link, providing a highly relevant pre-clinical model. Due to their high level of genomic, physiological, anatomical, neurological and behavioral similarity to humans, macaques are often indispensable to the study of complex human disease.

The close evolutionary history of macaques and humans is evident in their highly similar breadth of natural disease susceptibilities. Moreover, the parallel genetic associations already reported for a broad range of diseases (Table 1) suggest an opportunity to leverage the macaque for the discovery of disease-associated biomarkers and for use in the development of new pharmacogenomic or personalized medicine therapies [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30]. The range of orthologous genetic associations includes common diseases, such as cancers, reproductive disorders, retinal and infectious diseases, and rare maladies such as Krabbe disease. Similarly, complex behavioral traits such as heightened anxiety, and variable response to commonly used pharmacological agents such as naltrexone, have been linked to variants in the same genes as in humans. Other macaque disease models, such as for autism, polycystic ovarian syndrome (PCOS) or type II diabetes, have been reported, but genetic linkages have yet to be uncovered (Table 1). The growing list of macaque traits is likely the ‘tip of the iceberg’ of the natural models of human disease, as other diseases likely go undetected either due to a less obvious phenotype (e.g., hearing loss) or due to early life lethality. The broad spectrum of natural models in rhesus macaque populations presents largely untapped opportunities for translational study, and may negate the need for more costly model development approaches, such as gene-editing.

Table 1 Examples of diseases and traits reported in rhesus macaques. Diseases are grouped based on whether a genetic association has been reported in macaques for that disease

Despite the clear value of macaques as human disease models, there are very limited genomic resources available to support the exploration and discovery of potential natural models of genetic disease. Notably, there is a paucity of public databases tailored to searching NHP sequence variant and phenotypic data. The major human variant databases, dbSNP [31] and dbVar [32], phased out support for non-human data in 2017. The NHP Research Consortium Genetic Variant Database website provides a portal for users to deposit NHP SNP data [33]. While this portal is valuable, related information such as allele frequencies, predicted functional effects, cross-species conservation and links to source animals is not available, limiting its utility for variant interpretation. Ensembl includes macaque data, along with a diverse set of databases and tools, a genome browser, and varying degrees of population-level and phylogenetic data. Here we present the macaque Genotype And Phenotype (mGAP) resource, the first public website providing searchable, extensively annotated macaque genomic data generated from a large, well-characterized rhesus macaque cohort. Once registered, users can browse a curated catalog of variants derived from the whole genome sequence analysis of a large and growing cohort of rhesus macaques. The website and dataset complement existing resources such as Ensembl, primarily by providing data from a defined cohort, much like the 1000Genomes project or similar human cohorts [34]. Importantly, because mGAP provides genotype-level data for all subjects, it is possible to identify specific individuals harboring alleles of interest. We developed a novel annotation pipeline that leverages human annotation data sources to guide interpretation of the macaque variants, an approach that could easily be adapted for other model organisms.
The website is designed to allow direct exploration of genomic data, with easy export of all information in standard conventional formats for external processing. mGAP represents an important new resource for NHP genetic and genomic research, which together with the expanding range of macaque ‘omics’ data, supports investigation into links between sequence variants and molecular mechanisms of disease.

Construction and content

Rhesus macaque variant catalog

A key feature of mGAP is a catalog of high confidence genomic variants, derived from high depth, whole genome data. The current mGAP release at the time of this publication (1.7) contains whole genome data from 293 Indian-origin rhesus macaques. These animals were selected from a multi-generational outbred cohort of macaques bred at the Oregon National Primate Research Center (average kinship 0.0016). Genomic DNA was isolated and used to generate Illumina libraries as previously described, and the libraries were sequenced on an Illumina XTen sequencer [35]. We obtained an average of 704,193,015 paired Illumina reads per animal, with a range of 549,660,386 to 1,843,458,162 (Additional file 1: Table S1). Primary FASTQ files have been submitted to SRA under BioProject PRJNA382404. All data were aligned to the MMul_8.0.1 genome (GCF_000772875.2), with an average of 700,505,668 paired reads aligned per animal, providing an average high-confidence genome coverage of 20X. These sequence data were processed using a previously described macaque variant calling pipeline ([35] and Additional file 2: Supplemental Methods), adapted from the Broad Institute Best Practice Variant Discovery Pipeline [36]. The pipeline includes extensive quality filtering of the raw variant data as previously described, including the use of Mendelian inheritance for validation.

The mGAP 1.7 release contains 17,087,212 passing variants from this dataset (Fig. 1a). Of these, 2,664,020 (15.5%) are private alleles (detected in only a single individual). Because the cohort includes distantly related individuals, the proportion of private alleles is likely lower than would be observed in an unrelated cohort of similar size. These data include single nucleotide variants (SNVs) and short indels, with indels comprising 12.0% of variants. As expected, the majority of variants are outside of coding regions (Fig. 1b); however, we identified 250,030 variants within coding regions, including 4888 frameshift mutations and 1871 stop-gained mutations. The mGAP variant database is updated approximately 3–4 times per year, as new data become available.
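
As a minimal illustration of how private alleles (variants detected in only a single individual) can be counted from genotype-level data, the following Python sketch uses a toy genotype table. It is not the mGAP pipeline: the animal IDs, variant keys and helper names (`carriers`, `is_private`, `private_fraction`) are hypothetical, and genotypes are assumed to be diploid and encoded as alternate-allele counts.

```python
from typing import Dict, List, Optional

# Genotypes are encoded as the number of alternate alleles carried (0, 1 or 2);
# None marks a missing call. All animal IDs and variant keys below are made up.
Genotypes = Dict[str, Optional[int]]


def carriers(genotypes: Genotypes) -> List[str]:
    """Return the animals carrying at least one alternate allele."""
    return [animal for animal, gt in genotypes.items() if gt is not None and gt > 0]


def is_private(genotypes: Genotypes) -> bool:
    """A private allele is detected in exactly one individual."""
    return len(carriers(genotypes)) == 1


def private_fraction(catalog: Dict[str, Genotypes]) -> float:
    """Fraction of variants in the catalog that are private alleles."""
    if not catalog:
        return 0.0
    return sum(is_private(g) for g in catalog.values()) / len(catalog)


toy_catalog: Dict[str, Genotypes] = {
    "chr1:1000:A>G": {"animal_1": 0, "animal_2": 1, "animal_3": 0},
    "chr1:2000:C>T": {"animal_1": 1, "animal_2": 2, "animal_3": 0},
    "chr2:500:G>GA": {"animal_1": None, "animal_2": 0, "animal_3": 2},
    "chr2:900:T>C": {"animal_1": 1, "animal_2": 1, "animal_3": 1},
}

toy_private_fraction = private_fraction(toy_catalog)
```

In this toy table two of the four variants have a single carrier. In a cohort containing relatives, an allele shared by a parent and offspring would not count as private, which is consistent with the expectation stated above.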

Fig. 1

Summary of variant catalog. a Table listing the number of total variants and variants by type. b Chart showing the distribution of variants relative to gene regions. For each category, the first percentage indicates the fraction of variants of this type. The second percentage indicates the fraction of the genome of this category (when excluding repetitive regions, which are masked from mGAP variant calling). The categories Upstream Gene and Downstream Gene are defined as within 5000 bp of the gene start or end, respectively. c The derived allele frequency spectrum of variants in the mGAP database

Variant functional annotation

Assigning functional significance to genetic variants is often a challenge, and this is particularly problematic for model organisms like macaques that have far less developed annotation resources than humans. To overcome this, we developed a multi-part process to annotate our data (Fig. 2). Rhesus macaque variants are first lifted to the human genome (GRCh37.p13, GCA_000001405.14). In the latest release, approximately 65% of macaque variants were successfully lifted to the human genome. Any variant that could not be lifted to the human genome was annotated accordingly. Of the variants that lifted to human, 3,046,012 (17.8%) matched positions reported in the 1000Genomes phase 3 dataset, representing a range of macaque allele frequencies (Additional file 3: Figure S1). The successfully lifted variants were annotated using multiple sources. We utilized Cassandra, a utility that annotates variants overlapping with a variety of human data sources [37], including regulatory elements (ENCODE [38], RegulomeDB [39]), overlap with known disease- or phenotype-associated variants (GRASP [40]), predicted impact (CADD [41, 42], SIFT [43], PolyPhen2 [44]), conservation (Phylop [45], PhastCons [45]), and others [37, 46, 47]. We additionally annotated any variants with identical position and allele as a variant recorded in ClinVar as ‘pathogenic’ or ‘likely pathogenic’ (ClinVar download 9/1/2018) [48,49,50]. After annotation, variants were translated back to their original macaque genomic coordinates. In addition to human annotations, SnpEff was used to annotate predicted impact on protein coding using the macaque gene annotations [51]. While developed for rhesus macaque, the same liftover annotation process could easily be adapted to other model organisms. The final product of the annotation pipeline is a single annotated VCF file, which is a standard and interchangeable format. All tools created for this process are publicly available [52].
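
The overall flow of this strategy (lift to human coordinates, annotate, return to macaque coordinates, and merge with the variants that did not lift) can be sketched in a few lines of Python. This is only a conceptual sketch under simplifying assumptions, not the published tooling: a small dictionary stands in for a real liftover chain file, strand and allele flips are ignored, and the coordinates, annotation values and the `annotate_by_liftover` name are hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[str, int]  # (chromosome, 1-based position)

# Hypothetical stand-in for a liftover chain: macaque coordinate -> human coordinate.
TOY_LIFTOVER: Dict[Coord, Coord] = {
    ("chr1", 1000): ("chr1", 5200),
    ("chr1", 2000): ("chr1", 6400),
}

# Hypothetical human annotations keyed by human coordinate and alternate allele.
TOY_HUMAN_ANNOTATIONS: Dict[Tuple[Coord, str], Dict[str, object]] = {
    (("chr1", 5200), "G"): {"conservation": 2.3, "regulatory": "enhancer"},
}


def annotate_by_liftover(variants: List[dict]) -> List[dict]:
    """Annotate macaque variants via human coordinates, keeping macaque coordinates.

    Each output record retains its original macaque chromosome and position, so
    lifted and non-lifted variants end up merged in a single list.
    """
    annotated = []
    for variant in variants:
        record = dict(variant)  # the macaque coordinates are never overwritten
        human = TOY_LIFTOVER.get((variant["chrom"], variant["pos"]))
        if human is None:
            # Variant could not be lifted: flag it and carry it through unannotated.
            record["lifted"] = False
            record["annotations"] = {}
        else:
            record["lifted"] = True
            record["human_coord"] = human
            record["annotations"] = dict(
                TOY_HUMAN_ANNOTATIONS.get((human, variant["alt"]), {})
            )
        annotated.append(record)
    return annotated


toy_variants = [
    {"chrom": "chr1", "pos": 1000, "ref": "A", "alt": "G"},
    {"chrom": "chr1", "pos": 2000, "ref": "C", "alt": "T"},
    {"chrom": "chr1", "pos": 3000, "ref": "G", "alt": "A"},  # has no human equivalent
]
annotated = annotate_by_liftover(toy_variants)
```

Carrying the macaque record through the whole process, rather than converting coordinates twice, is one simple way to guarantee that every variant returns to its original position; a production pipeline operating on VCF files would need to handle this explicitly.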

Fig. 2

Variant annotation strategy. Macaque variant data are first lifted to the human genome (GRCh37), and annotated against multiple sources. The resulting variants are translated back to the original rhesus macaque coordinates (MMul_8.0.1), and merged with the non-lifted variants

Using this annotation pipeline, there were 2767 variants identical at both position and allele to variants in the human ClinVar database (version GRCh37 20180128). Of these, 276 are listed as pathogenic, spanning 258 distinct genes. This list includes variants linked to 160 distinct disorders/phenotypes, including cardiovascular phenotypes (aortic aneurysm, congenital heart disease), immune disorders (polyglandular autoimmune syndrome, Behcet’s syndrome), and many additional rare diseases (Peutz-Jeghers syndrome, Leigh syndrome). Our annotation process also identified many variants predicted to severely impact protein coding, including 12,472 predicted as high impact by SnpEff and 13,129 predicted as damaging by PolyPhen2, encompassing 4091 genes with predicted frameshift mutations. A complete list of variants is available through the mGAP website.
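
The requirement that a match be identical at both position and allele can be made concrete with a short Python sketch. The records, gene labels and function names below are hypothetical and serve only to show why a shared position alone is not sufficient; the actual comparison in mGAP is performed by the annotation tools cited above.

```python
from typing import Dict, List, NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def match_clinvar(
    lifted: List[Variant], clinvar: Dict[Variant, Dict[str, str]]
) -> Dict[Variant, Dict[str, str]]:
    """Keep lifted variants whose position AND alleles equal a ClinVar record."""
    return {v: clinvar[v] for v in lifted if v in clinvar}


def pathogenic_only(matches: Dict[Variant, Dict[str, str]]) -> Dict[Variant, Dict[str, str]]:
    """Restrict matches to records labeled pathogenic or likely pathogenic."""
    wanted = {"pathogenic", "likely pathogenic"}
    return {v: rec for v, rec in matches.items() if rec["significance"].lower() in wanted}


# Toy data in human coordinates; all values are invented for illustration.
lifted_variants = [
    Variant("1", 5200, "A", "G"),
    Variant("1", 6400, "C", "T"),
    Variant("2", 300, "G", "A"),
]
toy_clinvar = {
    Variant("1", 5200, "A", "G"): {"significance": "Pathogenic", "gene": "GENE_A"},
    # Same position as a lifted variant but a different allele: not a match.
    Variant("1", 6400, "C", "A"): {"significance": "Pathogenic", "gene": "GENE_B"},
    Variant("2", 300, "G", "A"): {"significance": "Benign", "gene": "GENE_C"},
}

matches = match_clinvar(lifted_variants, toy_clinvar)
pathogenic = pathogenic_only(matches)
pathogenic_genes: Set[str] = {rec["gene"] for rec in pathogenic.values()}
```

Using the full (chromosome, position, reference, alternate) tuple as the lookup key means that the variant at position 6400 is excluded even though ClinVar has a pathogenic record at the same site.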

mGAP website design and data storage

The mGAP website was developed as a module for LabKey Server [53], written using a mixture of Java and JavaScript. The site was designed to heavily leverage existing public resources, such as integration with JBrowse, an open-source genome browser [54]. The source code for the mGAP module and all related modules is freely available through the LabKey Server public subversion repository (https://svn.mgt.labkey.host/stedi/trunk). The primary mode of data storage for each release is the annotated VCF file, which has the advantage of being easily exportable and capable of being processed with most standard variant-processing tools. Limited summary statistics on each release are stored in the database, such as the number of variants and a breakdown of variants by type. The site includes demographics data on all animals, which is also stored and implemented using the LabKey study framework [53].
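
Because each release is stored as an annotated VCF file, exported data can be read with very little code. The following Python sketch parses the eight fixed columns of a single VCF data line and splits the INFO column into key/value pairs. It assumes a tab-delimited line as defined by the VCF format; the INFO keys `IMPACT` and `LIFTED` in the example are hypothetical and are not necessarily the field names used in mGAP releases. For real analyses an established VCF library would be preferable.

```python
from typing import Dict, Union


def parse_vcf_line(line: str) -> Dict[str, object]:
    """Parse the fixed columns of one VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    info_dict: Dict[str, Union[str, bool]] = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue  # '.' denotes an empty INFO column
        key, sep, value = entry.partition("=")
        # Entries without '=' are flags and are recorded as True.
        info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # multiple alternate alleles are comma-separated
        "filter": filt,
        "info": info_dict,
    }


# Hypothetical example line; the annotation keys are illustrative only.
example_line = "chr1\t1000\t.\tA\tG\t50\tPASS\tAF=0.012;IMPACT=HIGH;LIFTED"
record = parse_vcf_line(example_line)
```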

Utility and discussion

The mGAP website overview

The mGAP website provides an interface to download raw data or to explore macaque variants, available at https://mgap.ohsu.edu/. The main page of the website contains a summary of the current release, and tabs for each of the main datatypes: the variant catalog, macaque phenotype and model information, cohort/demographics data and raw sequence data (Fig. 3). The mGAP database currently includes a list of disease models and phenotypes reported in rhesus macaques. The website provides a table with demographics data from the cohort, including sex, species, age, ancestry, and average kinship within the cohort. The mGAP website will be expanded in the future to include clinical data and other phenotypic measures for each subject as well.

Fig. 3

Overview of mGAP website navigation. The front page of the mGAP site contains an overview of the current release, description of the resources, and links to commonly used pages. The website has additional tabs, providing detail on the variant catalog, phenotypes and macaque models, demographics data from the mGAP cohort and access to raw sequence data

Access to variant data

The website is designed to allow any user to easily interact with variant data and to support basic queries of the dataset. The genome browser is the primary mode through which users can interactively explore variants. Users can search by genomic position or gene name, and view variants in the context of genes and other macaque genomic features. Variants are color-coded by type and functional prediction to highlight predicted missense and high-impact variants (Fig. 4). Users can click on any variant to view greater detail, including the complete set of functional annotations and cross-species conservation data. The variant details window also includes the minor allele frequency within the currently sequenced population, distribution of homozygous and heterozygous genotypes, and a link to view individual genotypes at this position (as a separate table). In addition, for each release we generate a list of predicted high-impact variants, intended to include those variants either previously implicated in human disease (such as those overlapping ClinVar data), or predicted to have a high impact on protein coding (such as SnpEff high-impact or PolyPhen2 damaging variants). In the 1.7 release, this high-impact list comprises 25,877 genome-wide variants that are predicted to be associated with 2071 human disease or phenotype entries listed in OMIM (Online Mendelian Inheritance in Man).
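
The per-variant summary described above (minor allele frequency, the distribution of homozygous and heterozygous genotypes, and the individuals carrying the allele) can be derived directly from genotype calls. The Python sketch below shows one way to do so for a biallelic variant in diploid animals; it is an illustration rather than the website's implementation, and the animal IDs and the `genotype_summary` name are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def genotype_summary(genotypes: Dict[str, Optional[int]]) -> Dict[str, object]:
    """Summarize a biallelic variant from alternate-allele counts (0, 1 or 2).

    Missing calls (None) are excluded from all counts and frequencies.
    """
    called = [gt for gt in genotypes.values() if gt is not None]
    counts = Counter(called)
    n_alleles = 2 * len(called)  # two alleles per diploid animal
    alt_freq = sum(called) / n_alleles if n_alleles else 0.0
    return {
        "hom_ref": counts[0],
        "het": counts[1],
        "hom_alt": counts[2],
        "alt_allele_freq": alt_freq,
        "minor_allele_freq": min(alt_freq, 1.0 - alt_freq),
        "carriers": sorted(animal for animal, gt in genotypes.items() if gt),
    }


toy_genotypes = {
    "animal_1": 0,
    "animal_2": 1,
    "animal_3": 2,
    "animal_4": None,  # no call at this position
    "animal_5": 0,
}
summary = genotype_summary(toy_genotypes)
```

Here four animals have calls, giving eight alleles, three of which are the alternate allele. The `carriers` list corresponds to the use case highlighted in this paper: identifying the specific individuals that harbor an allele of interest.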

Fig. 4

The mGAP genome browser. Variants of interest may be searched by chromosome position or gene name in the search bar. Diamonds indicate SNVs and bars indicate indels, with colors indicating predicted impact on protein coding (red = high, yellow = moderate, blue = low). Clicking a variant provides a detailed pop-up with summary and annotations. The provided information includes predicted function, regulatory data, species conservation, phenotypic data, allele frequencies, and genotypes by animal

Conclusions

The macaque Genotype and Phenotype Resource (mGAP) represents a critical resource for genomic research that fills a major gap in the NHP research field. mGAP provides a unique combination of variant and genotype data from a large cohort of Indian-origin rhesus macaques, processed in a consistent manner across all subjects. While comparable datasets are common for most models, until now they have been lacking for macaques. These data serve basic yet essential functions to support modern genomic analyses, including a catalog of common and rare variants in macaques, coupled with allele frequency and genotype-level data. The mGAP website is designed to allow exploration of data, primarily through the genome browser, or complete export of the entire dataset into common formats for external analysis.

The resources provided by mGAP are designed to advance genomic research in macaques, with the primary goal of identifying and developing new genetic disease models. In this regard, the resource has utility that extends beyond the initial set of sequenced animals. Investigators can explore potential associations between a macaque phenotype of interest and variants in candidate genes by using the mGAP browser to identify compelling candidate variants based on predicted translational effects, allele frequencies and cross-species conservation. Prioritized variants can then be evaluated using targeted re-sequencing or genotype-specific assays in macaques lacking genome-wide sequence and tested for phenotype association.
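
The prioritization step described above can also be carried out on exported data. The following Python sketch filters variants in a set of candidate genes by predicted impact and allele frequency, then ranks them by impact and conservation. The thresholds, gene labels, scores and the `prioritize` function are all hypothetical choices for illustration; the impact labels follow the SnpEff categories (HIGH, MODERATE, LOW, MODIFIER).

```python
from dataclasses import dataclass
from typing import List, Set

# SnpEff impact categories, ordered from most to least severe.
IMPACT_RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}


@dataclass(frozen=True)
class Candidate:
    variant_id: str
    gene: str
    impact: str          # one of the IMPACT_RANK keys
    allele_freq: float   # alternate allele frequency in the cohort
    conservation: float  # higher means more conserved across species


def prioritize(
    candidates: List[Candidate],
    genes: Set[str],
    max_freq: float = 0.05,         # assumed cutoff, not an mGAP default
    min_impact: str = "MODERATE",   # assumed cutoff, not an mGAP default
) -> List[Candidate]:
    """Filter to candidate genes, then rank by impact, conservation and rarity."""
    floor = IMPACT_RANK[min_impact]
    kept = [
        c for c in candidates
        if c.gene in genes
        and c.allele_freq <= max_freq
        and IMPACT_RANK[c.impact] >= floor
    ]
    return sorted(
        kept,
        key=lambda c: (-IMPACT_RANK[c.impact], -c.conservation, c.allele_freq),
    )


toy_candidates = [
    Candidate("v1", "GENE_A", "HIGH", 0.01, 1.2),
    Candidate("v2", "GENE_A", "MODERATE", 0.02, 3.5),
    Candidate("v3", "GENE_A", "LOW", 0.01, 4.0),       # impact too low
    Candidate("v4", "GENE_A", "HIGH", 0.30, 2.0),      # too common
    Candidate("v5", "GENE_B", "HIGH", 0.01, 5.0),      # not a candidate gene
    Candidate("v6", "GENE_A", "MODERATE", 0.01, 4.1),
]
ranked = prioritize(toy_candidates, genes={"GENE_A"})
ranked_ids = [c.variant_id for c in ranked]
```

The appropriate frequency cutoff depends on the phenotype under study; a rare, severe trait would justify a stricter threshold than a common one.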

mGAP also presents new opportunities to reveal currently unrecognized macaque disease models, which is facilitated by the highlighted ‘Variants of Interest’ list on the website. This list includes variants that are either identical to those linked to disease in humans or have a predicted highly deleterious impact on protein coding. Supporting this ‘reverse genetic’ approach to model discovery, a comprehensive medical record is established for each of the rhesus macaques housed at the Oregon National Primate Research Center, including regularly updated clinical data, and as available, imaging, histology, and pathology notations. The identification of naturally occurring genetic macaque models provides several advantages over genetically-induced models, such as those produced using CRISPR-like approaches, since there is no danger of experimentally induced off-target effects, the phenotypic evaluation can be carried out in existing animals at considerably less cost than creation of transgenic animals, and the identification of multiple subjects with the desired genotype has the potential to support model expansion.

The rapid expansion of ‘-omic’ level data produced by both NHP and human studies affords exciting new opportunities for high-resolution data integration and analysis. In this regard, mGAP provides a missing link in the coordinated study of genomic variants in combination with macaque transcriptomic, metabolomic and proteomic findings [55,56,57,58]. In parallel, deep learning framework approaches are now emerging that enable the prediction of the functional effects of single base variants on gene expression and disease risk [59]. In addition to purely macaque datasets, studies have shown the value of combined analysis of human and NHP genome-wide data, through the use of neural network approaches that predict variant association with disease, and mGAP data may facilitate similar studies [60].

In addition to model discovery and analysis, mGAP also provides a much-needed resource to improve the design of molecular research tools. For example, information on variant location and allele frequency is critical for optimizing the design of DNA primers for use in PCR amplification or peptide-based ligands for antibody generation. Similarly, SNP array designs for research use or colony management applications will benefit from the available information on allele frequencies and potential pathogenic effects, guiding the selection of appropriate variants to meet the goals of each tool.
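
As one illustration of the primer design use case, the Python sketch below checks whether a proposed primer binding site overlaps any known variant above a chosen allele frequency. The coordinates, frequencies, the 1% default threshold and the `variants_under_primer` name are hypothetical; positions are assumed to be 1-based and inclusive, and a variant is taken to span the length of its reference allele.

```python
from typing import List, NamedTuple


class KnownVariant(NamedTuple):
    chrom: str
    pos: int           # 1-based position of the first reference base
    ref: str           # reference allele; its length defines the span
    allele_freq: float


def variants_under_primer(
    chrom: str,
    start: int,
    end: int,
    variants: List[KnownVariant],
    min_freq: float = 0.01,  # assumed threshold for a variant worth avoiding
) -> List[KnownVariant]:
    """Return variants overlapping the primer site [start, end] (inclusive)."""
    hits = []
    for variant in variants:
        variant_end = variant.pos + len(variant.ref) - 1
        overlaps = variant.pos <= end and variant_end >= start
        if variant.chrom == chrom and overlaps and variant.allele_freq >= min_freq:
            hits.append(variant)
    return hits


toy_known_variants = [
    KnownVariant("chr1", 105, "A", 0.20),
    KnownVariant("chr1", 118, "CTT", 0.05),   # deletion spanning 118 to 120
    KnownVariant("chr1", 110, "G", 0.001),    # too rare to matter at this threshold
    KnownVariant("chr2", 105, "A", 0.50),     # different chromosome
]

# A primer at chr1:100-119 overlaps two common variants; shifting it avoids them.
primer_hits = variants_under_primer("chr1", 100, 119, toy_known_variants)
shifted_hits = variants_under_primer("chr1", 121, 140, toy_known_variants)
```

A primer site with no hits is less likely to suffer allele-specific amplification failure in animals carrying the alternate allele.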

Interpretation of genome-wide data can be daunting. For human genomic data, tools continue to be developed to sift through variants and prioritize potentially damaging or relevant variants. The algorithms used in these tools are generally aided by the considerable body of human genome annotation, ranging from catalogs of disease-associated variants to functional annotation developed by experimental studies. Because these resources do not currently exist for macaques, we developed a process to annotate macaque variants using available human data. The annotation processes developed for mGAP can also serve as a template, adapted to enhance the interpretation of genomic variants associated with other model organisms. While lifting human annotation to macaque has value and can identify relevant variants, interpreting annotation across species must be viewed carefully and can have pitfalls, especially for non-coding and regulatory regions. It is therefore important to develop macaque-specific functional annotation, such as data linking genetic variation to changes in gene expression (i.e. eQTLs) or other functional genomic data. The rapid decrease in sequence costs, and rise of massively parallel single-cell analysis approaches, should enable these data to be generated much more rapidly and at a fraction of the cost of the analogous human studies.

While the majority of mGAP data is centered on Indian-origin rhesus macaques from the ONPRC breeding colony, previous studies have found that these macaques are genetically representative of major research captive breeding populations at other US National Primate Research Centers [61]. Further, the utility of mGAP has attracted interest from a broad range of NHP investigators, and thus mGAP now supports data submissions from other rhesus macaque populations. This expansion in data sources will enable immediate cross referencing of variants of interest, facilitating the identification of genotype-specific animals at multiple facilities. In the advancing era of precision medicine, selection and access to genotyped animals will be increasingly critical to enabling state-of-the-art biomedical research.