Introduction

Rhesus macaques (Macaca mulatta) are one of the most widely studied nonhuman primate species. Biomedical researchers investigating the “normal” physiology, endocrinology, immunology, and neuroscience of primates depend extensively on analyses of this species (Chiou et al. 2020; Kenwood and Kalin 2021; Rossion and Taubert 2019; Wiseman et al. 2013). Studies of fundamental questions in developmental psychology, social behavior, and other aspects of the psychology of complex social mammals have benefited tremendously from analyses conducted in rhesus macaques (Barr et al. 2003, 2004; Beisner et al. 2020; Fawcett et al. 2014; Morin et al. 2020; Schwandt et al. 2010; Shannon et al. 2005; Talbot et al. 2020). In addition, rhesus macaques are widely used as nonhuman primate models of human health and disease (Bimber et al. 2017; Gibbs et al. 2007; Phillips et al. 2014). This species is the primary animal model for studies of infectious diseases such as HIV-AIDS (Liang et al. 2019) and tuberculosis (Sharan et al. 2020) and recently has been employed to understand SARS-CoV-2 and COVID-19 (Klasse et al. 2021). The Expert Panel report titled “Nonhuman Primate Evaluation and Analysis” submitted to the US National Institutes of Health in 2018 indicated that rhesus macaques accounted for 65% of all nonhuman primates (NHP) used in NIH-funded research studies over the period 2013 to 2017.

Given the importance of rhesus macaques for a diverse array of basic and disease-related research questions, it is not surprising that this species has received a great deal of attention from geneticists. The present paper cannot be a comprehensive review of genetic and genomic studies of rhesus macaques, as that literature is voluminous. The goal here is to provide a brief introduction to some of the major sources of genomic information and other genomic resources relevant to rhesus macaques that are available to the research community. Of course, the most significant “resource” for genetic and genomic analysis of rhesus macaques is the population of rhesus maintained and made available to the research community by the NIH-funded National Primate Research Centers (NPRCs). Investigators using this species have long depended both on this animal resource itself and on the veterinary and scientific expertise that are part of the NPRC program. Information about each of the individual primate centers and the resources they make available can be found at www.nprc.org. Each of the seven centers has its own website and all have active programs that can collect and provide DNA, tissue, or other biomaterials to investigators. Additional information is available at www.nprcresearch.org.

In order to fully appreciate and effectively use the genetic information available for rhesus macaques, one must recognize that this species is geographically widespread and consequently genetically diverse. Rhesus macaques have the largest geographic distribution of any NHP (Groves 2001), ranging from Pakistan and Afghanistan in the west (Goldstein and Richard 1989) across Asia to the Pacific Ocean (Groves 2001). In its eastern range, the species extends from as far north as the Taihang Mountains in China (Wenyuan et al. 1993) southward into central Vietnam, Thailand, and Burma (Groves 2001; Ito et al. 2020). While taxonomists consider all these populations to be members of M. mulatta, the number of subspecies recognized differs among authors (Groves 2001; Rowe and Myers 2016). There is nevertheless agreement that the populations of rhesus macaques across this wide geographic range differ in body size, pelage, temperament and other phenotypes, some with biomedical relevance (Ling et al. 2002). The prior recognition of this morphological and behavioral variation suggested to researchers that there were likely to be meaningful genetic differences among rhesus populations. Thus, researchers have performed various comparisons of Indian-origin rhesus macaques (which constitute most although not all of the rhesus in US research colonies) and Chinese-origin rhesus macaques (which are now extensively studied in China).

Early studies of the genetics of rhesus macaques

Researchers have been studying genetic variation within rhesus macaque populations for more than fifty years. The first analyses were led by Christine Duggleby and examined blood group antigens (Duggleby et al. 1971; Duggleby and Stone 1971), which provided only limited information but did generate repeatable data documenting molecular differences among individuals. Several years later geneticists applied methods to detect electrophoretic variation in a wider range of proteins and thus extended analyses beyond blood group antigens (Cheverud et al. 1978; Melnick et al. 1986). One of the rhesus macaque populations that received substantial attention from these early primate geneticists was the free-ranging population of Indian-origin rhesus macaques introduced onto Cayo Santiago Island, Puerto Rico (Widdig et al. 2016). Subsequently, Melnick and other researchers began to quantify genetic variation within and between natural populations of rhesus macaques (Melnick et al. 1986, 1984) and asked other types of questions, such as evidence for natural selection (Smith and Small 1982).

In the early 1990s geneticists working on the human genome identified a new class of DNA sequence polymorphisms that was rapidly adopted as a useful tool in a wide range of applications. The human genome, as well as the genomes of other mammals, contains thousands of short sequences that consist of tandem repeats of two, three, or four base pairs. These loci, called microsatellites or sometimes simple sequence repeats, are susceptible to mutations that alter the number of repeats, and consequently any given microsatellite locus will tend to accumulate substantial allelic variation in any moderately sized population. Primatologists studying many different primate species have used microsatellite variation to quantify the amount and geographic patterns of variation among individuals within species and to conduct paternity testing or manage the genetics of captive colonies (Kanthaswamy et al. 2006; Vigilant et al. 2001). Researchers most often genotype microsatellites using PCR to amplify across the variable set of repeats and detect differences among alleles by comparing the length of PCR products. One valuable result of this approach to microsatellite genotyping was that the investigator could use PCR primers for known microsatellites in one species to test orthologous loci in closely related species (Langergraber et al. 2007). Microsatellite variation in Indian rhesus was described early in this field (Hadfield et al. 2001; Kayser et al. 1996; Morin et al. 1997) and polymorphic loci have been identified in both Indian-origin and Chinese-origin rhesus (Satkoski et al. 2008). A whole genome linkage map for the rhesus genome was initially developed using microsatellites (and associated PCR primers) originally identified in the human genome (Rogers et al. 2006). The map was subsequently extended using new microsatellite polymorphisms identified in the rhesus genome (Raveendran et al. 2006). Lists of polymorphic microsatellite loci are available in these various publications, but this author is not aware of any structured online database collating this type of macaque polymorphism.
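The length-based genotyping described above can be illustrated with a minimal sketch. It assumes a hypothetical dinucleotide locus in which the primers and flanking sequence together contribute 100 bp to every PCR product; the function names, fragment lengths, and flank size are illustrative only and are not taken from any of the cited studies.

```python
def repeat_count(product_length_bp, flank_length_bp, motif_length_bp):
    """Infer the number of tandem repeats from the length of a PCR product."""
    repeat_region = product_length_bp - flank_length_bp
    if repeat_region < 0 or repeat_region % motif_length_bp != 0:
        raise ValueError("product length is inconsistent with the assumed locus structure")
    return repeat_region // motif_length_bp


def genotype(product_lengths_bp, flank_length_bp=100, motif_length_bp=2):
    """Convert the one or two fragment lengths seen in an individual into repeat-count alleles."""
    alleles = sorted(
        repeat_count(length, flank_length_bp, motif_length_bp)
        for length in product_lengths_bp
    )
    if len(alleles) == 1:
        alleles = alleles * 2  # a single fragment length is read as a homozygote
    return tuple(alleles)


# Hypothetical fragment lengths (bp) for two individuals at the same locus.
heterozygote = genotype([140, 132])
homozygote = genotype([136])
print(heterozygote, homozygote)
```

Because alleles are distinguished only by length, two alleles of identical length but different sequence would be scored as the same allele under this scheme.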

Reference genomes for rhesus macaques

In order for investigators to conduct efficient, large-scale genomic analyses of any species, it is critical to have an accurate reference genome sequence for that species. For some types of analysis, such as discovery of single-nucleotide variants, one can use a reference genome from a closely related species (Guevara et al. 2021; Rogers et al. 2019). But this approach is not appropriate for many genomic analyses because the genomes of even closely related species contain differences. For example, gene copy number in particular gene families can differ between species within a primate genus, as can synonymous mutations in protein coding genes or the presence of individual enhancers for specific coding loci. There are today a number of strategies for producing a reference genome adequate to support basic genome-wide studies. It is outside the scope of this paper to discuss the advantages and disadvantages of these different approaches to genome assembly. But a brief history of reference genome sequences for rhesus macaques may be useful. Both assembled sequence scaffolds and relevant supporting information for rhesus macaque genomes are available on several publicly accessible databases, such as Ensembl, the US National Center for Biotechnology Information (NCBI), the University of California Santa Cruz (UCSC) genome browser, and the National Genomics Data Center in China (Table 1).

Table 1 Databases providing various types of genomic data for rhesus macaques

The first NHP species for which researchers generated a reference genome sequence was the chimpanzee, Pan troglodytes. Published in 2005, this work was motivated by a desire to compare the newly completed human reference genome to our closest evolutionary relatives (Chimpanzee Sequencing and Analysis Consortium 2005). The importance of rhesus macaques for biomedical research was the driving motivation for producing an initial reference genome for M. mulatta just two years later (Gibbs et al. 2007). Both the chimpanzee and rhesus macaque first-pass reference genomes were produced using Sanger sequencing methods, making them expensive and time-consuming efforts. As it does for many mammalian genomes, the National Center for Biotechnology Information (NCBI) maintains an online database of genome assemblies that is readily accessible at www.ncbi.nlm.nih.gov/assembly. This database (Table 1) collects reference genomes from prokaryotes and eukaryotes and accumulates new references for a given species as new versions or improvements are submitted. Thus, on the NCBI site there is now a historical record of seven different reference genomes for rhesus macaques, five of them produced using DNA from Indian-origin rhesus and two sequenced from Chinese-origin individuals (He et al. 2019).

The most recent and most complete version of the Indian-origin rhesus genome is Mmul_10, recently published and analyzed by a consortium of investigators (Warren et al. 2020). This new reference assembly was produced using a combination of molecular methods to generate a reference genome more complete and more contiguous (i.e., with fewer gaps) than previous rhesus macaque assemblies. The researchers used Pacific Biosciences RSII long-read sequencing technology to generate initial contigs that were sequence corrected using Illumina short-read data. Bionano optical mapping information was then applied to build extended scaffolds. Next, Dovetail Genomics Hi-C proximity ligation sequencing was used to further correct scaffolding, and finally a series of quality control assessments was performed. The result is a reference genome that is only about 100 megabases longer than the previous Mmul_8.0.1 reference but has a contig N50 length of 42 Mb, and more than 99% of the gaps in the Mmul_8.0.1 assembly have been closed. Assembling long-read sequence data and integrating the result with optical mapping and Hi-C proximity ligation sequencing can produce dramatically improved reference assemblies for macaques and for other nonhuman primates (Kronenberg et al. 2018).
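For readers unfamiliar with the N50 statistic used above to describe contiguity, the following minimal sketch shows how it is commonly computed: contigs are sorted from longest to shortest and the N50 is the length of the contig at which the running total first reaches half of the assembled bases. The contig lengths are toy values chosen for illustration, not data from any rhesus assembly.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths supplied")


# Two hypothetical assemblies with the same total length (150 units).
contiguous_assembly = [50, 40, 30, 20, 10]
fragmented_assembly = [10] * 15

print(n50(contiguous_assembly), n50(fragmented_assembly))
```

The two toy assemblies contain the same number of bases, yet the more contiguous one has the larger N50, which is why the statistic is reported alongside total assembly length rather than in place of it.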

The Ensembl genome browser (Howe et al. 2021) is a valuable source for a wide range of genomic information. At www.ensembl.org one can access several versions of the rhesus genome, including Mmul_10. Like the NCBI site, the Ensembl database also provides easy access to a wide range of annotation data and other information. One should be aware that there are also assemblies for genomes from other macaque species (M. fascicularis and M. nemestrina) on Ensembl. The rhesus reference genome is found by following the “macaque” link in the pull-down list, while the cynomolgus or long-tailed macaque (M. fascicularis) reference is listed under “crab-eating macaque” and the pigtail macaque (M. nemestrina) reference is listed as “pig-tailed macaque.” The UC-Santa Cruz genome browser (Navarro Gonzalez et al. 2021; www.genome.ucsc.edu) provides access to Mmul_10 in a convenient browser view that many investigators find user-friendly and valuable. The Mmul_10 reference can be found by clicking the “rhesus” link on the phylogenetic tree at the left of the UCSC genome browser landing page. Older versions of the rhesus reference genome are also available on UCSC. However, readers are warned that the Mmul_10 assembly is labeled “rheMac10” on UCSC. The original reference assembly from 2007 was labeled “rheMac2” and this naming format has been maintained in some databases as new upgrades appear.
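Because the same assembly carries different labels in different databases, it can help to keep an explicit lookup table when combining files from several sources. The sketch below is one hypothetical way to do this; it records only the two pairings stated in the text, and the table and function names are illustrative rather than part of any database's interface.

```python
# UCSC label -> assembly description. Only pairings stated in the text are listed;
# extend the table after checking each browser's own documentation.
UCSC_TO_ASSEMBLY = {
    "rheMac10": "Mmul_10",
    "rheMac2": "initial 2007 assembly",
}


def resolve_assembly(label):
    """Return (ucsc_label, assembly_name) for either naming style, ignoring case."""
    wanted = label.strip().lower()
    for ucsc_label, assembly_name in UCSC_TO_ASSEMBLY.items():
        if wanted in (ucsc_label.lower(), assembly_name.lower()):
            return ucsc_label, assembly_name
    raise KeyError(f"unrecognised assembly label: {label!r}")


print(resolve_assembly("mmul_10"))
print(resolve_assembly("RHEMAC2"))
```

Raising an error for unknown labels, rather than guessing, is a deliberate choice here: coordinates from different assembly versions are not interchangeable.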

In some cases, the new version of a reference genome is produced using DNA from a different individual than the prior assembly (e.g., Macaca_fascicularis_6.0 vs. Macaca_fascicularis_CE_1.0). In these cases, one should expect sequence and possibly even copy number differences due to legitimate biological differences among individuals within a species. In other cases, investigators have added new data from the original animal used to generate a reference assembly and used those data to improve the prior version. One example of this is the MacaM reference that in 2014 upgraded the rheMac2 assembly based on Sanger sequencing by incorporating new Illumina data, newer assembly algorithms, and newer gene annotations (Zimin et al. 2014). On any of the genome browsers, one should carefully check the date of submission and the origin of the DNA sample used, to confirm that one is looking at the intended version of the genome. For many NHP species (e.g., chimpanzee, gorilla, marmoset, and others), the recent reference genome sequences are more accurate and complete than older versions, so versioning matters.

The National Genomics Data Center (https://ngdc.cncb.ac.cn/) is maintained by the Beijing Institute of Genomics in China. This database provides searchable access to a number of data types related to the RheMacS assembly of the Chinese rhesus macaque genome, which is labeled rheMacS_1.0 on NCBI. This is a high-quality reference for the Chinese rhesus sequenced using long-read Pacific Biosciences technology and assembled using FALCON. As with other recent long-read assemblies, the contig N50 (8.2 Mb) and scaffold N50 (148.4 Mb) are better (longer) than those of older primate assemblies not employing long-read methods. Information including genome sequence, gene annotations, gene expression, and other data is managed there (CNCB-NGDC Members and Partners 2021). The Institute of Molecular Medicine, Peking University, Beijing maintains RhesusBase (Zhang et al. 2013; Zhong et al. 2016), a database that also contains a genome browser for viewing the rhesus genome, as well as various types of annotation information.

Current resources for single-nucleotide variants

As mentioned above, genetic variation within rhesus macaque populations has been the subject of study for decades (Duggleby et al. 1971; Melnick et al. 1986). The present paper will not attempt a comprehensive review but will address recent efforts to document single-nucleotide variants (SNVs) discovered through one or another form of DNA sequencing. The first reports of SNVs this author is aware of include data published as part of the initial reference genome sequencing for rhesus macaques (Gibbs et al. 2007; Hernandez et al. 2007), as well as smaller-scale studies by Ferguson, Norgren, and Smith (Malhi et al. 2007; Street et al. 2007). Researchers are commonly more interested in polymorphisms that influence protein coding sequences as opposed to non-coding or UTR regions. Early studies identifying substantial numbers of coding sequence variants soon followed the first set of SNV analyses (Fawcett et al. 2011; Yuan et al. 2012). It is now clear that rhesus macaques, like many other nonhuman primate species, are segregating for a larger number of SNVs per individual than are humans. Although at this time we know little about SNV diversity in natural populations of Indian rhesus, the Indian-origin rhesus from the US national primate centers have high levels of nucleotide heterozygosity in both coding and non-coding sequences (Warren et al. 2020). Natural populations of Chinese rhesus macaques are currently better studied than wild Indian rhesus, and the Chinese populations also exhibit substantial levels of variation, likely higher than their Indian-origin conspecifics (Liu et al. 2018; Xue et al. 2016).

There are several databases cataloging information about SNVs in rhesus macaques. The UCSC genome browser has the capacity for investigators to generate custom tracks illustrating SNVs in their genomic positions, and all of the more than 85 million SNVs identified through the recent consortium study of variation among US research colonies (Warren et al. 2020) can be viewed on the annotated SNV track within the UCSC rhesus macaque browser (https://genome.ucsc.edu/; then choose the rhesus rheMac10 genome and open the “Rhesus SNVs” track). A relatively new and tremendously useful database of rhesus macaque sequence polymorphism is the Macaque Genotype and Phenotype (mGAP) database (Bimber et al. 2019) developed and maintained by the Oregon National Primate Research Center (https://mgap.ohsu.edu/). Like the UCSC browser, mGAP is a searchable database that allows investigators to identify SNVs in any region or gene within the rhesus genome. Within mGAP one can also obtain information about allele frequencies across all individuals covered by the mGAP database, as well as allele frequencies specific to different research colonies. Most of the information in mGAP relates to Indian-origin rhesus macaques in US research colonies. The mGAP database provides quick, easy access to a great deal of SNV information and metadata, but does not include all the variants found on the UCSC SNV track. RhesusBase is a database that provides information on variation in Chinese-origin rhesus macaques, developed and maintained by the Institute of Molecular Medicine, Peking University (rhesusbase.cbi.pku.edu.cn).
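As an illustration of the custom-track capability mentioned above, the following minimal sketch formats a short list of SNV positions as BED text of the kind that can be pasted into the UCSC custom track page. The SNV coordinates, labels, and track name are hypothetical; the only format details relied on are the leading "track" line and the 0-based, half-open interval convention of BED.

```python
def snvs_to_bed_track(snvs, track_name, description):
    """Build custom-track text in BED format from (chrom, 1-based position, label) tuples."""
    lines = [f'track name="{track_name}" description="{description}"']
    for chrom, position, label in sorted(snvs):
        # BED intervals are 0-based and half-open, so a SNV at
        # 1-based position p occupies the interval [p - 1, p).
        lines.append(f"{chrom}\t{position - 1}\t{position}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical SNVs; positions must refer to the assembly chosen in the browser.
example_snvs = [
    ("chr1", 1000500, "snv_b"),
    ("chr1", 1000200, "snv_a"),
]
track_text = snvs_to_bed_track(example_snvs, "example_SNVs", "Hypothetical SNV positions")
print(track_text)
```

Coordinates are specific to an assembly, so a track prepared from positions on one version of the rhesus reference should not be loaded against another version without first converting the coordinates.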

Other information concerning macaque genomic variation

There are multiple different applications for information regarding genetic variation in rhesus populations, including studies of population genetics (Hernandez et al. 2007; Liu et al. 2018), analyses of the genetic causes of macaque pathology relevant to human disease (Bimber et al. 2017; Dray et al. 2018; Moshiri et al. 2019; Peterson et al. 2019; Rogers et al. 2013), applications of genetic markers to assist in the genetic management of captive breeding colonies (Kanthaswamy et al. 2014, 2006; Petty et al. 2021; Smith 1980, 1982), or studies of functional variation influencing normal (non-pathogenic) phenotypic diversity (Warren et al. 2020). Furthermore, while SNVs are the most common type of polymorphism in the macaque genome, gene copy number variants, structural variation, and other types of insertion/deletion polymorphisms may account for a larger number of affected base pairs. The first analysis of gene copy number variation among rhesus macaques was published by Perry and colleagues (Lee et al. 2008). Some analyses have addressed focused questions of genotype–phenotype relationships (Degenhardt et al. 2009), while others have surveyed copy number differences more broadly (Braso-Vives et al. 2020; Thomas et al. 2021; Warren et al. 2020). To my knowledge, there is no large-scale searchable database of copy number variation yet available for this species. However, structural variation (SV) is now recognized as a significant element of human genetic variation and an important contributor to disease risk. As long-read sequencing technologies become more accessible to researchers working on nonhuman primate models, we can expect that information concerning SVs in rhesus macaques and other laboratory primates will increase. Given the potential importance of identifying and studying SVs that influence functional genes in nonhuman primate populations, it will be useful to develop primate databases for SVs similar to those now available for human genetics.

Within the arena of biomedical research, one of the major applications of rhesus macaques is the study of immunology and infectious diseases. The major histocompatibility complex (MHC) loci of this species are more complex than the MHC loci in the human genome because macaques display both sequence variation within particular coding genes and copy number variation such that different rhesus macaques can have different numbers of functional, expressed Class I MHC genes (Wiseman et al. 2013). The Immuno Polymorphism Database (https://ebi.ac.uk/ipd/) maintained by the European Bioinformatics Institute provides extensive information concerning genetic polymorphism in MHC and Killer Immunoglobulin-like Receptor (KIR) genes in rhesus macaques. For other immune-related genes, the IMGT reference database (https://imgt.org/) provides extensive information on macaque immunoglobulin and T-cell receptor gene sequences, as well as equivalent information for a wide range of other species (Giudicelli et al. 2006). Further data on immunoglobulin gene sequences can be found in various publications (e.g., Cirelli et al. 2020). Additional information on immunoglobulin IGH genes is available at the Karolinska Macaque Database: https://kimdb.gkhlab.se/ (Vazquez Bernat et al. 2021).

Gene expression data for rhesus macaques

Analyses of gene expression in rhesus macaques are important for a wide range of biomedical and basic science questions. Large-scale data describing gene expression in rhesus are available as a result of various publications (e.g., Bakken et al. 2015, 2016; W. Zhang et al. 2021; Zhu et al. 2018). Much of this work has focused on gene expression in the brain and during neurodevelopment, but other tissues have also been examined (Peng et al. 2015; Zhang et al. 2021). Data on gene expression are available through the Nonhuman Primate Reference Transcriptome Resource (https://nhprtr.org/), the UCSC genome browser, the RhesusBase browser, NCBI, and Ensembl. As discussed above, significant attention has been given to immunogenetics in rhesus macaques due to their importance in studies of infectious disease. Consequently, substantial data are available regarding the expression of genes involved in immunity and immune system activation (e.g., Palesch et al. 2018). Transcriptome data for various T-cell populations from SIV-infected macaques have been analyzed (Mavigner et al. 2019) as has expression data from dendritic cells (M. Y. Lee et al. 2021).

Other genomic resources

Researchers have also generated other genetic and genomic resources for rhesus macaques. One aspect that has received modest attention is recombination and genetic linkage in the rhesus macaque genome. The first linkage studies using polymorphisms known at the time were published in the mid-1980s (Ferrell et al. 1985). Investigators subsequently produced a whole genome pedigree-based linkage map using microsatellite loci (Rogers et al. 2006) and then later a higher-resolution recombination map using SNV genotypes (Xue et al. 2016, 2020). Variation in mitochondrial DNA sequences has also received substantial attention. It is not possible to cite all mtDNA studies of rhesus macaques here, but several large-scale analyses deserve mention. Comparisons of mtDNA sequence data both among primate species, including genus Macaca (Evans et al. 2020; Roos et al. 2019), and among populations within M. mulatta (Hasan et al. 2014; Smith and McDonough 2005; Su et al. 2019) have been informative for population genetics and phylogeny.

One aspect of primate genomics that has been difficult to study until the recent decrease in the cost of whole genome sequencing is the rate of de novo single-base mutations. Studies in human pedigrees have calculated the rate of de novo nucleotide mutation by comparing DNA sequences of offspring to their parents. Such studies find that most de novo mutations are transmitted by males and that the number of de novo mutations transmitted increases with increasing paternal age (Besenbacher et al. 2015; Jonsson et al. 2017). The same approach is now being applied to nonhuman primates with interesting results (Besenbacher et al. 2019; Thomas et al. 2018; Wu et al. 2020). Among rhesus macaques, as in humans, more de novo mutations are transmitted by males than females and increasing paternal age does increase the rate of observed de novo mutations (Wang et al. 2020). We should expect that over time researchers will learn more about the rate and pattern of de novo DNA sequence mutations in rhesus macaques and other primates and that this information will become a resource for additional downstream analyses of various aspects of molecular genetics, embryonic and postnatal development, aging, disease risk, and other fundamental questions.
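The parent-offspring comparison described above can be sketched in a few lines. The example below assumes genotypes have already been called for a trio and uses a deliberately simple rule (an offspring allele seen in neither parent); the genotypes, site coordinates, and counts are hypothetical, and published studies apply many additional quality filters that are not shown here.

```python
def is_candidate_de_novo(child, mother, father):
    """Flag a site where the child carries an allele present in neither parent."""
    parental_alleles = set(mother) | set(father)
    return any(allele not in parental_alleles for allele in child)


def mutation_rate(n_de_novo, callable_sites):
    """Per-site, per-generation rate; the factor 2 counts both transmitted haploid genomes."""
    return n_de_novo / (2 * callable_sites)


# Hypothetical trio genotypes keyed by (chromosome, position).
trio_sites = {
    ("chr1", 1000): {"child": ("A", "G"), "mother": ("A", "A"), "father": ("A", "A")},
    ("chr1", 2000): {"child": ("C", "T"), "mother": ("C", "C"), "father": ("C", "T")},
    ("chr2", 3000): {"child": ("G", "G"), "mother": ("G", "G"), "father": ("G", "G")},
}
candidates = [
    site
    for site, genotypes in trio_sites.items()
    if is_candidate_de_novo(genotypes["child"], genotypes["mother"], genotypes["father"])
]
print(candidates)

# Hypothetical genome-scale numbers: 30 validated mutations over 2.5 billion callable sites.
print(mutation_rate(30, 2_500_000_000))
```

Only the first toy site is flagged, because the "G" allele in the offspring is absent from both parents; the second site is ordinary inheritance of the paternal "T" allele. The denominator counts only sites where all three individuals could be genotyped reliably, which is why the rate depends on the callable genome rather than the full assembly length.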

Closing Comments

Rhesus macaques are one of the most intensively studied nonhuman primate species. Analyses of the genetics and genomics of rhesus cover a wide range of topics relevant to evolutionary biology, population genetics, genome structure, genome function, and models of human disease. This level of interest and investigation is unlikely to change in the future as rhesus macaques continue to be a mainstay of biomedical and basic primatological research. The amount of genomic information available for rhesus macaques continues to grow rapidly and the number of specialized databases collecting, archiving, and presenting those data is also likely to grow. This partial summary of the information resources available will serve to provide an introduction to what is available today, but is not intended to be comprehensive.