1 Introduction

Population genomics can address very different biological questions related to speciation, divergence of closely related species, within-species population structure, or within-population evolutionary processes that affect adaptation. In the era of next-generation sequencing (NGS) with increasing taxonomic sampling, the limiting factor for population genomics is no longer the number of genetic markers (quantity) but the quality and complexity of the massive amount of available information that needs to be integrated and interpreted.

In this chapter, we focus on studies of population genomics in rodents and in particular on the Murinae. The Murinae, a subfamily of rodents, comprise more than one hundred genera and are among the largest mammalian subfamilies, with species native to most continents. Murinae include the house mouse (Mus musculus) and the brown rat (Rattus norvegicus), laboratory strains of which have been used for decades in biomedical research and as models to study human diseases. Further, as human commensal species, both also harbor vectors that spread infectious diseases, which makes wild-living animals and populations of special interest. Their evolutionary histories also make them excellent models for studying general evolutionary processes, such as speciation, rapid adaptation, and behavioral changes.

1.1 History of the House Mouse

A recent book, “Evolution of the House Mouse” [1], provides a broad overview of a variety of evolutionary aspects of the house mouse. Other general reviews can be found in [2, 3]. Here, we provide a short summary.

The genus Mus consists of four major clades (Coelomys, Mus, Nannomys, and Pyromys), of which the subgenus Mus harbors the species Mus musculus, the house mouse. House mouse genetics began early in the twentieth century, when the first inbred strains were derived from wild animals to study modes of inheritance [4, 5]. The world-wide distribution range of the house mouse is depicted in Fig. 1. It shows three main subspecies: the southeastern Asian house mouse (Mus musculus castaneus), the eastern European house mouse (Mus musculus musculus), and the western European house mouse (Mus musculus domesticus). Next to these main subspecies, there exist other subspecies (e.g., M. m. molossinus, a presumptive hybrid between M. m. castaneus and M. m. musculus [6]; M. m. gentilulus [7, 8]; M. m. homourus [9]; and further recently diverged ones like M. m. helgolandicus [10]). Most inbred strains and the reference genome sequence are derived from M. m. domesticus. The mouse genome was among the first mammalian genomes to be sequenced and was published in 2002 (Mouse Genome Sequencing Consortium, 2002) [8]. The genome consists of 19 autosomes and 2 sex chromosomes (X and Y) with a total length of 2.7 Gbp (currently with 22,612 coding and 15,402 noncoding genes annotated). The mouse ENCODE consortium [11] and genome assemblies of wild-derived inbred strains of the main subspecies have further enhanced the available genomic information [12,13,14], complemented by detailed recombination maps [15,16,17]. Genomic and transcriptomic data from wild-derived populations of the subspecies and the sister species Mus spretus were reported in [14].

Fig. 1 Sampling locations of mice for which public population scale WGS data exist. Population scale sampling locations of house mouse and close relatives

Because the house mouse is one of the most prominent human commensals, its dispersal and phylogenetic history have been studied intensively. The ancestor of all subspecies within Mus musculus was initially thought to have lived in India [1, 18], but broader sampling has shown that the Iranian plateau harbors the highest diversity of lineages, including some as yet unnamed lineages [10]. The main subspecies started to diverge ~350–500 thousand years ago. As expected for recently diverged taxa, phylogenetic discordance between loci is frequent, but the statistical analysis of discordance patterns shows a strong deviation from a neutral model of pure lineage sorting [19]. Based on population data, it was shown that this is most likely due to secondary adaptive introgression, even across large geographic distances [20, 21]. The overall phylogenomic analysis suggests that M. m. musculus and M. m. castaneus are sister groups and that M. m. domesticus is more basal [12, 19].

The subspecies meet in several zones of secondary contact, where they form hybrid zones [2, 18, 22]. Fertility of offspring is impaired across these hybrid zones, and this serves as a general model to study the genetic basis of hybrid sterility as part of speciation processes (e.g., [22,23,24,25,26]).

Studies on house mouse phylogeography showed that the spread of the populations, especially those of M. m. domesticus, reflects human colonization and settlement history. For example, mtDNA haplotypes of mouse samples collected worldwide made it possible to reconstruct some historical human movements, such as the seafaring routes of the Vikings [27,28,29] or the colonization history of sub-Antarctic islands [30, 31].

Systematic population-level sampling of mouse populations was introduced by Ihle et al. [32], with a sampling regime that takes into account that mice tend to inbreed within family groups. Initial microsatellite-based scans of populations sampled in this way suggested a high rate of positive selection between closely related populations [33]. The colonization of Western Europe was traced by fossil evidence [34] and shown to have occurred less than 3000 years ago. Nonetheless, these populations show clear genomic differentiation [20, 32, 33], as well as differences in gene expression [35, 36], ultrasonic vocalization, and mate choice [37, 38]. They also harbor a number of deme-specific MHC haplotypes [39].

Despite existing genomic resources, including a variant database of 17 laboratory inbred strains [12, 40], there was a need to derive laboratory strains that harbor most of the natural variation found in wild-derived populations [41, 42]. Genotyping arrays were designed to maximize variant information at low cost [43]. The still commonly used genotyping arrays are MegaMUGA with a set of 77,808 SNP markers and GigaMUGA with a set of 143,259 SNP markers [44], which represent only a fraction of the variants found between any sequenced inbred strain and the reference genome (~4 to 5 million SNPs; [12, 45]). However, researchers have started to complement their analyses with NGS-based datasets, and genomic resources for wild populations of the house mouse are now a common basis for subsequent analyses [14].

1.2 Brown Rat History

Mice and rats diverged approximately 7–12 million years ago [46]. Similar to house mice, brown rats (Rattus norvegicus) have been used for more than two centuries in biomedical studies to learn about the basis of human diseases and in research on pest management [47, 48]. The genome of the brown rat was published in 2004 [49] and consists of 20 autosomes and 2 sex chromosomes (X and Y) with a total length of 2.8 Gbp (currently with 22,250 coding and 8934 noncoding genes annotated). The house mouse genome and the brown rat genome show a high number of shared syntenic homologous blocks with different levels of recombination [50]. Approximately 30% of the rat genome aligns only with the mouse genome, which might correspond to rodent-specific repeats [49]. A syntenic view of both genomes is given in Fig. 2 to illustrate the pairwise chromosome assignment obtained from the Synteny Portal (see Table 1 for the web page URL; [51]).

Fig. 2 Syntenic blocks between house mouse and rat. House mouse (GRCm38/mm10) and brown rat (Rnor_6.0/rn6) chromosome-wide syntenic blocks obtained via the web-based Synteny Portal [51]

Table 1 Useful public URL links for house mouse resources

The origin of the brown rat (Rattus norvegicus) and the black rat (Rattus rattus) most likely lies in central Asia [52]. Spatial population genomics studies were conducted on brown rats living in New York City [53] and, as in mouse studies, mtDNA haplotype data could disentangle the phylogeography of brown rats in the countries surrounding the South Atlantic Ocean [54]. While the phylogeography of black rats, like that of house mice, reflects human colonization and settlement history [53, 55, 56, 57], brown rats did not appear in Europe until the sixteenth century. Their dispersal routes from Asia to Europe are still under debate [57]. For example, one route is thought to lead via northeast China and Siberia, while another route, inferred from whole-genome sequencing, may represent an expansion via southern East Asia [58]. Figure 3 illustrates the sampling distribution of Rattus norvegicus from publicly available whole-genome data sets.

Fig. 3 Sampling locations of rats for which public population scale WGS data exist. Population scale sampling locations of brown rats obtained from [58]

2 Population Genomics

As mammal species expand, they are faced with new abiotic and biotic factors, such as different climatic conditions, different food, or new pathogens, prey, and/or predators, which potentially lead to adaptation and contribute to shaping the genome over time. Evolutionary changes in the genome can result from mutation, gene flow, random genetic drift, recombination, and selection. Genome-wide scans for deviations from modelled neutrality aim at revealing such evolutionary processes. They can help to identify genotypic and phenotypic variation and, by taking demographic events into account, they can even detect genes under recent positive selection [59]. Negative selection leads to sequence conservation by removing disadvantageous alleles. Positive selection can lead to an excess of nonsynonymous fixed differences or to an altered allele-frequency spectrum (AFS). Multiple approaches exist to detect adaptation, each with its own caveats. For example, dN/dS ratios can be used in comparative studies to detect selection on genes, but this analysis is limited to species that are separated by enough evolutionary distance for a sufficient number of substitutions to have occurred [60]. When samples are drawn from different populations of the same species, it is necessary to study frequency changes of polymorphisms instead of substitutions. Compared to studies with a limited number of neutral markers, population genomics uses high marker density to robustly infer genome-wide effects, usually as signals of departure from the expectations of the neutral theory of molecular evolution (see Chapter 5 for a detailed description of how to detect positive selection).
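To make the notion of an altered AFS concrete, the following minimal Python sketch tabulates a folded AFS from per-site allele counts and compares it with the expectation under the standard neutral model with constant population size. The function names and the toy input are illustrative assumptions and are not taken from any of the cited studies.

```python
def folded_sfs(alt_counts, n_chrom):
    """Folded allele-frequency spectrum.

    alt_counts: per-site count of the alternative allele (0..n_chrom).
    n_chrom:    number of sampled chromosomes (2 x diploid individuals).
    Returns a list where entry i-1 is the number of sites whose minor
    allele is carried by exactly i chromosomes.
    """
    sfs = [0] * (n_chrom // 2)
    for k in alt_counts:
        if not 0 <= k <= n_chrom:
            raise ValueError("allele count outside 0..n_chrom")
        minor = min(k, n_chrom - k)
        if minor == 0:
            continue  # monomorphic in the sample, not part of the AFS
        sfs[minor - 1] += 1
    return sfs


def neutral_folded_expectation(n_chrom, n_segregating):
    """Expected folded AFS under the standard neutral model.

    The unfolded expectation is proportional to 1/i; folding adds the
    classes i and n-i (counted once when i == n-i).
    """
    weights = []
    for i in range(1, n_chrom // 2 + 1):
        w = 1.0 / i + 1.0 / (n_chrom - i)
        if i == n_chrom - i:
            w /= 2.0
        weights.append(w)
    total = sum(weights)
    return [n_segregating * w / total for w in weights]


if __name__ == "__main__":
    # Toy data (hypothetical): 8 sites scored in 4 diploid individuals.
    counts = [1, 7, 2, 4, 0, 8, 3, 1]
    observed = folded_sfs(counts, n_chrom=8)
    expected = neutral_folded_expectation(8, sum(observed))
    print("observed:", observed)
    print("expected:", [round(x, 2) for x in expected])
```

An excess of rare variants relative to the neutral expectation is compatible with a recent sweep or population expansion, which is why demographic history has to be modelled before a deviation is attributed to selection.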

2.1 House Mouse Genetic Variation

Population genetic studies revealed a fairly large effective population size (Ne) for wild natural populations of mice, on the order of Ne = 5 × 10⁵ to 2 × 10⁶ [61, 62], with two to three generations per year. Based on a genotyping array, the effective population sizes for the subspecies were estimated to range from Ne = 0.25 × 10⁵ to 1.2 × 10⁵ for M. m. musculus, Ne = 0.58 × 10⁵ to 2 × 10⁵ for M. m. domesticus, and Ne = 2 × 10⁵ to 7 × 10⁵ for M. m. castaneus [63]. These estimates were recently supported by a population genomic study on nucleotide diversity within the subspecies M. m. castaneus [64]. In the same study, an excess of adaptive substitutions in protein-coding genes, UTRs, and conserved noncoding elements (CNE) was observed [64]. A follow-up study based on the same data inferred the recombination landscape within the same subspecies and revealed that genetic diversity is positively correlated with the rate of recombination [17] (see ref. 13 for the recombination landscape in the collaborative cross [41] and ref. 65 for mouse inbred strains). From a broad-scale map, the frequency-weighted mean estimate of the population-scaled recombination rate was 4Ner = 0.0092 per bp for autosomes and 4Ner = 0.0026 per bp for the X chromosome [17].
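The population-scaled quantities above can be converted into per-generation rates once a value for Ne is assumed. The sketch below is a back-of-the-envelope illustration, not the procedure used in the cited studies: Ne = 5 × 10⁵ is an assumed value within the range reported for M. m. castaneus, the nucleotide diversity of 0.008 is a hypothetical input, and the mutation rate of 5 × 10⁻⁹ is the value used elsewhere in this chapter.

```python
def recombination_rate_per_bp(rho_per_bp, ne):
    """Per-generation recombination rate r from rho = 4 * Ne * r."""
    if ne <= 0:
        raise ValueError("Ne must be positive")
    return rho_per_bp / (4.0 * ne)


def centimorgan_per_megabase(r_per_bp):
    """Convert r (per bp per generation) to cM/Mb.

    For small r, 1 cM corresponds to a recombination fraction of 0.01,
    so cM/Mb = r * 1e6 bp * 100.
    """
    return r_per_bp * 1e6 * 100.0


def ne_from_diversity(pi_per_bp, mu):
    """Effective population size from pi = theta = 4 * Ne * mu."""
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return pi_per_bp / (4.0 * mu)


if __name__ == "__main__":
    assumed_ne = 5e5        # assumption: within the reported castaneus range
    rho_autosomes = 0.0092  # 4Ner per bp for autosomes, from the text [17]
    r = recombination_rate_per_bp(rho_autosomes, assumed_ne)
    print(f"r = {r:.2e} per bp per generation")
    print(f"  = {centimorgan_per_megabase(r):.2f} cM/Mb")

    hypothetical_pi = 0.008  # hypothetical neutral diversity per bp
    print(f"Ne = {ne_from_diversity(hypothetical_pi, 5e-9):.2e}")
```

Under these assumptions the autosomal estimate corresponds to roughly 0.46 cM/Mb. The conversion is only as good as the assumed Ne, so the result should be read as an order-of-magnitude check rather than as a recombination map.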

One candidate gene that is known to influence recombination break points in mammals is PRDM9 [66,67,68,69]. PRDM9 is highly polymorphic in natural populations of the house mouse [70, 71] and it was recently shown that some alleles are preferred over others in hybrid mice [72]. What is remarkable in the study of Booker et al. [17] is the high level of variability of recombination hot spots within one population and between wild-derived and classical inbred strains, which is worth further consideration. For example, phasing approaches should depend on an accurate recombination map and the question arises whether global heterogeneous recombination rates provide sufficient information for fine-scaled phasing inference.

Researchers need to rely on high-quality genome information to perform reference-based whole-genome analysis and to retain variant information for the populations under study. However, in some cases the sequence divergence between the analyzed population and the reference is high and might produce mapping artefacts [73]. To cope with such situations, Sarver et al. [74] used a pseudo-reference-based approach with exome data to infer the phylogenetic relationships and gene tree incongruence within the Mus clade. While Sarver et al. [74] used the D-statistic [75] to detect introgression between M. m. musculus and M. m. domesticus, other methods have recently been applied to infer introgression signals [8, 20, 21, 76, 77].
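The idea behind the D-statistic (often called the ABBA-BABA test) can be sketched in a few lines of Python. The version below works on derived-allele frequencies for a four-taxon tree (((P1, P2), P3), outgroup) and is only a minimal illustration: the function name and the toy frequencies are assumptions, and the block-jackknife procedure that is normally used to assess significance is not shown.

```python
def patterson_d(p1, p2, p3, p4):
    """Patterson's D for the tree (((P1, P2), P3), P4 = outgroup).

    Each argument is a sequence of derived-allele frequencies (0..1),
    one value per site. Sites where P2 and P3 share the derived allele
    contribute to ABBA; sites where P1 and P3 share it contribute to BABA.
    """
    abba = 0.0
    baba = 0.0
    for a, b, c, d in zip(p1, p2, p3, p4):
        abba += (1.0 - a) * b * c * (1.0 - d)
        baba += a * (1.0 - b) * c * (1.0 - d)
    if abba + baba == 0.0:
        raise ValueError("no informative ABBA or BABA sites")
    return (abba - baba) / (abba + baba)


if __name__ == "__main__":
    # Toy example (hypothetical): four sites, outgroup fixed for the
    # ancestral allele, P3 fixed for the derived allele.
    p1 = [0.0, 1.0, 0.0, 0.0]
    p2 = [1.0, 0.0, 1.0, 0.0]
    p3 = [1.0, 1.0, 1.0, 1.0]
    p4 = [0.0, 0.0, 0.0, 0.0]
    print(round(patterson_d(p1, p2, p3, p4), 3))
```

Under incomplete lineage sorting alone, ABBA and BABA patterns are expected to be equally frequent, so D is close to zero; a significantly positive value suggests excess allele sharing between P2 and P3, a significantly negative value between P1 and P3.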

In their genomic comparison, Harr et al. [14] incorporated the two other house mouse subspecies M. m. domesticus and M. m. musculus together with the M. m. castaneus samples. In total, this study covers a divergence time of roughly two million years by complementing the data with samples from the sister species M. spretus and the recently diverged subspecies M. m. helgolandicus [14] (see Fig. 1). In combination with the short generation time of mice, this constitutes a substantial molecular divergence, which is, for example, larger than the divergence between humans and other Hominidae across the same time scale. Figure 4 shows the inferred population sizes for M. m. domesticus and the recently diverged M. m. helgolandicus; this data set was analyzed with the smc++ software, setting the mutation rate to μ = 5 × 10⁻⁹ per base pair per generation [78].

Fig. 4 Inferred population history for subspecies of the house mouse. Effective population size inference across populations of the house mouse subspecies M. m. domesticus and Mus musculus helgolandicus. SNP data from [14] were filtered to only retain intergenic regions without any feature annotation. For each population a separate smc++ [78] model was created setting the per generation mutation rate to 5 × 10⁻⁹ (see Note 1 for a detailed method description)

Population genetic variation in segmental duplications (copy number variation) was systematically studied by Pezer et al. [79]. They found among the most copy-number variable genes three highly conserved genes that encode the splicing factor CWC22, the spindle protein SFI1, and the Holliday junction recognition protein HJURP. These genes showed population-specific expansion patterns that suggested an involvement in local adaptations. Other variable genes were found to encode proteins that are relevant for environmental and behavioral interactions, such as vomeronasal and olfactory receptors, as well as major urinary proteins. In a follow-up study, it was suggested that duplications in the Androgen-binding protein gene region might specifically have contributed to species diversification [80].

Another study also identified the CWC22 region as a region that shows major segmental duplication in the house mouse. It received the genetic name R2d, and it was shown that the structural mutation rate appears to depend on the diploid configuration at that locus [81]. By reconstructing the origin and history of copy-number variants (CNVs), the study of Morgan et al. [81] is a nice example of how important refined analyses are to disentangle complex genome structures. This is particularly true for genomic regions that are duplicated and absent from the reference genome, which the authors termed the “missing genome” [81].

The sequence and structural diversity of Y chromosomes in natural populations was studied in [82]. Compared to other mammals, the mouse Y chromosome is larger and harbors more annotated genes. The authors showed that copy number on the long arm of both sex chromosomes is highly variable, whereas sequence diversity in nonrepetitive regions is low compared to autosomes.

The autosomal AFS of neutral intergenic regions was used to infer the demography of all subspecies with the software “∂a∂i” [83]. All simple models applied predicted effective population sizes that fall inside the range mentioned above (M. m. domesticus: Ne = 1.6 × 10⁵, M. m. musculus: Ne = 1.6 × 10⁵, M. m. castaneus: Ne = 4.2 × 10⁵; [82]) but could not explain the reduction of sex chromosome diversity. Important findings are, for instance, that there is a moderately strong selective sweep on the Y chromosome in the M. m. domesticus population and that positive selection on genes expressed in the male germline might shape the sex chromosomes.

2.2 Brown Rat Genetic Variation

Rats, and in particular the species Rattus norvegicus, have an effective population size comparable to that of the mouse subspecies M. m. domesticus and M. m. musculus. Deinum et al. [84] estimated the effective population size to be Ne = 1.24 × 10⁵, based on silent mutations in 12 wild-derived animals. The authors highlight a recent bottleneck in rats (20,000 years ago) based on a “PSMC” [85] analysis (see Chapter 7 for a discussion of MSMC and MSMC2). This bottleneck might be the cause of negative estimates of the rate of adaptive evolution in proteins and noncoding elements. Compared to mice, rats show a larger proportion of mildly deleterious mutations and concordantly a lower proportion of highly deleterious mutations [84]. However, the reduction in diversity around exons is comparable to values obtained for mice [64]. Considering the different Ne of mice and rats, Deinum et al. [84] estimated linkage disequilibrium (LD) decay to be six to seven times faster in mice than in rats.

As for mice, researchers looked into speciation and introgression events using population genomics. Teng et al. [86] used the Himalayan field rat (Rattus nitidus), which is geographically restricted to Southeast Asia, as an outgroup to investigate introgression in brown rats sampled in China. With whole-genome data from 44 individuals, Ne was estimated at 2.53 × 10⁵ for brown rats and 5.18 × 10⁵ for Himalayan field rats, a difference of similar order to that between the house mouse subspecies M. m. musculus and M. m. castaneus. According to the “PSMC” analysis, the sibling species R. norvegicus and R. nitidus diverged ~650 thousand years ago, that is, within a time frame where the mouse divergence is suggested to be at the level of subspecies. The proportion of admixed fragments was estimated at 1.59%, with admixture block sizes from 100 kbp to 1.42 Mbp [86]. Among the 346 introgressed regions detected, 92 loci were classified as adaptive. The strongest candidate is located on chromosome 1 and overlaps with the “vomeronasal 1 receptor cluster,” which encodes chemical communication proteins. As in mice [20], the regions were enriched in biological terms like “chemosensory perception” and “immune response.” Next to regions showing signals of introgression, 352 regions were identified as having undergone a selective sweep, based on allele frequency differentiation between populations (“XP-CLR” [87]) and cross-population extended haplotype homozygosity (“XP-EHH” [88]). Like the introgressed regions, these are enriched in proteins involved in immune response and metabolism.

Zeng et al. [58] extended the publicly available whole-genome sample set of brown rats to a world-wide distribution. With more than 100 individuals, the authors investigated the geographic origin and migration paths. In contrast to the previous hypothesis that Rattus norvegicus dispersed from northern Asia to Europe, their data support a southern East Asian dispersal route to Europe [58]. Similar to Teng et al. [86], Zeng et al. [58] identified candidate genes with signatures of positive selection that are associated with the immune response by comparing European and Chinese populations.

3 Examples of Genes Under Positive Selection

In this section, we discuss several examples of genes and genomic regions that have been shown to be involved in adaptation in mice and rats. One prominent example is the evolution of resistance against warfarin, a poison used in rodent pest management.

3.1 Rodent Resistance to Anticoagulants: Vkorc1

As vectors for human diseases, rodent populations have been controlled with rodenticides for more than half a century. Common rodenticide compounds (e.g., warfarin) interfere with blood coagulation by targeting the vitamin K epoxide reductase reaction [89]. Several mutations that confer resistance against warfarin have been found within the Vkorc1 gene in house mice and brown rats [90]. Song et al. [76] suggested that an allele introgressed from the Algerian mouse (Mus spretus) into M. m. domesticus led to anticoagulant resistance. Both species live today in sympatry in south-western Europe. Vkorc1 has been subject to adaptive protein evolution in M. spretus since it separated from other Mus lineages, and four introgressed polymorphisms could be linked to a strong resistance phenotype [76, 91]. Based on whole-genome data [14], this region shows negative Tajima’s D values within western European mouse populations in contrast to a population from Iran (see Fig. 5a), compatible with recent positive selection acting on it.

Fig. 5 Views from the UCSC genome browser showing haplotypes, nucleotide diversity and Tajima’s D values for M. m. domesticus subpopulations. UCSC tracks are shown for (a) Vkorc1 region on chromosome 7 and (b) Xpr1 region on chromosome 1. Tracks were obtained from data published in [14] via a “public track hub” showing haplotypes from SNPs, nucleotide diversity (pi) and Tajima’s D values for the subpopulations from France (DOM-FRA), Germany (DOM-GER) and Iran (DOM-IRA). pi and Tajima’s D were calculated in 10 kbp windows (see Table 1 for web page URL link)
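For readers who want to reproduce window-based summaries like those in Fig. 5, the following Python sketch computes nucleotide diversity and Tajima’s D from per-site derived (or alternative) allele counts in fixed 10 kbp windows. It is a simplified illustration under stated assumptions, not the pipeline behind the published tracks: function names and inputs are hypothetical, every site in a window is assumed to be callable, and missing data are ignored.

```python
import math


def diversity_and_tajimas_d(allele_counts, n):
    """Return (pi, D) for one window.

    allele_counts: per-site allele count among n sampled chromosomes.
    pi is the mean number of pairwise differences summed over sites.
    D follows Tajima (1989); it is NaN when no site segregates.
    """
    if n < 4:
        raise ValueError("Tajima's D needs at least 4 chromosomes")
    seg = [k for k in allele_counts if 0 < k < n]
    s = len(seg)
    pi = sum(k * (n - k) for k in seg) / (n * (n - 1) / 2.0)
    if s == 0:
        return pi, float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1.0) / (3.0 * (n - 1.0))
    b2 = 2.0 * (n * n + n + 3.0) / (9.0 * n * (n - 1.0))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2.0) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return pi, d


def windowed_stats(positions, allele_counts, n, window=10_000):
    """Map window start (0-based) -> (pi per bp, Tajima's D).

    positions are 1-based coordinates on a single chromosome. Dividing
    pi by the window size assumes that all sites were callable.
    """
    bins = {}
    for pos, k in zip(positions, allele_counts):
        start = (pos - 1) // window * window
        bins.setdefault(start, []).append(k)
    result = {}
    for start in sorted(bins):
        pi, d = diversity_and_tajimas_d(bins[start], n)
        result[start] = (pi / window, d)
    return result


if __name__ == "__main__":
    # Toy window (hypothetical): ten singletons in 10 chromosomes.
    print(diversity_and_tajimas_d([1] * 10, 10))
```

A window dominated by rare variants, as in the toy example, yields a negative D, which is the pattern described for the Vkorc1 region in western European populations; windows dominated by intermediate-frequency variants yield positive values.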

3.2 Pathogen Related Resistance: Xpr1

Next to artificial human-made selection pressure, there exists natural selection caused by pathogens. Hasenkamp et al. [92] studied the gene Xpr1, which codes for the receptor of murine leukemia virus (MLV). They found that the gene has been subject to a recent selective sweep in the population from Iran and that the selected haplotype has adaptively introgressed into a population from France, where it has mixed with existing haplotypes and thus creates a higher average population diversity than in the nonintrogressed population from Germany (see Fig. 5b). It seems that the Xpr1 gene itself is under frequent positive selection and that alleles coping with new virus variants can spread rather quickly into other subpopulations if these are actively dealing with infectious cycles of that virus variant [92].

3.3 Segmental Duplications and Selective Sweeps: R2D2

As mentioned above, R2d is a CNV region on chromosome 2 that was found to cause nonrandom segregation [93]. Didion et al. [93] showed that signatures of selective sweeps obtained via genome-wide scans can be mimicked by “selfish” alleles. Within the 127 kbp genome region of R2d there is one annotated gene, namely Cwc22, which encodes a spliceosomal protein. A haplotype-sharing analysis of almost 400 individuals sampled across Europe revealed that all individuals with an extreme excess of shared identity showed a high copy number of R2d. If only one subpopulation was analyzed, the haplotype-sharing methods failed to detect this “selfish” sweep. However, if individuals from different geographic locations were included in the analysis, R2d was identified as a selective sweep. Morgan et al. [81] showed for the same locus that an initial duplication event ~3.5 million years ago led to R2d1 and R2d2 and that, therefore, mouse strains containing a single copy must have lost the second one. The authors identified nonallelic gene conversion tracts in R2d1 that were transferred from R2d2 and caused the appearance of deep coalescence among R2d1 sequences [81]. Given both the patterns of concerted evolution and the evolutionary dynamics of the selfish alleles, this could be a case of evolution through “molecular drive” [94].

3.4 The t-Haplotype as Meiotic Drive Element

Meiotic drive elements, or segregation distorters, transmit themselves to more than 50% of the progeny of heterozygous individuals. The mouse t-haplotype, located within several inversions on chromosome 17, is a classic example of such a meiotic drive element [1]. Despite a strong driving capacity, t-haplotypes remain at relatively low frequency in natural populations, since homozygous individuals have strongly reduced viability [95]. The population genomics of the t-haplotype was studied in [96] based on the data provided in [14]. The authors found evidence for an accumulation of nonsynonymous substitutions within the inversions, but also signatures of recombination events that appear to have regenerated coding sequences that had accumulated deleterious mutations. Based on the corresponding transcriptome data in [14], they could show that individuals carrying a t-haplotype also display a change in the testis expression of genes outside of the t-complex.

4 Conclusion

The reduction in per-sample sequencing costs has led to an exponential increase in available whole-genome data for model and nonmodel organisms. Being among the longest-studied mammals, both house mouse and brown rat have proven to serve as models for studying the processes that shape genome evolution in natural populations, including introgression and positive selection. However, while the public domain is steadily filled with datasets usable for population genomics, there is still a gap between studies that predict candidates and studies that functionally validate them. As a consequence, functional studies that test whether genes have a direct impact on fitness in a certain species should be extended. The experimental setup to measure fitness will always depend on the species and should be embedded in an environmental context.

5 Note

  1. SMC++ [78] analysis is based on 24 Mus musculus domesticus and 3 Mus musculus helgolandicus individuals described earlier [14]. SMC++ version 1.12.1 was used to infer population history for Mus musculus domesticus subpopulations based on a Variant Call Format (VCF) file obtained via the following URL: http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/vcf/AllMouse.vcf_90_recalibrated_snps_raw_indels_reheader_PopSorted.vcf.gz. First, bcftools version 1.3.1 [97] was used (bcftools filter) to filter SNP positions adjacent to indels (--SnpGap 3), setting genotypes of failed samples to missing values (--set-GTs .) and excluding all sites with either low coverage or low genotype quality (FORMAT/DP<5 | FORMAT/GQ<30). Further, bcftools was used to retain only biallelic SNPs (bcftools view -m2 -M2 -v snps) and SMC++ was used to convert the VCF file to SMC++ format. Only the subpopulations indicated above were retained from the input VCF file, and autosomes were extracted individually while additionally masking all exons, regulatory features, simple repeats, and missing sites from the reference mm10 (exons URL: ftp://ftp.ensembl.org/pub/release-90/gtf/mus_musculus/Mus_musculus.GRCm38.90.chr.gtf; regulatory features URL: ftp://ftp.ensembl.org/pub/release-90/regulation/mus_musculus/mus_musculus.GRCm38.Regulatory_Build.regulatory_features.20161111.gff; simple repeats URL: http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/simpleRepeat.txt). The per generation mutation rate was set to 5 × 10⁻⁹ to fit a size history for each subpopulation based on the extracted autosome data (smc++ estimate 5e-9 chr*.smc.gz) and plotted with SMC++ (smc++ plot) as shown in Fig. 4.
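The steps of this note can be sketched as a small Python helper that assembles the command lines without executing them, so that they can be inspected before being run. This is an assumed reconstruction rather than the exact commands used for Fig. 4: the output file names, the sample identifiers, the mask file, the bcftools output and indexing options (-O z, -o, bcftools index -t), the use of -e for the exclusion expression, the smc++ vcf2smc step with its -m mask option, and the -o option of smc++ estimate are not given in the note and are assumptions that should be checked against the installed tool versions.

```python
import shlex


def build_smcpp_commands(vcf_in, population, samples, chromosomes,
                         mask_bed, mutation_rate="5e-9", workdir="smc_out"):
    """Assemble (but do not run) the filtering and SMC++ commands.

    All paths below workdir are hypothetical names chosen for this sketch.
    """
    filtered = f"{workdir}/filtered.vcf.gz"
    biallelic = f"{workdir}/biallelic_snps.vcf.gz"
    commands = [
        # Mask SNPs within 3 bp of an indel; set low-coverage or
        # low-quality genotypes to missing.
        ["bcftools", "filter", "--SnpGap", "3", "--set-GTs", ".",
         "-e", "FORMAT/DP<5 | FORMAT/GQ<30",
         "-O", "z", "-o", filtered, vcf_in],
        # Keep biallelic SNPs only.
        ["bcftools", "view", "-m2", "-M2", "-v", "snps",
         "-O", "z", "-o", biallelic, filtered],
        # Index the result (assumed prerequisite for the conversion step).
        ["bcftools", "index", "-t", biallelic],
    ]
    pop_spec = f"{population}:{','.join(samples)}"
    smc_files = []
    for chrom in chromosomes:
        out = f"{workdir}/{chrom}.smc.gz"
        # One SMC++ input file per autosome; mask_bed is assumed to hold
        # exons, regulatory features, simple repeats and missing sites.
        commands.append(["smc++", "vcf2smc", "-m", mask_bed,
                         biallelic, out, chrom, pop_spec])
        smc_files.append(out)
    commands.append(["smc++", "estimate", "-o", workdir,
                     mutation_rate, *smc_files])
    commands.append(["smc++", "plot", f"{workdir}/plot.pdf",
                     f"{workdir}/model.final.json"])
    return commands


if __name__ == "__main__":
    cmds = build_smcpp_commands(
        vcf_in="AllMouse.vcf.gz",                # shortened, hypothetical name
        population="DOM_GER",                    # hypothetical label
        samples=["sample1", "sample2"],          # hypothetical sample IDs
        chromosomes=[f"chr{i}" for i in range(1, 20)],
        mask_bed="mask.bed.gz",                  # hypothetical mask file
    )
    for cmd in cmds:
        print(shlex.join(cmd))
```

Building the commands as argument lists keeps the quoting of the filter expression explicit; each list can be passed to subprocess.run once the paths and sample names have been adapted to the actual data.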