The genetic diversity captured in the major mouse genetic resources depends on the number and identity of parental strains involved in their derivation, as well as the breeding design used to generate the resource (Fig. 1). With the resequencing of the mouse genome, there are new insights into the single nucleotide polymorphism (SNP) architecture captured by widely used mouse genetic resources. However, since the mouse genome resequencing project did not include every parental strain used in common genetic resources, we conservatively replaced each nonsequenced strain with an appropriate substitute, ensuring that our analysis of the SNP architecture does not underestimate the actual diversity present in existing resources. Resequencing to estimate the false-positive and false-negative rates in the Perlegen data has been reported (Yang et al. 2007). The missing variation is for the most part randomly distributed. However, resources such as the CC that include wild-derived strains will have underestimates of the true variation captured, while those lacking wild-derived strains will have overestimates because of the high false-negative SNP call rate in wild-derived strains. In all analyses, we considered a polymorphic variant to be captured if both alleles are represented among the parental strains of a particular mouse genetic resource. It should be noted, however, that the diversity present in the founder population of each resource represents the upper bound of the diversity that can be captured by the derived resource. The actual diversity captured may be lower, particularly in small resources, due to genetic drift during generation of the resource. The color scheme used for the classical inbred strains in Fig. 1 reflects the recent finding that their genomes are largely derived from M. m. domesticus (Yang et al. 2007).
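The capture criterion described above can be illustrated with a minimal sketch. It assumes biallelic SNPs with no missing calls, and the genotypes and the function name fraction_captured are hypothetical, invented for illustration rather than taken from the resequencing data or from the pipeline used in this study.

```python
from typing import Dict, Sequence

# Hypothetical toy genotypes: one allele character per SNP, same SNP order in
# every strain. These values are invented and are NOT real resequencing calls.
TOY_GENOTYPES: Dict[str, str] = {
    "C57BL/6J": "AAAAAA",
    "DBA/2J":   "AABAAA",
    "CAST/EiJ": "ABBBAB",
    "PWD/PhJ":  "BBABBB",
}


def fraction_captured(founders: Sequence[str], genotypes: Dict[str, str]) -> float:
    """Fraction of SNPs at which both alleles occur among the founders."""
    n_snps = len(next(iter(genotypes.values())))
    captured = 0
    for i in range(n_snps):
        # Set of alleles carried by the founders at SNP i
        alleles = {genotypes[strain][i] for strain in founders}
        if len(alleles) >= 2:  # both alleles represented: SNP is captured
            captured += 1
    return captured / n_snps


print(fraction_captured(["C57BL/6J", "DBA/2J"], TOY_GENOTYPES))
print(fraction_captured(["C57BL/6J", "CAST/EiJ"], TOY_GENOTYPES))
```

In this toy set the first pair captures one of six SNPs and the second pair captures four of six. The value is a founder-level upper bound and does not model drift during breeding.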
Diversity captured is a function of the number of parental strains
Most resources used in genetic studies are derived from crosses involving two parental strains or multiples thereof in order to introduce equivalent variation from each parental strain. Therefore, we used the mouse genome resequencing data to determine the range (maximum, minimum, and average) of diversity captured in any theoretical resource involving any 2, 4, 8, and 16 parental strains (Fig. 2). As expected, on average the diversity captured increases with the number of parental strains involved. However, there is an extremely wide variation in the level of diversity captured within a given number of parental strains and a large overlap between the diversity that can be captured in resources with different numbers of parental strains. Our analysis reveals that the CC outperforms all combinations of two or four parental strains. However, an optimal set of four strains would capture a level of genetic diversity similar to, albeit lower than, that present in the CC. We conclude that although the number of strains is an important factor in determining the level of diversity captured in a given resource, other factors such as the identity of the parental strains are of much greater consequence. This is illustrated by comparing the B.P CSS (two parental strains) with the Northport HS (eight parental strains). Because all Northport HS parental strains have a common ancestry, they contribute a relatively small amount of additional variation per strain. Conversely, because the two parental strains of the B.P CSS represent different subspecies, they capture over half of the known polymorphic sites within the mouse genome. Similarly, when the Northport HS is compared with the CC (also derived from eight parental strains), the level of diversity is almost threefold higher in the CC (89% vs. 36%). This is expected since the CC has at least one representative from all three subspecies.
The CC captures 89% of the variation in the mouse genome, which is close to the maximal amount of variation that can be captured by eight strains (97% by 129S1/SvImJ, CAST/EiJ, DBA/2J, FVB/NJ, KK/HlJ, MOLF/EiJ, PWD/PhJ, and WSB/EiJ).
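The minimum, average, and maximum diversity captured by all strain sets of a given size can be sketched as an exhaustive enumeration. This is a minimal illustration under stated assumptions: the genotypes are hypothetical toy values, the function name capture_range is invented here, and a SNP is counted as captured when both alleles occur among the chosen strains.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Tuple

# Hypothetical toy genotypes (invented; NOT real resequencing calls).
TOY_GENOTYPES: Dict[str, str] = {
    "C57BL/6J": "AAAAAA",
    "DBA/2J":   "AABAAA",
    "CAST/EiJ": "ABBBAB",
    "PWD/PhJ":  "BBABBB",
}


def capture_range(genotypes: Dict[str, str], k: int) -> Tuple[float, float, float]:
    """Return (min, mean, max) fraction of SNPs captured over all k-strain sets."""
    n_snps = len(next(iter(genotypes.values())))
    fractions = []
    for founders in combinations(genotypes, k):
        # Count SNPs at which the chosen strains carry more than one allele
        captured = sum(
            1 for i in range(n_snps)
            if len({genotypes[s][i] for s in founders}) > 1
        )
        fractions.append(captured / n_snps)
    return min(fractions), mean(fractions), max(fractions)


for k in (2, 4):
    print(k, capture_range(TOY_GENOTYPES, k))
```

With 16 sequenced strains, the largest enumeration (sets of eight) comprises 12,870 combinations, so a brute-force pass of this kind stays tractable.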
Diversity captured is a function of the subspecific origin of the parental strains
A recent analysis of the mouse genome resequencing data demonstrates that over 92% of the genome of classical inbred strains is derived from the M. m. domesticus subspecies, and, unexpectedly, approximately 75% of the genome of MOLF/EiJ is of M. m. musculus origin (Yang et al. 2007). Based on these observations, it is possible to assign each of the 16 sequenced strains to a major subspecies (see Fig. 1 for assignments). After plotting the fraction of genetic diversity captured by strain sets of a given size (Fig. 3), it is clear that the distributions are multimodal. Furthermore, each lobe in these distributions clusters perfectly according to the number of subspecies represented among the parental strains (indicated by different shades of purple in Fig. 3). This indicates that the number of subspecies contributing to a particular resource is the major determinant of the level of genetic variation captured. This analysis also shows that the fraction of diversity captured by most existing resources is small, particularly those that have only one contributing subspecies like the BXD RI or Northport HS. We also analyzed other resources and found that the conclusions reached for BXD RI apply to other RI panels such as AXB/BXA (C57BL/6J and A/J), CXB (BALB/cByJ and C57BL/6J), AKXD (AKR/J and DBA/2J), and BXH (C57BL/6J and C3H/HeJ) (Table 1). Similarly, CSS derived from the introgression of A/J or 129S1/SvImJ chromosomes into the C57BL/6J background (Nadeau et al. 2000) or the Boulder HS derived from C57BL/6, BALB/c, RIII, AKR, DBA/2, I, A/J, and C3H lead to similar results. While these genetic resources capture little variation because all of these strains are derived from the M. m. domesticus subspecies, the B.P CSS, B6.CAST CON, and LSDP fare better since they have representatives from two subspecies, M. m. domesticus and M. m. musculus or M. m. castaneus.
This analysis also explains why the CC, with all three subspecies represented, dramatically outperforms other genetic resources in capturing genetic diversity.
Table 1 Genetic variation captured by widely accessible mouse genetic resources
Spatial distribution of the diversity varies significantly among resources
In addition to the total diversity captured, it is critical to consider how the variation captured in each resource is distributed across the genome. When such analyses are performed (Fig. 4), they reveal that the BXD RI, Northport HS, and LSDP genetic resources show complex multimodal distributions, with many intervals capturing very little variation and a variable number of intervals capturing a larger fraction of the available variation. In contrast, the B.P CSS and the CC have unimodal distributions centered on their respective genome-wide means (Fig. 2). It is also evident that the CC outperforms all other resources in uniformly capturing a large fraction of the available genetic variation.
When the distribution of the variation captured is plotted in consecutive high-resolution intervals (Fig. 5), it becomes evident that only the CC maintains a uniformly high level of variation while all other resources vary dramatically from interval to interval. Such uneven distributions destroy the uniformity required for systems biology analyses and lead to extended blind spots, regions with little or no variation, in resources like the BXD RI panel. Most interestingly, blind spots are also present in the Northport HS, the B.P CSS, and the LSDP resources, although their locations vary among the resources. An important corollary is that blind spots are found in both gene-dense and gene-poor regions, creating potentially dramatic negative consequences when saturating the genome in the search for functional interactions among genes and phenotypes.
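The interval-based view of captured variation can be sketched as follows. The window size, the blind-spot threshold, and the function names capture_by_interval and blind_spots are assumptions made for illustration, since the interval length and any cutoff used for Fig. 5 are not specified here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def capture_by_interval(
    positions: Sequence[int], captured: Sequence[bool], window: int
) -> Dict[int, float]:
    """Fraction of SNPs captured in each consecutive window of `window` bp.

    positions[i] is the coordinate of SNP i on one chromosome and captured[i]
    says whether both alleles of SNP i are present among the founders.
    Windows that contain no SNP are absent from the result.
    """
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for pos, is_captured in zip(positions, captured):
        idx = pos // window  # index of the window containing this SNP
        totals[idx] += 1
        hits[idx] += 1 if is_captured else 0
    return {idx: hits[idx] / totals[idx] for idx in sorted(totals)}


def blind_spots(fractions: Dict[int, float], threshold: float) -> List[int]:
    """Indices of windows whose captured fraction is at or below `threshold`."""
    return [idx for idx, frac in fractions.items() if frac <= threshold]


# Hypothetical toy input: eight SNPs on a 3-kb stretch, 1-kb windows.
toy_positions = [100, 400, 900, 1200, 1500, 1900, 2100, 2600]
toy_captured = [True, True, False, False, False, False, True, False]
toy_fractions = capture_by_interval(toy_positions, toy_captured, window=1000)
print(toy_fractions)
print(blind_spots(toy_fractions, threshold=0.1))
```

In this toy input the middle window contains three SNPs, none of them captured, so it is the only window reported as a blind spot.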
Allele frequency of the variation captured
In addition to the level and distribution of the variation captured, the frequency of the minor alleles can impact the utility of a particular genetic resource. Therefore, to compare this characteristic among the different genetic resources, we determined, for each resource considered in this study, the allele frequencies at the 8.3 million SNPs reported by the mouse genome resequencing project (Fig. 6). For reference, we also show the distribution of allele frequencies reported for human populations (pink bars in Fig. 6) (Kruglyak and Nickerson 2001) and the fraction of SNPs that are not captured in each mouse genetic resource or that have very low allele frequency (≤1%) in humans. Resources fall into two distinct groups, with the BXD RI and B.P CSS having a uniform 50% allele frequency at the captured variants, as would occur with any resource that is equally derived from two parental strains. Conversely, the Northport HS, the LSDP, and the CC have a true distribution in which the fraction of SNPs captured decreases as the minor allele frequency increases. Among the latter group, the CC retains the most desirable distribution because the total number of variants with high minor allele frequency is significantly higher than that found in either the Northport HS or the LSDP. Interestingly, even though the CC is derived from only eight parental strains, the allele frequency distribution is remarkably similar to that observed in humans.
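The minor allele frequency comparison can be illustrated with a short sketch. It assumes biallelic SNPs, equal contributions from every parental strain, and no drift, so the frequency of an allele in the resource equals its frequency among the founders. The genotypes and the function name maf_spectrum are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence

# Hypothetical toy genotypes (invented; NOT real resequencing calls).
TOY_GENOTYPES: Dict[str, str] = {
    "C57BL/6J": "AAAAAA",
    "DBA/2J":   "AABAAA",
    "CAST/EiJ": "ABBBAB",
    "PWD/PhJ":  "BBABBB",
}


def maf_spectrum(founders: Sequence[str], genotypes: Dict[str, str]) -> Counter:
    """Count SNPs by minor allele frequency among equally weighted founders.

    A SNP that is monomorphic among the founders (not captured) gets MAF 0.0.
    """
    n_snps = len(genotypes[founders[0]])
    spectrum: Counter = Counter()
    for i in range(n_snps):
        counts = Counter(genotypes[s][i] for s in founders)
        if len(counts) == 2:
            maf = min(counts.values()) / len(founders)
        else:
            maf = 0.0  # only one allele present among the founders
        spectrum[maf] += 1
    return spectrum


print(maf_spectrum(["C57BL/6J", "CAST/EiJ"], TOY_GENOTYPES))
print(maf_spectrum(list(TOY_GENOTYPES), TOY_GENOTYPES))
```

Under these assumptions a two-founder resource can only yield frequencies of 0 or 0.5, which matches the behavior described for the BXD RI and B.P CSS, whereas adding founders spreads captured SNPs across intermediate frequencies.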