Background

A genome is constantly damaged by internal metabolic factors and external environmental factors. To maintain genome stability, living organisms are equipped with a highly sophisticated DNA damage repair (DDR) system that repairs the damage effectively. The DDR system is composed of multiple pathways, including homologous recombination (HR), non-homologous end joining (NHEJ), the Fanconi anemia pathway (FA), base excision repair (BER), nucleotide excision repair (NER), mismatch repair (MMR), and single-strand annealing (SSA). Each pathway consists of a group of genes that act together to repair a specific type of DNA damage.

As DNA damage repair is vital for survival, evolutionary selection would be expected to play a role in maintaining highly functional DNA damage repair machinery for survival and better fitness. BRCA1 and BRCA2 (BRCA) are two important DDR genes that repair DNA double-strand breaks through the homologous recombination (HR) pathway, and mutations in BRCA substantially increase cancer risk [1, 2]. Studies have indeed revealed that BRCA in humans and their close relatives, chimpanzees and bonobos, is under positive selection [3]. However, the same type of selection was not observed in other mammals [4,5,6,7,8,9]. This raises the possibility that the same DNA damage repair genes could be under different evolutionary selection in different species [10]. With a few exceptions, however, nearly all BRCA variation data reported from non-human mammals were derived from a single individual of the tested species. From a population genetics point of view, it is questionable whether an observation made in a single individual can represent the situation in the species as a whole. Further, few DDR genes other than BRCA have ever been analyzed for evolutionary selection (Table 1). Therefore, the relationship between the DDR system and evolutionary selection, a fundamental question in biology concerning the mechanisms of genome stability maintenance, remains unclear.

Table 1 Previous evolutionary studies in BRCA1 and BRCA2 in mammals

Dynamic monitoring of genetic variation is a powerful approach to studying evolutionary selection. This is best exemplified by variation studies in E. coli, which followed its continuous growth for over 60,000 generations across four decades under laboratory culture conditions [15], and in the laboratory rat, which followed genetic variation in genes involved in learning, circadian rhythm, and metabolism [16]. C57BL/6 J is one of the most widely used laboratory mouse models in biological and oncological studies. C57BL/6 J is descended from a cryopreserved embryo stock with a clear genetic background (Fig. 1). Its DNA extracted in 1998 was used by the Mouse Genome Project to generate the mouse genome reference sequences [18], and its DNA extracted in 2003 was sequenced again to generate the B6Eve genome sequences [17]. From 1998 to 2018, C57BL/6 J passed through 30 generations. We hypothesized that this period is long enough to make C57BL/6 J a suitable model for testing evolutionary selection in the DDR system, and that the information could help in understanding evolutionary selection on the DDR system in rodents as represented by C57BL/6 J.

Fig. 1

Origin and generations of C57BL/6 J. The C57BL/6 J strain originated in 1921. Its genome in 1998 was sequenced by the Mouse Genome Project to develop the mouse genome reference sequences. After 14 generations, its genome in 2003 was sequenced to develop the B6Eve genome sequences. The DNA used in the current study was derived from 2018 C57BL/6 J mice, 30 generations after the genome was sequenced in 1998. See reference [17] for more details

In the present study, we sequenced the coding regions of the C57BL/6 J genome using DNA collected from 44 C57BL/6 J individuals in 2018. We searched for variants that arose after 1998 by comparison with the mouse genome reference sequences derived from 1998 C57BL/6 J DNA and with the B6Eve genome sequences derived from 2003 C57BL/6 J DNA. We found no evidence of genetic variation arising in the 169 DDR genes, including Brca1 and Brca2, during this period, whereas we did identify genetic variation in 116 non-DDR genes involved in other functional categories. From these data, we conclude that the DDR system in C57BL/6 J was evolutionarily stable during this 30-generation period.

Results

Identifying genetic variants

The C57BL/6 J genome in 1998 was sequenced by the Mouse Genome Project to generate the mouse genome reference sequences. Since then, C57BL/6 J mice have been inbred for 30 generations (24 in Jackson Laboratory and 4 in University of Macau Animal Facility) by 2018 (14, Fig. 1). We collected genomic DNA from 44 C57BL/6 J mice in 2018, performed exome sequencing, and called coding variants. We applied the following procedures to ensure the accuracy of the variants called from the exome sequences: 1) keeping only the variants present in at least 50% (22 individuals) of the mice for further analysis; 2) using both the mm7 and mm10 assemblies of the mouse genome reference sequences as references for variant calling; 3) using the B6Eve variants as a third reference; 4) using Sanger sequencing to validate the called variants. From the exome sequences of the 2018 C57BL/6 J DNA, we identified a total of 3024 variants (Supplementary Table 1), of which 883 (29.2%) were singletons, 1329 (43.9%) were present in 2 to 21 mice, and 812 (26.9%) were present in at least 22 mice and were used for further analysis (Supplementary Table 2). We reasoned that setting this high bar would better capture population-level variation rather than individual variation.
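The carrier-count filter described above can be illustrated with a minimal Python sketch. It assumes each variant has already been reduced to the number of mice carrying it; the function name, the bin labels, and the toy variant keys are hypothetical and are not taken from the study's pipeline. The last part only re-derives the reported percentages from the reported counts.

```python
from collections import Counter


def bin_by_carrier_count(carriers_per_variant, min_carriers=22):
    """Classify variants by how many mice carry them.

    carriers_per_variant maps a variant key to the number of carrier mice
    (assumed to be >= 1). Variants carried by at least min_carriers mice
    are kept for further analysis.
    """
    bins = Counter()
    kept = []
    for variant, n_carriers in carriers_per_variant.items():
        if n_carriers >= min_carriers:
            bins["shared"] += 1
            kept.append(variant)
        elif n_carriers == 1:
            bins["singleton"] += 1
        else:
            bins["intermediate"] += 1
    return bins, kept


# Toy input: hypothetical variant keys and counts, not data from the study.
toy = {
    "chr1:100:A>G": 1,
    "chr2:200:C>T": 10,
    "chr3:300:G>A": 22,
    "chr4:400:T>C": 44,
}
bins, kept = bin_by_carrier_count(toy)
print(dict(bins), kept)

# Counts reported in the text, expressed as percentages of the total.
reported = {"singleton": 883, "intermediate": 1329, "shared": 812}
total = sum(reported.values())
for name, count in reported.items():
    print(f"{name}: {count} ({100 * count / total:.1f}%)")
```

With a cohort of 44 mice, `min_carriers=22` corresponds to the 50% threshold; a different cohort size would need a different value.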

Variants in DDR genes

We searched the 812 variants but did not identify any variants in Brca1 or Brca2. We further searched for variants in the remaining 167 DDR genes involved in 7 DNA damage repair pathways but did not identify any variants in these genes either (Supplementary Table 3A, B).
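A search of this kind amounts to checking each annotated variant's gene against the DDR gene list. The sketch below is one way to express it, assuming the annotation step yields (variant, gene symbol) pairs. The function name, the case-insensitive matching, and the coordinates are illustrative assumptions; the two-gene set stands in for the full list of 169 DDR genes.

```python
def split_by_gene_set(variant_genes, gene_set):
    """Partition (variant, gene) pairs by membership of the gene in gene_set.

    Gene symbols are compared case-insensitively, which is an assumption
    made here so that mouse-style (Brca1) and human-style (BRCA1) symbols match.
    """
    targets = {gene.upper() for gene in gene_set}
    inside, outside = [], []
    for variant, gene in variant_genes:
        if gene.upper() in targets:
            inside.append((variant, gene))
        else:
            outside.append((variant, gene))
    return inside, outside


# Stand-in for the full DDR gene list used in the study.
ddr_genes = {"Brca1", "Brca2"}

# Gene symbols come from the text; the coordinates are hypothetical.
annotated = [("chr1:100:A>G", "Mroh2a"), ("chr17:200:C>T", "C4b")]

in_ddr, not_in_ddr = split_by_gene_set(annotated, ddr_genes)
print(len(in_ddr), len(not_in_ddr))
```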

Variants in non-DDR genes

We then annotated the 812 variants and identified 116 non-DDR genes carrying these variants, of which Mroh2a, which encodes a HEAT-domain-containing protein of unknown function, had the highest number with 85 variants, and C4b, which encodes a component of the complement system, had the second highest number with 53 variants (Table 3, Supplementary Table 4). We used Sanger sequencing to validate a subset of the variants in the original 2018 DNA samples used for exome sequencing. Of the 15 variants tested, 10 (67%) were validated (Supplementary Table 5). The variants identified in the non-DDR genes served as an internal control, indicating that the absence of variation in DDR genes was a true biological phenomenon rather than a failure of detection caused by technical errors.

Discussion

The C57BL/6 J genome in 1998 was sequenced to generate the mouse genome reference sequences. Twenty years later, after 30 generations from 1998 to 2018, we re-sequenced the coding genes of C57BL/6 J in 44 individuals to determine whether variation had arisen in the DDR genes of the C57BL/6 J genome during this period. Our study did not identify new variants in DDR genes, including Brca1 and Brca2, in the C57BL/6 J genome. The presence of new variants in over a hundred non-DDR genes during the same period provided strong assurance of the reliability of the observed lack of selection in DDR genes, and ruled out the possibility that the lack of variation in the full set of DDR genes was due to technical failure. The data from our study indicate the absence of positive selection in DDR genes in C57BL/6 J during the 30-generation period.

The lack of positive selection in DDR genes is unlikely to be due to the short period over which C57BL/6 J was investigated. The 20 years and 30 generations in C57BL/6 J are equivalent to 800 years in humans when 1 year in the mouse is counted as equal to 30 years in humans per generation [19]. Studies have shown that many BRCA variants in humans arose in recent human history. For example, 185delAG in BRCA1, a founder variant in the Ashkenazi Jewish population, arose around 750–1500 years ago [20]; 1499insA in BRCA1, a founder variant in Tuscany, Italy, originated 750 years ago [21]; and BRCA1 c.5266dupC, another founder variant in the Ashkenazi Jewish population, originated 1800 years ago [22].

It is possible that animals kept in a long-term protected laboratory environment experience relaxed selection pressure, leading to altered genetic variation [23]. If the time period is long enough and the starting genome sequences are available, testing genetic variation in wild mice would determine whether this possibility applies to the observation made in C57BL/6 J in our study.

The reference genome sequences used can affect variant identification. After the mouse genome project was completed in 2001, 10 different versions of the C57BL/6 J genome reference sequences were generated, from the first version, mm1, released in 2001 to mm10 released in 2011, before mm39 was released in 2020 (https://genome.ucsc.edu/FAQ/FAQreleases.html). The different versions of the mouse genome reference sequences used essentially the same raw sequence data generated by the mouse genome project, but the variation data differed substantially between versions, a difference that is unlikely to reflect true variation and more likely reflects annotation artifacts. As such, using all the different versions as references for variant identification could lead to high complexity and data inconsistency, and decrease the reliability of the resulting variation data. On the other hand, using a single version of the reference sequences could miss potential variants not identifiable in that version. To address these concerns, we used two later versions of the mouse genome reference sequences, mm7 and mm10, as references for variant identification; we also used the variation data from the B6Eve genome sequences derived from 2003 C57BL/6 J DNA as another reference; and we further used Sanger sequencing to validate selected variants. The combined use of these approaches ensured the reliability and sensitivity of the variants identified in our study for addressing the issue of evolutionary selection in the DDR system in C57BL/6 J.

The evidence for positive selection in DDR genes comes mainly from BRCA in humans, chimpanzees and bonobos [3]. We propose an explanation for why positive selection in BRCA exists in humans and their close relatives but not in other mammals, as represented by the laboratory mouse C57BL/6 J. The basic function of BRCA is to repair DNA double-strand breaks in order to maintain genome stability in mammals. Like many genes involved in essential biological functions, BRCA must be kept stable to perform its essential work [24]. During evolution, however, BRCA in humans, chimpanzees and bonobos acquired new functions, such as enhancing the development of intelligence [25], regulating gene expression [26], and reproduction [27]. Positive selection on these functions is beneficial for fitness, whereas BRCA in other mammals retains only the classical function of DNA damage repair and therefore remains highly stable in order to preserve genome stability. This explanation may also apply to other DDR genes. It will be interesting to look for further evidence supporting this explanation in different mouse strains and different species.

Conclusion

DDR genes in the laboratory mouse strain C57BL/6 J were not under positive selection across the 30-generation period studied, highlighting the possibility that the DDR system in rodents could be evolutionarily stable.

Methods

Sample source

The C57BL/6 J mice used in this study were purchased from Jackson Laboratory in 2017 and inter-bred for 4 generations in the University of Macau Animal Facility. Mouse genomic DNA was extracted in 2018 from the tails of 44 C57BL/6 J mice (15 male and 29 female) using the DNeasy Blood & Tissue Kit (Qiagen) following the manufacturer's instructions. The study was approved by the University of Macau Animal Welfare Committee (UMARE-041-2017) and was carried out in accordance with relevant guidelines and regulations.

Exome sequencing, mapping and variant calling

Exome sequencing was performed in paired-end mode (2 × 150) at > 100x coverage on an Illumina HiSeq 2500 through Novogen customer service (Novogen, Hong Kong). Sequences were aligned to the mouse reference genome sequences mm7 and mm10 using the MEM module of BWA 0.7.17 and sorted using the sort option of Samtools v1.9. Duplicates were removed by Picard in the Genome Analysis Tool Kit (GATK) v4.1.1.0. The IndelRealigner, BaseRecalibrator and ApplyBQSR options in GATK were used for BAM data processing. GenotypeGVCFs in GATK was used to call variants from the BAM files, and Annovar was used for annotation. A variant allele frequency of 20% was used as the cutoff for variant calling. CrossMap was used to convert the variants identified against mm7 into mm10 coordinates to generate a single mm10-based set of variants. The B6Eve variants comprise 2652 coding variants identified from the 2003 C57BL/6 J genome that differ from the 1998 C57BL/6 J-based mouse genome reference sequence GRCm38 (Supplementary Table 6). The 3 variants chr11: 3186080 G > A, chr11: 3187266 C > T, and chr11: 3187367 T > C in Sfi1 were eliminated from the mapping analysis as they were determined to be artifacts by the B6Eve study [17].
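The 20% variant allele frequency cutoff can be illustrated with a short Python sketch. It assumes the frequency is computed per sample as alternate-allele reads divided by total reads at the site, and that the cutoff is inclusive; neither detail is stated in the text, and the function names are hypothetical.

```python
def variant_allele_frequency(ref_reads, alt_reads):
    """Fraction of reads at a site that support the alternate allele."""
    depth = ref_reads + alt_reads
    if depth == 0:
        # No coverage: treat the frequency as zero rather than dividing by zero.
        return 0.0
    return alt_reads / depth


def passes_vaf_cutoff(ref_reads, alt_reads, cutoff=0.20):
    """True if the alternate-allele fraction reaches the cutoff (inclusive, by assumption)."""
    return variant_allele_frequency(ref_reads, alt_reads) >= cutoff


# Toy read counts, not data from the study.
print(passes_vaf_cutoff(ref_reads=80, alt_reads=20))
print(passes_vaf_cutoff(ref_reads=90, alt_reads=10))
```

In practice the read counts would come from the per-sample allele depth fields of the called VCF rather than from hand-entered numbers.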

Source of DNA damage repair genes

DNA damage repair-related genes were downloaded from the KEGG DNA repair-related pathways (http://software.broadinstitute.org/gsea/msigdb). The gene set consists of 169 genes in 7 pathways: base excision repair (BER), DNA replication (DR), Fanconi anemia (FA), homologous recombination (HR), non-homologous end-joining (NHEJ), mismatch repair (MMR), and nucleotide excision repair (NER) (Table 2).
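A combined gene list of this kind can be assembled as the union of the genes in each pathway. The Python sketch below assumes the pathways were obtained in the tab-delimited GMT layout (set name, description, then gene symbols), which is a format MSigDB distributes; the text does not state which file format was actually used. The pathway names and memberships in the toy input are illustrative only.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into a dict of set name -> set of gene symbols.

    Each GMT line is tab-separated: name, description, gene1, gene2, ...
    Lines with fewer than three fields are skipped.
    """
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        name, _description, *genes = fields
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


def union_of_gene_sets(gene_sets):
    """Return every gene that appears in at least one set."""
    return set().union(*gene_sets.values())


# Toy GMT content: hypothetical set names and a handful of genes, not the study's list.
toy_gmt = (
    "TOY_HR\tillustrative\tBRCA1\tBRCA2\tRAD51\n"
    "TOY_FA\tillustrative\tBRCA2\tFANCA\n"
)
pathways = parse_gmt(toy_gmt)
all_genes = union_of_gene_sets(pathways)
print(sorted(all_genes))
```

Taking the union counts a gene that belongs to several pathways only once, which matters when reporting a single total across pathways.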

Table 2 List of DDR genes included in the study
Table 3 Non-DDR genes with variants detected in 2018 C57BL/6 J genome