Background

De novo mutations enrich sequence diversity and carry clues about evolutionary selection [1]. Advances in whole genome sequencing have made genome-wide surveys of de novo mutations possible. Recent studies show that de novo mutations, including de novo copy number variations, are strongly associated with multiple diseases, such as autism and schizophrenia [2]. Currently, de novo mutations are often studied in family trios by comparing the parents' and child's whole genome sequence data with each other and with the publicly available dbSNP database [3]. Variants observed in offspring but not in their parents are considered potential de novo mutations. However, even highly accurate sequencing data inevitably contain errors that lead to false variant calls and apparent Mendelian errors. Therefore, de novo mutation candidates identified by comparing the sequence data of offspring and their parents can be false positives [4]. For this reason, researchers often resequence or genotype the candidates to confirm the true de novo mutations [14]. This procedure can be time-consuming and costly. Here we propose an approach that uses 3-generation families to detect de novo mutations by (a) using the parents and grandparents to search for de novo mutation candidates, and (b) using offspring sequence data to confirm true de novo mutations. We applied this approach to the Genetic Analysis Workshop 18 (GAW18) data and found our results consistent with previous genotyping and resequencing validation efforts, which suggests that the approach is reliable. With the continuously decreasing cost of whole genome sequencing, this approach should be an efficient way to detect de novo mutations.

Methods

The GAW18 data include 20 large Mexican American pedigrees collected as part of the Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples (T2D-GENES) project. Whole genome sequence data on the odd autosomes were provided to the GAW18 participants. Our analysis focused on the 464 individuals who were whole genome sequenced, yielding 12 million SNVs. Among those, more than 6.1 million SNVs are novel, that is, not present in dbSNP129. Among the novel SNVs, 5,086,136 have minor allele frequencies less than 0.5% (Figure 1). Because our goal is to detect de novo mutations, we restricted our analysis to these novel and rare SNVs in order to reduce the false-positive rate. When a real de novo mutation is present in an individual, it has a 50% probability of being transmitted to each of the individual's children. Thus, the transmission of a variant from an individual to the individual's offspring can be used as a validation step in detecting de novo mutations. We therefore selected families with sequence data available for at least 3 generations. A total of 8 three-generation families were selected (Figure 2). For each of the families in Figure 2, we examined every rare and novel variant and considered it a de novo mutation candidate if it was present in a parent (the child in the triangle) but absent in both grandparents. We next examined whether each de novo mutation candidate was transmitted from the parent to the parent's offspring. Only a candidate that was transmitted to the offspring was declared a true de novo mutation. Among the 8 families in Figure 2, 4 families (1 a family and 3 e families) were used to identify de novo mutations in males, and 4 families (2 b families, 1 c family, and 1 d family) were used to identify de novo mutations in females, depending on whether the parent carrying the candidate is male or female.
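The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the analysis: the function name `find_de_novo`, the sample labels, the variant identifiers, and the representation of each individual as a set of variants carrying a non-reference allele are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def find_de_novo(
    carriers: Dict[str, Set[str]],   # sample ID -> variants with a non-reference allele (assumed layout)
    grandparents: Iterable[str],     # the two first-generation samples
    parent: str,                     # second-generation sample in which candidates are sought
    children: Iterable[str],         # third-generation samples used for confirmation
    rare_novel: Set[str],            # variants that are novel and rare (the analysis set)
    other_parent: Optional[str] = None,  # sequenced spouse, when available (families d and e)
) -> Tuple[Set[str], Set[str]]:
    """Return (candidates, confirmed) following the 3-generation rule."""
    # Step 1: rare, novel variants present in the parent ...
    candidates = set(carriers[parent]) & set(rare_novel)
    # ... but absent in both grandparents.
    for gp in grandparents:
        candidates -= carriers[gp]
    # Optional filter: drop variants that are also present in the other parent.
    if other_parent is not None:
        candidates -= carriers[other_parent]
    # Step 2: a candidate is confirmed only if at least one child carries it.
    transmitted: Set[str] = set()
    for child in children:
        transmitted |= carriers[child]
    return candidates, candidates & transmitted


# Toy example with made-up variant labels.
toy = {
    "GF": {"v1"},
    "GM": {"v2"},
    "P": {"v1", "v3", "v4", "v5"},
    "SP": {"v5"},
    "C1": {"v3", "v5"},
}
cands, conf = find_de_novo(toy, ["GF", "GM"], "P", ["C1"], {"v1", "v3", "v4", "v5"})
print(sorted(cands), sorted(conf))
```

In the toy data, `v1` is inherited from a grandparent and is excluded, `v4` is a candidate that is never transmitted, and `v3` and `v5` are confirmed. Passing `other_parent="SP"` additionally removes `v5`, which mirrors the extra filter available when both parents are sequenced.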
We further categorized the 8 families into 2 types according to the number of offspring: type I (1 offspring) included families a, b, d, and e, and type II (2 offspring) included family c. Let N_o be the number of de novo mutations observed in a family and L be the total sequence length of the odd autosomes in humans. For a type I family, the total number of de novo mutations is estimated as 2N_o because only half of them are expected to be transmitted. Because humans carry 2 copies of each autosome, the mutation rate µ is estimated as 2N_o/(2L) = N_o/L. For a type II family, 75% of de novo mutations are expected to be transmitted to at least 1 of the 2 children, so the total number is estimated as 4N_o/3 and the mutation rate µ as 4N_o/(3 × 2L) = 2N_o/(3L). Because families d and e have sequence data available for both parents, it is possible to further exclude any variants present in both parents, which further reduces the false discovery rate.
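The two estimators can be viewed as special cases of one formula. The sketch below is an assumption-labeled generalization, not part of the original analysis: it assumes each child inherits a given de novo mutation independently with probability 1/2, so a mutation reaches at least 1 of k children with probability 1 - (1/2)^k. The function name `mutation_rate` and its parameters are hypothetical.

```python
def mutation_rate(n_observed: float, n_children: int, seq_length: float) -> float:
    """Estimate the per-position, per-haploid-genome mutation rate.

    n_observed  : confirmed (transmitted) de novo mutations, N_o
    n_children  : number of sequenced offspring used for confirmation
    seq_length  : haploid length L of the region analysed, in base pairs
    """
    if n_children < 1:
        raise ValueError("at least one child is needed to confirm transmission")
    if seq_length <= 0:
        raise ValueError("sequence length must be positive")
    # Probability that a real de novo mutation is seen in at least one child
    # (assumes independent transmission with probability 1/2 per child).
    p_transmit = 1.0 - 0.5 ** n_children
    # Correct the observed count for untransmitted mutations.
    total_mutations = n_observed / p_transmit
    # Divide by 2L because each individual carries two copies of each autosome.
    return total_mutations / (2.0 * seq_length)


# Type I (1 child) reduces to N_o / L; type II (2 children) reduces to 2 N_o / (3 L).
print(mutation_rate(10, 1, 100.0))  # same as 10 / 100
print(mutation_rate(3, 2, 100.0))   # same as 2 * 3 / (3 * 100)
```

The inputs in the usage lines are arbitrary numbers chosen only to show that the general form agrees with the type I and type II expressions in the text.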

Figure 1

Comparison between the distribution of SNVs in dbSNP129 and novel SNVs.

Figure 2

A summary of the selected family types. We identified 1 a family, 2 b families, 1 c family, 1 d family, and 3 e families. The upper trios (in the dashed triangles) are used to identify de novo mutation candidates, and the third generations are used to confirm true de novo mutations. Families a and e measure de novo mutations in males. Families b, c, and d measure de novo mutations in females.

Results

We analyzed the sequencing data after the quality control provided by GAW18. By investigating the first 2 generations in the 8 families, we identified a total of 13,584 de novo mutation candidates. Among these candidates, 186 were transmitted to the grandchildren. On average, 23.25 (±5.62) de novo mutations per family were discovered on the odd autosomes (Table 1). Given an average of 1.35 billion base pairs on the odd autosomes, we estimated an average mutation rate (µ) of 1.64 (±0.42) × 10^-8 per position per haploid genome, which falls within the range of 1.1 × 10^-8 to 3.8 × 10^-8 reported in the literature [4-6]. We did not observe a significant difference between the de novo mutation rates in males (1.61 × 10^-8) and females (1.67 × 10^-8).
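As a rough arithmetic check on these figures, the pooled counts can be pushed through the type I estimator N_o/L. This is a back-of-the-envelope sketch, not a reproduction of the reported estimate: it treats every family as type I, whereas the reported value averages per-family estimates and family c uses the smaller type II estimator, so the reported mean is expected to lie somewhat below this crude figure.

```python
# Values taken from the text.
confirmed_total = 186   # confirmed de novo mutations across all families
n_families = 8
seq_length = 1.35e9     # base pairs on the odd autosomes

mean_per_family = confirmed_total / n_families

# Crude rate: apply the type I estimator N_o / L to the mean count
# (simplifying assumption: every family has a single confirming child).
crude_rate = mean_per_family / seq_length

print(f"mean per family: {mean_per_family:.2f}")
print(f"crude rate: {crude_rate:.2e} per position per haploid genome")
```

The crude value falls inside the literature range quoted above, which is the only point this check is meant to make.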

Table 1 Summary of de novo mutation numbers in each family.

We used the UCSC genome browser (http://genome.ucsc.edu/) [7, 8] and SIFT (http://sift.jcvi.org/) [9] to map the 186 de novo mutations and predict their effects on protein function. Seven of them are in exonic regions, and 2 of these are nonsynonymous SNVs. One of the nonsynonymous SNVs is in the gene PDZ domain containing 2 (PDZD2) on chromosome 5; the other is in the gene spastic ataxia of Charlevoix-Saguenay (sacsin) (SACS) on chromosome 13. PDZ domains are protein-protein recognition modules that play a central role in organizing diverse cell signaling assemblies, most often by binding the cytoplasmic tails of transmembrane receptors and channels. PDZD2 and its secreted form (sPDZD2) are possibly involved in the functional maturation of human fetal pancreatic progenitor cell (PPC)-derived islet-like cell clusters (ICCs) and in the early stages of prostate tumorigenesis [10, 11]. SACS encodes the sacsin protein, which is highly expressed in the central nervous system. Mutations in this gene cause autosomal recessive spastic ataxia of Charlevoix-Saguenay, but the details of its function remain unknown [12, 13].

CpG sites are known mutation hotspots in mammals [14]. In the great apes, the de novo mutation rate at CpG sites is estimated to be 11 times higher than that at non-CpG sites [4, 15]. We extracted the CpG islands from the UCSC genome browser and examined the locations of the identified de novo mutations. Of our 186 confirmed de novo mutations, only 1 is located in a CpG island. Considering the coverage of CpG islands on the odd autosomes, we have likely underestimated the number of CpG mutations. Among the remaining 185 non-CpG mutations, we observed 127 transitions and 58 transversions. The transition-to-transversion ratio is 2.2, similar to previous estimates [4, 6].
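For readers less familiar with the classification, a transition exchanges a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T), and any other substitution is a transversion. The following Python sketch shows how the ratio could be computed from reference/alternate allele pairs; the function names and the input layout are illustrative assumptions, not the code used in the analysis.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snvs) -> float:
    """Transition-to-transversion ratio for an iterable of (ref, alt) pairs."""
    snvs = list(snvs)
    n_ts = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    n_tv = len(snvs) - n_ts
    if n_tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return n_ts / n_tv


# Counts reported in the text: 127 transitions and 58 transversions.
print(round(127 / 58, 1))
```

With the reported counts, 127/58 rounds to 2.2, matching the ratio stated above.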

Furthermore, we used linear models to examine the relationship between parental age and the de novo mutation rate in the child, using the first 2 generations in the 8 families. In general, the de novo mutation rate in the child increases with the ages of the child's parents, especially with the father's age. This is consistent with a previous report that the de novo mutation rate in offspring is positively correlated with paternal age [1]. Nevertheless, no significant association was observed, likely because of the small sample size in this study.
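The kind of linear model described here can be illustrated with an ordinary least-squares fit of mutation count on paternal age. The sketch below is a minimal example under stated assumptions: the function name `fit_line` is hypothetical, the ages and counts are invented solely to exercise the code and are not GAW18 data, and no significance test is performed.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustration data (NOT from the study): paternal age at conception
# and confirmed de novo mutation counts in the corresponding child.
paternal_age = [25, 30, 35, 40]
mutation_count = [20, 22, 24, 26]

slope, intercept = fit_line(paternal_age, mutation_count)
print(f"slope = {slope:.2f} mutations per year, intercept = {intercept:.2f}")
```

A positive slope corresponds to the trend described in the text. With only 8 families, as noted above, such a slope would carry wide uncertainty, so a real analysis would also report a standard error or confidence interval.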

Discussion

We analyzed the whole genome sequences on the odd autosomes of 8 three-generation families to identify de novo mutations. We found this 3-generation approach to be efficient, although no further resequencing of the candidate variants was performed. In the 8 selected Mexican American families, we estimated a mutation rate of 1.64 (±0.42) × 10^-8 per position per haploid human genome, which is consistent with previous estimates [4-6].

Among the 13,584 de novo mutation candidates observed in the 8 three-generation families, only 186 were observed in the grandchildren. This is far fewer than the roughly 50% that would be expected to be transmitted if all candidates were real, suggesting that most de novo mutation candidates can be attributed to SNV calling errors. Because the goals of a whole genome sequencing project are to detect rare and possibly de novo variants and to test their association with a complex disease, accounting for false-positive SNV calls is extremely important in an association study. Our analysis suggests that sequencing family members is an efficient way to detect these SNV calling errors. For example, our analysis suggests that a variant observed in offspring but not in their parents in a simple trio can usually be treated as an SNV calling error and should be excluded from downstream analyses. Previous studies suggest that family data have many statistical advantages in detecting rare disease variants [16, 17]. Thus, our results suggest that whole genome sequencing of family members is worthwhile, even though most current whole genome sequencing projects focus only on unrelated subjects. It should be pointed out that recruiting multigeneration pedigrees is more difficult than recruiting family trios. However, many multigeneration pedigrees have already been collected in traditional linkage studies, such as the pedigrees used here. We expect the proposed method to be useful in detecting de novo mutations.
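The gap between observed and expected transmissions can be made concrete with a short calculation. This is a simplified sketch: it assumes a single confirming child per family (transmission probability 0.5, which ignores the second child in family c) and treats every confirmed variant as real.

```python
# Counts reported in the text.
candidates = 13_584   # de novo candidates found in the first 2 generations
confirmed = 186       # candidates also observed in the grandchildren

p_transmit = 0.5      # simplifying assumption: one confirming child per family

# Transmissions expected if every candidate were a real de novo mutation.
expected_if_all_real = candidates * p_transmit

# Number of real mutations implied by the confirmed count, and the share of
# all candidates that this represents.
implied_real = confirmed / p_transmit
implied_fraction_real = implied_real / candidates

print(f"expected transmissions if all real: {expected_if_all_real:.0f}")
print(f"implied real mutations: {implied_real:.0f}")
print(f"implied fraction of candidates that are real: {implied_fraction_real:.1%}")
```

Under these assumptions, only a few percent of the trio-based candidates would be genuine, which is the quantitative basis for attributing most candidates to calling errors.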