Samples
Sequencing data were taken from our two published studies [7, 17] and reanalyzed. Part of the sequencing data for cases 1 and 2 came from SRP067387 [7]. That study was approved by the Reproductive Study Ethics Committee at Peking University Third Hospital (research license 2014SZ001). In case 1, the father has a family history of hereditary multiple exostoses and is himself affected. The affected grandfather, both parents, and 18 embryos were sequenced (Table 1). In case 2, the mother carries an X-linked mutation and her son has hypohidrotic ectodermal dysplasia. The affected child, both parents, 4 embryos, and their corresponding 8 polar bodies were sequenced (Table 1). Sequencing data for case 3 came from a previously published dataset [17]. That study was approved by the Research Ethics Committee of the First Hospital of Sun Yat-sen University ([2014]134). In case 3, both parents are affected by beta thalassemia and carry mutations at different sites. Both parents and seven sperm cells were sequenced. All samples underwent whole-genome amplification (WGA) with the MALBAC kit [21] (Yikon Genomics Inc.). After WGA, the causal mutation region was enriched by PCR amplification using specific primers close to the affected region (Table S1). The total product was then sequenced on an Illumina HiSeq 2500 at ~2× mean genome depth.
Table 1 Sample description. Cases 1 and 2 are from reference [7], and case 3 is from reference [17]
Calculating disease-carrying probability
When a proband sample is available, the disease-carrying allele is phased using methods similar to those of previous analyses [7] (Fig. 1b, Fig. S1, Supplementary methods); when a proband sample is absent, it is phased as described in the next section. After phasing the disease-carrying allele, the error probability is calculated to estimate an embryo’s disease-carrying status through Bayesian inference. Bayesian inference calculates a posterior probability according to Bayes’ theorem (https://en.wikipedia.org/wiki/Bayesian_inference):
$$ {P}_{\mathrm{H}\mid \mathrm{E}}=\frac{P_{\mathrm{E}\mid \mathrm{H}}\times {P}_{\mathrm{H}}}{P_{\mathrm{E}}}=\frac{P_{\mathrm{E}\mid \mathrm{H}}\times {P}_{\mathrm{H}}}{P_{\mathrm{E}\mid \mathrm{H}}\times {P}_{\mathrm{H}}+{P}_{\mathrm{E}\mid \mathrm{no}\ \mathrm{H}}\times {P}_{\mathrm{noH}}} $$
where \( {P}_{\mathrm{H}\mid \mathrm{E}} \) is the probability of the hypothesis (H) given the evidence (E), and \( {P}_{\mathrm{E}\mid \mathrm{H}} \) is the probability of the evidence (E) if the hypothesis (H) is true. \( {P}_{\mathrm{H}} \) is the prior probability of the hypothesis, that is, the probability estimated before the evidence (E) is observed. \( {P}_{\mathrm{E}} \) is the total probability of the evidence (E), and “no H” denotes the negation of the hypothesis.
In our case, the evidence (E) is the sequencing data of a proband, parents, and embryos, i.e., the phased disease-carrying allele and the genotypes at all sites in the embryos, written as “all sites” in the following formula. The hypothesis (H) is that the embryo carries the disease, and “no H” means that the embryo is normal. The probability of the embryo carrying the disease given all sites is written as \( {P}_{\mathrm{disease}\mid \mathrm{all}\ \mathrm{sites}} \), shortened to \( {P}_{\mathrm{disease}} \). \( {P}_{\mathrm{disease}} \) can then be calculated according to Bayes’ theorem (Fig. 1a) as follows:
$$ {P}_{\mathrm{disease}}=\frac{P_{\mathrm{all}\ \mathrm{sites}\mid \mathrm{disease}}\times {P}_{\mathrm{disease}\ \mathrm{prior}}}{P_{\mathrm{all}\ \mathrm{sites}\mid \mathrm{disease}}\times {P}_{\mathrm{disease}\ \mathrm{prior}}+{P}_{\mathrm{all}\ \mathrm{sites}\mid \mathrm{normal}}\times {P}_{\mathrm{normal}\ \mathrm{prior}}} $$
where \( {P}_{\mathrm{disease}} \) is the probability of the embryo carrying the disease given the sequencing data (all sites), and \( {P}_{\mathrm{all}\ \mathrm{sites}\mid \mathrm{disease}} \) is the conditional probability of observing the genotypes at all sites if the embryo carries the disease. \( {P}_{\mathrm{disease}\ \mathrm{prior}} \) is the prior probability of the embryo carrying the disease before the sequencing data are obtained. The corresponding probabilities for “normal,” \( {P}_{\mathrm{all}\ \mathrm{sites}\mid \mathrm{normal}} \) and \( {P}_{\mathrm{normal}\ \mathrm{prior}} \), are defined analogously.
To compute \( {P}_{\mathrm{disease}} \) for each embryo, we need the prior probabilities and the conditional probabilities. The prior probabilities of an embryo carrying the disease, \( {P}_{\mathrm{disease}\ \mathrm{prior}} \), and of being normal, \( {P}_{\mathrm{normal}\ \mathrm{prior}} \), are both 0.5, reflecting Mendelian inheritance. If there are N sites upstream of the causal mutation site and N′ sites downstream (Fig. 1d), the conditional probabilities \( {P}_{\mathrm{all}\ \mathrm{sites}\mid \mathrm{disease}} \) and \( {P}_{\mathrm{all}\ \mathrm{sites}\mid \mathrm{normal}} \) can be computed from the upstream and downstream sites as follows:
$$ {P}_{\mathrm{all}\ \mathrm{sites}\mid \mathrm{disease}}={P}_{\mathrm{sites}\ 1\ \mathrm{to}\ N\mid \mathrm{disease}}\times {P}_{\mathrm{sites}\ {1}^{\prime }\ \mathrm{to}\ {N}^{\prime}\mid \mathrm{disease}} $$
$$ {P}_{\mathrm{all}\ \mathrm{sites}\mid \mathrm{normal}}={P}_{\mathrm{sites}\ 1\ \mathrm{to}\ N\mid \mathrm{normal}}\times {P}_{\mathrm{sites}\ {1}^{\prime }\ \mathrm{to}\ {N}^{\prime}\mid \mathrm{normal}} $$
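As a minimal sketch of how the posterior follows from these quantities, the function below multiplies the upstream and downstream conditional probabilities and applies Bayes' theorem. The function and parameter names are illustrative assumptions and are not taken from the published program; the default prior of 0.5 is the Mendelian value stated above.

```python
def posterior_disease(up_disease, down_disease, up_normal, down_normal,
                      prior_disease=0.5):
    """Posterior probability that the embryo carries the disease allele.

    up_* / down_*: conditional probabilities of the upstream / downstream
    sites given that the embryo is disease-carrying or normal.
    """
    # P(all sites | disease) and P(all sites | normal) as upstream x downstream
    like_disease = up_disease * down_disease
    like_normal = up_normal * down_normal
    numerator = like_disease * prior_disease
    denominator = numerator + like_normal * (1.0 - prior_disease)
    if denominator == 0.0:
        raise ValueError("both conditional probabilities are zero")
    return numerator / denominator
```

With many sites the products can become very small, so an implementation might prefer to work with logarithms; that choice is not described in the text.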
Assuming that recombination in non-overlapping regions is independent, the conditional probability of the upstream sites is calculated as follows, where site 0 denotes the causal mutation site. The conditional probability of the downstream sites is calculated in the same manner.
$$ {P}_{\mathrm{site}\mathrm{s}\ 1\ \mathrm{to}\ N\mid \mathrm{disease}}={P}_{\mathrm{site}\mathrm{s}\ 1\ \mathrm{to}\ N\mid \mathrm{site}\ 0\ \mathrm{disease}}={P}_{\mathrm{site}\mathrm{s}\ 2\ \mathrm{to}\ N\mid \mathrm{site}\ 1\ \mathrm{disease}}\times {P}_{\mathrm{site}\ 1\ \mathrm{disease}}\times \left(1-{P}_{\mathrm{recom}\ 01}\right)+{P}_{\mathrm{site}\mathrm{s}\ 2\ \mathrm{to}\ N\mid \mathrm{site}\ 1\ \mathrm{normal}}\times {P}_{\mathrm{site}\ 1\ \mathrm{normal}}\times {P}_{\mathrm{recom}\ 01} $$
$$ {P}_{\mathrm{site}\mathrm{s}\ 1\ \mathrm{to}\ N\mid \mathrm{normal}}={P}_{\mathrm{site}\mathrm{s}\ 1\ \mathrm{to}\ N\mid \mathrm{site}\ 0\ \mathrm{normal}}={P}_{\mathrm{site}\mathrm{s}\ 2\ \mathrm{to}\ N\mid \mathrm{site}\ 1\ \mathrm{disease}}\times {P}_{\mathrm{site}\ 1\ \mathrm{disease}}\times {P}_{\mathrm{recom}\ 01}+{P}_{\mathrm{site}\mathrm{s}\ 2\ \mathrm{to}\ N\mid \mathrm{site}\ 1\ \mathrm{normal}}\times {P}_{\mathrm{site}\ 1\ \mathrm{normal}}\times \left(1-{P}_{\mathrm{recom}\ 01}\right) $$
Similarly, the conditional probability given site i − 1 can be computed from the conditional probabilities given site i for \( 1\le i\le N-1 \):
$$ {P}_{\mathrm{site}\mathrm{s}\ i\ \mathrm{to}\ N\mid \mathrm{site}\ i-1\ \mathrm{disease}}={P}_{\mathrm{site}\mathrm{s}\ i+1\ \mathrm{to}\ N\mid \mathrm{site}\ i\ \mathrm{disease}}\times {P}_{\mathrm{site}\ i\ \mathrm{disease}}\times \left(1-{P}_{\mathrm{recom}\ i\left(i-1\right)}\right)+{P}_{\mathrm{site}\mathrm{s}\ i+1\ \mathrm{to}\ N\mid \mathrm{site}\ i\ \mathrm{normal}}\times {P}_{\mathrm{site}\ i\ \mathrm{normal}}\times {P}_{\mathrm{recom}\ i\left(i-1\right)} $$
$$ {P}_{\mathrm{site}\mathrm{s}\ i\ \mathrm{to}\ N\mid \mathrm{site}\ i-1\ \mathrm{normal}}={P}_{\mathrm{site}\mathrm{s}\ i+1\ \mathrm{to}\ N\mid \mathrm{site}\ i\ \mathrm{disease}}\times {P}_{\mathrm{site}\ i\ \mathrm{disease}}\times {P}_{\mathrm{recom}\ i\left(i-1\right)}+{P}_{\mathrm{site}\mathrm{s}\ i+1\ \mathrm{to}\ N\mid \mathrm{site}\ i\ \mathrm{normal}}\times {P}_{\mathrm{site}\ i\ \mathrm{normal}}\times \left(1-{P}_{\mathrm{recom}\ i\left(i-1\right)}\right) $$
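When i equals N,
$$ {P}_{\mathrm{site}\mathrm{s}\ N\ \mathrm{to}\ N\mid \mathrm{site}\ N-1\ \mathrm{disease}}={P}_{\mathrm{site}\ N\ \mathrm{disease}}\times \left(1-{P}_{\mathrm{recom}\ N\left(N-1\right)}\right)+{P}_{\mathrm{site}\ N\ \mathrm{normal}}\times {P}_{\mathrm{recom}\ N\left(N-1\right)} $$
The recombination rate \( {P}_{\mathrm{recom}\ i\left(i-1\right)} \) is computed as follows: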
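$$ {P}_{\mathrm{recom}\ i\left(i-1\right)}={P}_{\mathrm{recom}\ \mathrm{in}\ \mathrm{the}\ 1\mathrm{Mb}\ \mathrm{region}}\times {P}_{\mathrm{distance}\ i\left(i-1\right)\ \left(/\mathrm{Mb}\right)} $$
where \( {P}_{\mathrm{recom}\ \mathrm{in}\ \mathrm{the}\ 1\mathrm{Mb}\ \mathrm{region}} \) is the recombination rate estimated by deCODE [22] and \( {P}_{\mathrm{distance}\ i\left(i-1\right)\ \left(/\mathrm{Mb}\right)} \) is the distance between sites i and i − 1 expressed in Mb.
Notably, in previous MARSALA analyses, the PCR result for the causal mutation site and the linkage analysis estimated the disease-carrying status separately. In the Bayesian inference, the PCR result for the causal mutation site is combined with the linkage analysis: the disease site is introduced as a special linkage site by setting the recombination rate between this special linkage site and the disease site to 0.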
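The recursion above can be sketched as a single backward pass from site N to site 1. This is an illustration under stated assumptions, not the published implementation: the function names, the argument layout, and the rate of 0.01 per Mb used in the usage note are placeholders (the text takes the regional rate from deCODE [22]), and the per-site probabilities are taken as given inputs.

```python
def recombination_prob(pos_a, pos_b, rate_per_mb):
    """P_recom between two sites: regional rate per Mb times distance in Mb."""
    return rate_per_mb * abs(pos_a - pos_b) / 1e6


def upstream_likelihoods(positions, site_probs, rate_per_mb):
    """Return (P(sites 1..N | site 0 disease), P(sites 1..N | site 0 normal)).

    positions[0] is the causal site (site 0); positions[i] is site i (bp).
    site_probs[i - 1] = (P_site_i_disease, P_site_i_normal) for i = 1..N.
    """
    n = len(site_probs)
    # Past site N there is nothing left to explain, so start from 1.0;
    # the first iteration then reproduces the i = N base case.
    next_d, next_n = 1.0, 1.0
    for i in range(n, 0, -1):
        p_d, p_n = site_probs[i - 1]
        r = recombination_prob(positions[i], positions[i - 1], rate_per_mb)
        cur_d = next_d * p_d * (1.0 - r) + next_n * p_n * r
        cur_n = next_d * p_d * r + next_n * p_n * (1.0 - r)
        next_d, next_n = cur_d, cur_n
    return next_d, next_n
```

For example, `upstream_likelihoods([0, 100_000, 200_000], [(0.9, 0.1), (0.8, 0.2)], 0.01)` evaluates two upstream sites spaced 100 kb apart. Giving a linkage site the same position as the causal site yields a recombination probability of 0, which is one way to express the special linkage site described above.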
\( {P}_{\mathrm{site}\ i\ \mathrm{disease}} \) and \( {P}_{\mathrm{site}\ i\ \mathrm{normal}} \) are the probabilities that site i comes from the disease-carrying allele and from the normal allele, respectively. They are calculated by combining the genotype probabilities of all family members, as generated by GATK [23]:
$$ {P}_{\mathrm{site}\ i\ \mathrm{disease}}={\Sigma P}_{\mathrm{disease}-\mathrm{supportive}\ \mathrm{combination}}+\frac{1}{2}{\Sigma P}_{\mathrm{neutral}\ \mathrm{combination}} $$
$$ {P}_{\mathrm{site}\ i\ \mathrm{normal}}={\Sigma P}_{\mathrm{normal}-\mathrm{supportive}\ \mathrm{combination}}+\frac{1}{2}{\Sigma P}_{\mathrm{neutral}\ \mathrm{combination}} $$
where \( {P}_{\mathrm{disease}-\mathrm{supportive}\ \mathrm{combination}} \) is the probability of a genotype combination of the parents and embryos under which the site appears to come from the disease-carrying allele, and \( {P}_{\mathrm{normal}-\mathrm{supportive}\ \mathrm{combination}} \) is the probability of a genotype combination under which the site appears to come from the healthy allele. \( {P}_{\mathrm{neutral}\ \mathrm{combination}} \) is the probability of a genotype combination from which the allele origin for the embryo cannot be decided. The probability of each combination is the product of the genotype probabilities of the samples in it:
$$ {P}_{\mathrm{combination}}=\prod \limits_j{P}_{\mathrm{gt}\ \mathrm{of}\ \mathrm{sample}\ j\ \mathrm{in}\ \mathrm{the}\ \mathrm{combination}} $$
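A minimal sketch of how the per-site probabilities could be assembled from genotype combinations is given below. It assumes each combination has already been labeled as disease-supportive, normal-supportive, or neutral; the labeling rule itself, the function names, and the input layout are assumptions made for illustration.

```python
from math import prod


def combination_prob(genotype_probs):
    """P_combination: product of the genotype probabilities of its samples."""
    return prod(genotype_probs)


def site_origin_probs(combinations):
    """Return (P_site_i_disease, P_site_i_normal).

    combinations: iterable of (label, genotype_probs), where label is one of
    "disease", "normal", or "neutral" and genotype_probs lists the genotype
    probability of each family member in that combination.
    """
    sums = {"disease": 0.0, "normal": 0.0, "neutral": 0.0}
    for label, genotype_probs in combinations:
        sums[label] += combination_prob(genotype_probs)
    # Neutral combinations are split evenly between the two origins.
    p_disease = sums["disease"] + 0.5 * sums["neutral"]
    p_normal = sums["normal"] + 0.5 * sums["neutral"]
    return p_disease, p_normal
```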
$$ {P}_{\mathrm{gt}\ \mathrm{of}\ \mathrm{sample}\ j}={P}_{\mathrm{gt}\mid \mathrm{all}\ \mathrm{read}\ \mathrm{data}}=\frac{P_{\mathrm{gt}}\times {P}_{\mathrm{all}\ \mathrm{read}\ \mathrm{data}\mid \mathrm{gt}}}{\sum \limits_{\mathrm{gt}}\left({P}_{\mathrm{gt}}\times {P}_{\mathrm{all}\ \mathrm{read}\ \mathrm{data}\mid \mathrm{gt}}\right)} $$
$$ {P}_{\mathrm{all}\ \mathrm{read}\ \mathrm{data}\mid \mathrm{gt}}={\Sigma}_{\mathrm{gt}\ \mathrm{after}\ \mathrm{amplification}}\left({P}_{\mathrm{gt}\ \mathrm{after}\ \mathrm{amplification}\mid \mathrm{gt}}\times {P}_{\mathrm{data}\mid \mathrm{gt}\ \mathrm{after}\ \mathrm{amplification}}\right) $$
$$ {P}_{\mathrm{data}\mid \mathrm{gt}\ \mathrm{after}\ \mathrm{amplification}}=\prod \limits_{\mathrm{read}}\left(\frac{P_{\mathrm{read}\mid \mathrm{haplotype}1}}{2}+\frac{P_{\mathrm{read}\mid \mathrm{haplotype}2}}{2}\right)\ \left[22\right] $$
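The genotype posterior and the per-read product can be illustrated with the sketch below. Several simplifications are assumptions of this example rather than statements about the method: the per-read model (a matching base has probability 1 − e, a mismatching base e/3), the flat genotype priors, and the omission of the amplification step, so that the genotype after amplification is treated as identical to the true genotype. All names are hypothetical.

```python
def read_likelihood(base, error, allele):
    """P(read | haplotype) under an assumed simple base-error model."""
    return 1.0 - error if base == allele else error / 3.0


def data_likelihood(reads, genotype):
    """P(data | genotype): each read is drawn from either haplotype with 1/2."""
    hap1, hap2 = genotype
    likelihood = 1.0
    for base, error in reads:
        likelihood *= (0.5 * read_likelihood(base, error, hap1)
                       + 0.5 * read_likelihood(base, error, hap2))
    return likelihood


def genotype_posteriors(reads, genotypes, priors=None):
    """Normalize prior x likelihood over the candidate genotypes."""
    if priors is None:
        # Flat priors are an assumption of this sketch.
        priors = {gt: 1.0 / len(genotypes) for gt in genotypes}
    joint = {gt: priors[gt] * data_likelihood(reads, gt) for gt in genotypes}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("no genotype explains the reads")
    return {gt: p / total for gt, p in joint.items()}
```

Here `reads` is a list of `(base, error_probability)` pairs at one site and `genotypes` is a list of allele pairs such as `[("A", "A"), ("A", "G"), ("G", "G")]`.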
Embryos with \( {P}_{\mathrm{disease}} \) smaller than \( {10}^{-4} \) are classified as “normal,” and those with \( {P}_{\mathrm{disease}} \) between \( {10}^{-4} \) and 0.1 as “normal_risk.” Embryos with \( {P}_{\mathrm{disease}} \) greater than 0.9 are classified as “disease-carrying,” and those with \( {P}_{\mathrm{disease}} \) between 0.6 and 0.9 as “disease_risk.” Embryos whose \( {P}_{\mathrm{disease}} \) is between 0.1 and 0.6 are categorized as “risk.”
The error probability is the probability that the classification of an embryo is wrong: it is \( 1-{P}_{\mathrm{disease}} \) when the embryo is assumed to be disease-carrying and \( {P}_{\mathrm{disease}} \) when the embryo is assumed to be normal.
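The thresholds and the error probability can be written as two small functions. This is a sketch: the text does not say which category a value lying exactly on a boundary belongs to, so the handling of the boundary values below is an assumption, as are the function names.

```python
def classify(p_disease):
    """Map P_disease to the categories described in the text."""
    if p_disease < 1e-4:
        return "normal"
    if p_disease < 0.1:
        return "normal_risk"
    if p_disease <= 0.6:          # boundary handling is an assumption
        return "risk"
    if p_disease <= 0.9:          # boundary handling is an assumption
        return "disease_risk"
    return "disease-carrying"


def error_probability(p_disease, assumed_status):
    """Probability that the assumed status ("disease" or "normal") is wrong."""
    if assumed_status == "disease":
        return 1.0 - p_disease
    if assumed_status == "normal":
        return p_disease
    raise ValueError("assumed_status must be 'disease' or 'normal'")
```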
The disease-carrying probability calculated with the Bayesian approach was compared with the results of the previous studies, which had already been validated on different platforms, including Sanger sequencing, aCGH, and STR analyses [7, 17]. The transferred embryo was also confirmed to be disease-free at prenatal diagnosis, using Sanger sequencing, karyotyping, or SNP array analysis of amniocentesis samples [7, 17].
Phasing without proband sample
When a proband sample is absent, the disease-carrying allele is identified by grouping and phasing the genotypes of all embryos. First, the allele inherited from the disease-carrying parent is deduced for each embryo. Because these alleles come from the disease-carrying parent, each must be either the disease-carrying allele or the normal allele. Next, the alleles are grouped into two classes according to the two kinds of genotypes observed at several sites. To group as many alleles as possible, we chose sites where the genotypes of most embryos, or of most embryos and sperm samples, are specified. Finally, the nucleotide composition is unified across the alleles in each class. The two unified alleles are the two alleles of the disease-carrying parent: the one with the causal mutation is the disease-carrying allele, and the one without it is the normal allele (Fig. 1c). To guard against genotype errors in the embryos or the disease-carrying parent, we discard sites with more than one discordant sample and sites where the two alleles have the same genotype.
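The grouping and unification steps can be sketched as follows. This is a simplified illustration, not the published program: it assumes the allele inherited from the disease-carrying parent has already been deduced for each sample, uses only sites typed in every sample as grouping sites (the text uses sites typed in most samples), ignores recombination within the region, and all names are hypothetical.

```python
from collections import Counter


def phase_without_proband(alleles, n_sites):
    """Group inherited alleles into two classes and build consensus alleles.

    alleles: {sample: [base or None, ...]} with one entry per site, giving
    the base inherited from the disease-carrying parent (None if unknown).
    Returns (classes, haplotypes); haplotypes[c] maps site index -> base.
    """
    samples = list(alleles)
    # Grouping sites: typed in every sample and showing exactly two bases.
    seeds = [s for s in range(n_sites)
             if all(alleles[x][s] is not None for x in samples)
             and len({alleles[x][s] for x in samples}) == 2]
    reference = samples[0]
    classes = {0: [], 1: []}
    for x in samples:
        same = sum(alleles[x][s] == alleles[reference][s] for s in seeds)
        classes[0 if 2 * same >= len(seeds) else 1].append(x)
    haplotypes = {0: {}, 1: {}}
    for s in range(n_sites):
        consensus, discordant = {}, 0
        for c, members in classes.items():
            bases = [alleles[x][s] for x in members if alleles[x][s] is not None]
            if not bases:
                break  # a class has no data at this site: skip the site
            base, count = Counter(bases).most_common(1)[0]
            consensus[c] = base
            discordant += len(bases) - count
        else:
            # Keep the site only if at most one sample disagrees and the
            # two consensus alleles differ.
            if discordant <= 1 and consensus[0] != consensus[1]:
                haplotypes[0][s] = consensus[0]
                haplotypes[1][s] = consensus[1]
    return classes, haplotypes
```

Following the text, the class whose unified allele carries the causal mutation would then be taken as the disease-carrying allele, and the other as the normal allele.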
All steps are detailed in a program available online (https://github.com/XiongLuoxing/MARSALA). Once the raw sequencing files of the volunteer family members are provided, the copy number variation (CNV) plot and the linkage analysis results can be incorporated into the database automatically.