Background

Diploid eukaryotic species inherit two copies (i.e., two alleles) of the same gene from both parents. If one allele fails to work properly, the other allele can still implement a gene’s cellular function. For some genes, however, this protective mechanism is disabled because only one allele is expressed and its failure probably leads to cellular malfunction. These monoallelically expressed genes can be classified into one of three categories [1]: X-inactivated genes, which are regulated by a random process in which one of the two X chromosomes present in female mammals is silenced [2]; autosomal genes subject to random monoallelic gene expression, such as the T cell receptors and natural killer cell receptors [39]; and autosomal imprinted genes (e.g., CDKN1C and H19), which express from only one of the two alleles according to their parental origin [1013]. Imprinted genes play important functional roles in the control of embryonic growth and development, as well as in post-natal development [1416]. As imprinted genes are expressed from only one of the two parental chromosomes, a de facto haploid state is caused by imprinting and leads to asymmetric functions of parental genomes and loss of diploid protection against recessive mutations [11]. Thus, imprinting dysregulation is linked to numerous human genetic diseases, such as developmental disorders (Prader-Willi syndrome, Angelman syndrome) and cancers (neuroblastoma, Wilms’ tumor) [1720]. In addition, environmental factors can influence gene expression by targeting imprinted genes [21, 22]. Because imprinted genes are more susceptible to disease than non-imprinted genes [23, 24], it is of great importance to identify novel imprinted genes for human.

Identifying imprinted genes experimentally has traditionally been a slow process, and the number of validated ones is much lower than the previous estimation (~1% of all human genes) [25]. However, recently developed high-throughput screening approaches (e.g., expression profiling and single-nucleotide polymorphism [SNP] microarrays) and recently identified DNA sequence characteristics (e.g., the number and type of repeated elements flanking a gene) have led to the proposal of several new methods to predict imprinted genes on a global scale [2631]. With advances in next-generation sequencing technology [32, 33], mRNA-Seq is becoming a powerful tool for transcriptome profiling [34]. It can generate not only the number of reads mapped to exons, which reflects the expression levels of a gene, but also the actual sequence, which may identify the allele from which the mRNA is expressed. Therefore, inference can be made for predicting imprinted genes. For example, by sequencing whole transcriptomes from mice embryos, Babak et al. (2008) measured allelic expression bias and identified six novel imprinted genes [35].

However, to our knowledge, prediction of imprinted genes by deeply sequencing transcriptomes (mRNA-Seq) from multiple independent tissues is still an open problem. In this study, we proposed a Bayesian model – dsPIG (d eep s equencing-based P rediction of I mprinted G enes) – to predict imprinted genes based on allelic expression inferred from observed SNPs in mRNA-Seq data of independent human tissues. With dsPIG, we were able to measure the imbalance of allelic expression among various tissues and calculate the posterior probability of imprinting status for each gene. Under a stringent probabilistic cut-off of the posteriors and other reasonable biological criteria, we identified 57 potentially imprinted genes from 9 diverse human tissues and 94 potentially imprinted genes from 20 cerebellar cortices, with an expected low false discovery rate (FDR). Furthermore, analysis of allelic expression of the same genes among different tissues revealed that, in some cases, even if a gene was biallelically expressed, one allele always had higher expression level than the other.

Results

Statistical model development

Monoallelic expression generally falls into one of three categories: imprinted expression, random monoallelic expression and X-inactivation, all of which express only one of two alleles in a single cell [110]. At a tissue level, however, random monoallelic expression will allow both alleles to be detected in total RNA because of the “mosaicism” of the tissue [9, 36] (also see discussion). Because our study was based on whole transcriptomes of tissue samples, random monoallelic expression was reasonably considered as biallelic expression when averaged over the entire tissue. X-inactivation was also excluded from this study by discarding all predictions on the X chromosome. Thus imprinting is the most likely cause of the observed monoallelic expression among transcriptomes of different tissues even though we cannot infer the parent of origin.

We used known SNPs from dbSNP [37] to distinguish and count the two alleles of each gene. If a gene was imprinted, we expected to observe only one of the two alleles of each SNP in the exon region from the whole transcriptome. With the allelic counts obtained from the mRNA-Seq data (see Materials and Methods), we developed a Bayesian model (dsPIG) to compute the posterior probability of imprinting based on each single SNP. Suppose we have sequenced transcriptomes from n independent tissue samples. For each sample, we count the alleles of all known SNPs, discarding those with 0 counts. For each SNP, let the allelic counts be: (x 1 , y 1 ), (x 2 , y 2 )…, (x n , y n ), where x k and y k are the counts for two alleles X and Y in the sample k (k= 1, …, n). Because each gene can only have two statuses: imprinted (I) or non-imprinted (NI), we consider (I, NI) as a binary event vector for the imprinting status. By denoting data = {(x 1 , y 1 ), (x 2 , y 2 )…, (x n , y n )}, we have by Bayes’ Theorem:

Pr I | d a t a = Pr d a t a | I × Pr I Pr d a t a | I × Pr I + Pr d a t a | N I × Pr N I
(1)

where Pr(I | data) is the posterior probability of imprinting and Pr(I) is the prior of imprinting. Based on current knowledge of prevalence of imprinted genes [25], we set the prior Pr(I)=1%, thus Pr(NI) = 1 – Pr(I) = 99%. Since samples were independent of each other and genotype has only 3 possible combinations (XX, XY, YY), we denote the genotype as follows:

δ = { 1 the genotype is X X 2 the genotype is X Y 3 the genotype is Y Y

Assuming p and q are the allele frequencies for allele X and Y, p + q =1. According to Hardy-Weinberg equilibrium, the prior probabilities for the three genotypes are calculated as follows:

Pr δ = { = Pr X 2 = p 2 δ = 1 = Pr ( X ) Pr ( Y ) = 2 p q δ = 2 = Pr Y 2 = q 2 δ = 3

Since values of p and q can be retrieved from dbSNP [37], p and q are treated as constants.

We used the law of total probability to calculate the likelihood Pr(data|I) as follows:

Pr data| I = k = 1 n Pr x k , y k | I n 1
(2)
= k = 1 n Pr x k , y k | I , δ = 1 × p 2 + Pr x k , y k | I , δ = 2 × 2 p q + Pr x k , y k | I , δ = 3 × q 2

By assuming that (i) the transcript levels of a gene’s two alleles are equal if the gene is biallelically expressed, (ii) two different alleles, if both expressed, have the same chance to be sequenced by mRNA-Seq, and (iii) unexpressed alleles with counts >0 are due to sequencing errors, we had the following derivation:

Pr x k , y k | I , δ = 1 = f y k ; n k , p e
(3)
Pr x k , y k | I , δ = 3 = f x k ; n k , p e
(4)
Pr x k , y k | I , δ = 2
= Pr x k , y k | I , δ = 2 , θ = 0 × Pr θ = 0 + Pr x k , y k | I , δ = 2 , θ = 1 × Pr θ = 1
= f y k ; n k , p e × 1 2 + f x k ; n k , p e × 1 2
(5)

Here, f denotes binomial distribution [i.e., f x ; n , p = n x 1 p n x p x ], n k  = x k  + y k is assumed fixed, and p e is the averaged sequencing error rate for each SNP (p e was obtained from Wang et al. 2008). The binary variable θ is defined as follows θ = { 0 Only X can be expressed due to imprinting 1 Only Y can be expressed due to imprinting . Since X and Y have an equal chance to be inherited from either maternal or paternal genome, X and Y have an equal chance to be expressed in imprinted genes. Hence, Pr θ = { 1 / 2 θ = 0 1 / 2 θ = 1 .

Similarly for the likelihood function Pr(data|NI), we have:

Pr d a t a | N I = k = 1 n Pr x k , y k | N I n 1
(6)
= k = 1 n Pr x k , y k | N I , δ = 1 × p 2 + Pr x k , y k | N I , δ = 2 × 2 p q + Pr x k , y k | N I , δ = 3 × q 2
Pr x k , y k | N I , δ = 1 = f y k ; n k , p e
(7)
Pr x k , y k | N I , δ = 3 = f x k ; n k , p e
(8)

Based on our three assumptions (i) – (iii), Pr(x k , y k |NI, δ = 2) follows a binomial distribution with p=0.5. Therefore, we have:

Pr x k , y k | N I , δ = 2 = x k + y k x k 1 2 x k + y k
(9)

Computation is performed separately for each single SNP. Therefore, a posterior probability of imprinting for a gene is associated with a specific SNP in this gene, unless otherwise specified.

Simulation-based model analysis

We generated allelic counts from simulated data by taking into account imprinting status, SNP frequency, and sequencing error. We then applied dsPIG to estimate the sensitivity, specificity and the FDR. We generated two sets of allelic counts under the assumption that the locus was either imprinted or not. The number of reads for each allele was generated assuming the presence of one dominating allele plus sequencing error for imprinted case; and presence of equal amount of two alleles plus sequencing error for non-imprinted case. The generated allelic counts followed a distribution similar to the actual mRNA-Seq data (Additional file 1 Figure S1). More details of the procedure were illustrated in Additional file 2 Figure S2. Given an allele frequency (0.5), a sequencing error (0.02) and a prior (0.01) of imprinting, the posterior probability calculated from the allelic counts generated for imprinted genes approached 1 as the sample size increased, while concomitantly the posterior probability for biallelically expressed approached 0 (Figure 1a,b; for each SNP, only the tissue samples with allelic counts >2 were considered valid samples, and we used “sample size” to refer to the number of valid samples in this study). With minor allele frequencies between 0.005 to 0.5, sequencing errors between 0.01 to 0.05 and priors between 0.005 to 0.02, we obtained similar results as Figure 1a,b. Using 0.2 as the cut-off for posteriors Pr(I | data) (see Model Analysis Based on Independent Test Sets), sensitivity of our model-based prediction exceeded 99.9% when sample size was >9, and specificity exceeded 99.99% when sample size was >18 (Figure 1c). Under the same cut-off and the allele frequency, the FDR of predicted imprinted genes decreased as sample size increased: it dropped to 5% and 1% as sample size exceeded 20 and 25, respectively (Figure 1d). Using the same cut-off (0.2), we also examined how FDR varied when minor allele frequency changed from 0.1 to 0.5 and sample size increased from 1 to 50 (Figure 2). Based on these simulations, we were able to provide estimated FDRs for most of our predictions in this study (Additional file 3 Table S1).

Figure 1
figure 1

Simulation-based performance analysis of dsPIG. a, b, Simulated (natural log-transformed) posteriors of (a) biallelically expressed genes and (b) imprinted genes. The dashed line in both panels stands for the log-transformed prior (0.01). Results in (a) and (b) were based on 20,000-time simulations with geometric mean as the method of averaging posteriors. c, Sensitivity (the black solid line) and specificity (the read dashed line) of our model. d, the FDR of our predictions. When sample size was <5, the FDR was not computable as sensitivity and specificity were both 0. Results in (c) and (d) were based on 20,000-time simulations with geometric mean as the method of averaging posteriors.

Figure 2
figure 2

FDRs of our predictions with respect to different allele frequencies. When minor allele frequency (mAF) decreased from 0.5 to 0.1, FDR generally increased if sample size was >10. Results were based on 20,000-time simulations. For detailed values of FDR, please refer to Additional file 4: Table S2.

Predictions of imprinted genes

We collected two previously published mRNA-Seq data sets. One set included 9 diverse tissue samples (Group I) from Wang et al. 2008, and the other set had 20 cerebellum cortex samples (Group II) from Mudge et al. 2008 (Table 1; see Data Collection in Materials and Methods) [38, 39]. Wang et al. 2008 showed that, in terms of alternative isoform expression, cerebellum tissues were clustered together and the 9 diverse samples were more closely correlated. Here we performed hierarchical clustering based on the imprinting-inclined SNPs (i.e., SNPs with posteriors >0.01) and obtained similar results (Figure 3; see Sample Clustering in Materials and Methods). As we expected, samples from the cerebellar cortex were clustered together, with Caucasian and African American separated in two sub-clusters (Figure 3a). Using Caucasian allele frequency on African American samples, however, yielded a sub-cluster without separation between the two ethnicities (Figure 3b). This suggests that the separation observed in Figure 3a was due to differences in minor allele frequencies. As a test set, 3 breast cancer cell line samples were clustered together in both cases. Compared with other non-cerebellum samples, the brain sample had higher correlation with cerebellum samples in both cases, which is sensible biologically. The result that the 9 diverse tissue samples were clustered together could be caused by many factors such as different experimental conditions between Group I and Group II samples.

Table 1 Sample information and sequencing results from 9 various normal tissues (Group I) and 14 cerebellar cortices with schizophrenia (SCZD) and 6 normal cerebellar cortices (Group II)
Figure 3
figure 3

Sample clustering in terms of imprinting-inclined SNPs. Spearman correlations were calculated between each pair of samples using the posterior on each SNP calculated by dsPIG in each sample. Hierarchical clustering was conducted with average linkage as the agglomerative method. Posterior probabilities of African American samples were computed with African American allele frequency in panel (a) and with Caucasian allele frequency in panel (b).

Using dsPIG, we predicted imprinted genes for Group I and Group II separately. To call a gene imprinted, we checked the posteriors of all the SNPs on the same gene to make sure that (i) the gene had at least one SNP with a posterior >0.2, which was the same cut-off used in simulations, and (ii) all the other SNPs did not show any contradictory evidence (i.e., all the other posteriors were >0.002 [our 20,000-time simulations showed that 95% of the posteriors of imprinted genes were >0.002]). After applying dsPIG to the mRNA-Seq data and using the above criteria, we predicted 57 potentially imprinted genes for Group I samples and 94 potentially imprinted genes for Group II samples out of a total of 20,559 genes (0.28% and 0.46%, respectively) that had allele frequency data in dbSNP. The distribution of sample sizes for SNPs was shown in Additional file 5 Figure S3. We listed the top 15 predictions for both groups with respect to their posterior probabilities of imprinting in Table 2 (see Additional file 3 Table S1 for the complete list). Surprisingly, we found only 13 common genes between the 57-gene list and the 94-gene list. Functional enrichment analysis performed in Ingenuity® Pathway Analysis (IPA) showed that the two lists of genes were significantly enriched (p-value<0.05 after “BH” correction [40]) in certain Bio Functions but not in any canonical pathways defined by IPA (Additional file 6 Figure S4). We also compared our predictions with the 371 genes that are subject to random monoallelic expression [1] and found none of our predicted genes in either group overlapped with them, which further validates the quality of our predictions and strongly suggests that our predictions are not affected by random monoallelic expression.

Table 2 The top 30 predictions of imprinted genes based on mRNA-Seq data from Group I and Group II samples

Model analysis based on independent test sets

We collected 66 validated imprinted genes and 155 validated non-imprinted genes ([31, 41]; see Data Collection) and used them to check for false-positive and false-negative predictions made by dsPIG. Based on the mRNA-Seq data of Group I samples, 28 of the 66 imprinted genes and 119 of the 155 non-imprinted genes had allelic counts for known SNPs; based on the mRNA-Seq data of Group II samples, 26 of the 66 imprinted genes and 110 of the 155 non-imprinted genes had allelic counts for known SNPs. Under the same criteria used to predict imprinted genes in Table 2, dsPIG identified 2 out of 28 imprinted genes from Group I samples and 9 out of 26 imprinted genes from Group II samples, based on the mRNA-Seq data; under the same criteria, dsPIG misidentified 0 out of 119 non-imprinted genes from Group I samples and 0 out of 110 non-imprinted genes from Group II samples. Moreover, among those imprinted genes, only 8 out of 28 (Group I) and 17 out of 26 (Group II) have sample sizes >5. On the contrary, all 11 dsPIG-identified imprinted genes (2 for Group I and 9 for Group II) have sample size >5. Therefore, for imprinted genes with sample size > 5, 2 out of 8 genes and 9 out of 17 genes could be identified by dsPIG for Group I and Group II, respectively. Again, this showed that the sensitivity of dsPIG increased as sample size increased and dsPIG could probably identify more imprinted genes if the number of sequenced tissue samples further increased. This also agreed with the simulation results. Interestingly, some of the validated imprinted genes had very small posteriors (<10-8), which indicated that they had biallelic expression (or random monoallelic expression) in certain tissues (Table 3).

Table 3 Tissues where validated imprinted genes most likely had biallelic expression

We used the same sets of imprinted and non-imprinted genes to determine the cut-off of the posteriors used in the prediction of imprinted genes. Because most genes (~99%) are expected to be non-imprinted, the cut-off has to yield a very high specificity (>99%) so that the overall FDR of our predictions can be low enough (<50%) for further validations. After trying different cut-offs (0.1, 0.2, …, 0.9), we found 0.2 to be the most appropriate cut-off in terms of the validated gene sets because (i) increasing the cut-off from 0.2 only lowered the sensitivity while left the specificity unchanged (~0%), and (ii) decreasing the cut-off from 0.2 lowered the specificity while the sensitivity didn’t change a lot, which substantially increased the FDR of predictions (Additional file 7: Figure S5). We also showed the ROC curves for dsPIG in Additional file 7: Figure S5.

Candidates for experimental validation

We chose top 30 candidate genes from our predictions and listed them in Table 4. Except one SNP (rs#3106189), all the SNPs in Table 4 have high minor allele frequencies (>0.184), which indicates >30% chance of observing heterozygous alleles in experiments. In addition, these genes also met at least one of the following three criteria: (i) their SNPs (the 4th column of Table 4) had a relatively low FDR (<0.3), (ii) they had multiple SNPs with posteriors > 0.2 (dsPIG calculated a posterior for each SNP in a gene), and (iii) they were located near existing imprinted genes (distance <2M base pairs) [10, 42]. These additional criteria further increased the possibility of imprinting.

Table 4 Suggested predictions for experimental validation.

Detection of allele-preferred expression

By investigating the biallelically expressed genes identified in the mRNA-Seq data, we found that at least 18 genes expressed significantly more transcripts from one specific allele than the other among various individuals and tissues (P < 0.05 by binomial test; Table 5; see Binomial Test in Materials and Methods). This indicated that the difference between expression levels of the two alleles was not caused by sequencing errors or stochastic effects in RT-PCR. In future, as more mRNA-Seq data are generated, if more genes with one specific allele always under-expressed are observed, we would speculate that a sophisticated mechanism (such as nonsense-mediated mRNA decay [43]) may exist to explain this type of allelic preference in gene expression for biallelically expressed genes.

Table 5 20 SNPs of which one specific allele had a higher transcript level than the other one among various tissues and individuals

Web-based service and R package for dsPIG

We have provided a web-based service for dsPIG at http://www.shoudanliang.com/dsPIG/. Users need to upload either mapped mRNA-Seq data in the supported format or processed data files containing allelic counts for each SNP (see the website for more details). After uploading the data, users may set the values for (i) the cut-off for posteriors, (ii) the average sequencing error rate, (iii) the prior of imprinting, and (iv) QS (quality score). In addition, users need to specify the human genome build and the SNP build for dsPIG. Our server will calculate the posterior probabilities of imprinting for each gene and email the results back to the users. In the final submission form, users may request additional analysis, such as suggesting tissues where known imprinted genes most likely have biallelic expression. In addition, we have made an R package for dsPIG, which is available both on the website and in the supplementary materials (Additional file 8 for UNIX and Additional file 9 for Windows). The instruction and the sample files for this R package are in Additional file 10. The web service and the R package generate the same result on the predictions of imprinted genes. For reference, we have also provided the annotated code for dsPIG (including both the R code and the C code) used in this study as Additional file 11 and on the website.

Discussion

dsPIG operates under the assumption that if a gene is biallelically expressed, the transcript levels of two alleles are the same. However, because imbalance of allelic expression has been widely detected in human tissues [27, 4446], our assumption may not always be correct. Indeed, within our 29-tissue samples, allelic imbalance was observed even after excluding possible expression from imprinted genes (data not shown here). However, our simulation studies showed that our model can still distinguish imprinted genes from non-imprinted ones in most situations (Figure 4), unless one specific allele is always expressed at <13% of the other allele’s expression level across different samples (no supporting literature for this yet). Stochastic RT-PCR amplification [4] is not of particular concern in dsPIG because this has been taken into account as allelic imbalances.

Figure 4
figure 4

Effect of imbalanced transcript levels on the posteriors of biallelically expressed genes. Solid lines stand for simulated posteriors for imprinted genes (black line) and biallelically expressed genes (non-black lines). “Randomly imbalanced” means that in each sample we randomly picked one allele to have a lower expression level than the other allele. FIC indicates “Fixed Imbalanced Coefficient”, which means one allele is always expressed at a “FIC” level of the other one in all samples. The dashed line stands for the log-transformed prior. When FIC is low enough (typically <13%), posteriors are not able to tell the difference between imprinted (solid black line) and biallelic expression (green line).

dsPIG is sensitive to biallelic expression and unlikely to falsely predict imprinted genes. A gene will get a very low posterior probability of imprinting when it obviously has biallelic expression in one tissue, even if only one allele of this gene is observed in the transcriptomes from all other tissues (Additional file 12: Figure S6). For example, a gene may have monoallelic expression because of allele-specific differences caused by either heterozygous SNPs or somatic mutations in the promoter region of the gene; as long as monoallelic expression caused by these conditions is not present population-wide and a large number (e.g., >25 in our study) of independent tissue samples is used in dsPIG, these genes will not be falsely predicted as imprinted genes. However, for the same reason, if transcriptomes are collected from various tissues, it becomes very hard for dsPIG to detect tissue-specific imprinted genes. This partially explains why dsPIG predicted much less imprinted genes in Group I samples than in Group II samples (another reason is that Group II has more samples than Group I).

One advantage of dsPIG is that it is able to predict imprinted genes without sequencing the genotype. Although a homozygous allele will also lead to identification of only one allele in the transcriptome, the result will not elevate the posterior belief of imprinting in our model. This is very important and practical for human because we no longer need to sequence the genome for genotypes. However, a disadvantage is that dsPIG cannot tell the parent of origin for predicted imprinted genes, which can be verified only by other studies.

One obvious limit of dsPIG is that it was modeled based on single SNPs. This means, if a gene has more than 1 SNP site in its exons, it may have different single-SNP-based allelic counts and thus different posterior probabilities from dsPIG. Therefore, one single strong posterior cannot determine the imprinting status of this gene. Instead, we have to look into all SNP sites of each gene and make sure no contradictory posterior exists for our predictions (as stated in results). In future, a possible improvement would be integration of all SNP information of a gene and calculate a single posterior to predict imprinting status.

Conclusions

In this paper, we proposed a new method – dsPIG, applicable to all mammals with genomic imprinting, to predict imprinted genes based on mRNA-Seq data of various independent tissues. With enough sequenced samples, dsPIG is capable of predicting imprinted genes on a genome-wide scale with expected low FDRs. The power of dsPIG will be further enhanced after more data generated by mRNA-Seq technology become available in the near future.

Methods

Data collection

To predict gene imprinting status, we used ~650 million short reads from 29 human tissue samples [38, 39]: 206 million from 9 different normal tissues, 320 million from 14 cerebellar cortices with schizophrenia and 124 million from 6 normal cerebellar cortices. Tissue samples were collected from 29 independent individuals, among which 20 were Caucasian and 9 were African American (Table 1). We downloaded SNP data from UCSC genome browser using the table function [47] and used SNPs with only two alleles to make predictions in this study. The source of the original data was SNP Database (dbSNP) build 129 from NCBI [37]. Known imprinted and non-imprinted genes were collected from Additional file 3: Table S1 and Additional file 4: Table S2 in the supplementary research data of Luedi et al. (2007) [31] and from Catalogue of Parent of Origin Effects (http://igc.otago.ac.nz/home.html) [41].

Calculation of allelic counts

RNA-Seq reads were mapped to human genome hg18 from UCSC genome browser using Eland (GAPipeline-1.3.2). The unmapped reads were mapped to the exon-exon junctions downloaded from http://genes.mit.edu/burgelab/mrna-seq/[38]. The junctions contain 56 (28×2) base pairs in total, allowing the reads (32 base pairs) to be mapped with a minimum of 4 base pairs on each side of the junctions [38]. To compute the number of alleles for each SNP, we scanned each mapped tag for all known SNPs (in terms of dbSNP) and counted the number of times each nucleotide occurred at each SNP position in each sample. To reduce the amount of calculation, we only retained SNPs that were covered by any sequencing tag in any sample. This generated allelic counts for a total of 1,261,906 SNPs in the 29 tissue samples. We then discarded SNPs with unknown frequency and very low allelic counts (i.e., total allelic counts <3 in each sample). In addition, we defined a quality score-QS (see Definition of QS) and discarded SNPs with QS <0.9 in all 29 samples (see Additional file 13: Figure S7). After these steps, allelic counts for 82,916 SNPs remained, and these were used in dsPIG.

Definition of QS

We only used biallelic SNPs in dbSNP, which is the majority of SNPs. Nucleotides observed other than these two (alleles) were considered as sequencing errors due to low sequencing quality at the SNP site, or indicted that the allelic information of the SNP was wrong in dbSNP. Thus, we defined quality score-QS to improve the quality of the SNPs used in our predictions:

QS k = x k + y k x k + y k + e k

In the above equation, x k and y k are the allelic counts of allele X and Y in the kth sample (X and Y were determined by dsSNP), and e k is the count of additional nucleotide(s). QS was calculated for each SNP in each sample. We arbitrarily chose 0.9 as a cut-off for QS ( Additional file 13: Figure S7) and only those SNPs with QS>0.9 were used in dsPIG.

Sample clustering

By applying dsPIG to each of the 29 samples, we obtained 29 lists of posterior probabilities (each list had 87,852 posteriors for 87,852 SNPs), which were first multiplied by 100 and then natural logarithm transformed. Thus, if the posterior was the same as the prior (0.01), it would become 0 after the transform. After that, for each SNP, all posteriors ≤ 0 were modified to 0, while all posteriors > 0 were kept at the same value. If all 29 posteriors for a SNP were 0, the SNP was removed from the 29 lists. By doing this, only the SNPs that showed an increased probability of imprinting were kept for clustering. We then computed spearman correlations between samples based on the remaining 29 lists of posteriors, used these correlation values to determine distance between samples, and performed hierarchical clustering in R (http://www.r-project.org). By using this method, we clustered samples in terms of imprinting-inclined SNPs, and thus reduced the influence of biallelically expressed genes.

Binomial test

For a single SNP, we first defined P k = x k y k , where x k and y k are the allelic counts for its alleles X and Y in the sample k (k=1, …, n). We then defined z = k = 1 n I P k , where I P k = { 1 , if P k > 1 0 , if P k < 1 . We used the following criteria to find the SNPs with both alleles (XY) expressed in the sample k: x k >10, y k >10, x k +y k >50, x k /y k <10 and y k /x k <10. The criteria made it very unlikely to observe biallelic expression because of sequencing error. For each SNP that met the criteria in n samples, if no allele was preferably expressed by the transcription machinery, z should follow a binomial distribution f(z; n, p) with p=0.5. To obtain enough testing power, SNPs with n ≥ 8 were deemed qualified for the binomial test. Under the above criteria, we found 24 qualified SNPs and listed all 20 significant testing results in Table 5.