Accurate prediction of gene structure depends on accurate prediction of splice sites. Conserved features at splice junctions have been used successfully to predict eukaryotic splice sites. Although the di-nucleotide GT is conserved at 5′ splice sites in eukaryotes, the pattern surrounding it varies from species to species. Most splice site analyses have been carried out in Homo sapiens and Arabidopsis thaliana; comparable work in Oryza sativa and other species of the grass family remains limited. In this study, statistical techniques were applied to discriminate real splice sites from pseudo splice sites in the rice, maize and barley genomes, and a suitable window size for predicting donor splice sites was determined on that basis. Using the determined window size, appropriate methods for predicting donor splice sites in rice were then compared in terms of prediction accuracy. The results show that a window of 9 base pairs (3 bp at the exon end and 6 bp at the intron start, including the conserved di-nucleotide GT at the beginning of the intron) is effective for predicting donor splice sites in all three grass species. Further, the Maximum Entropy Model based method was found to be the best among the short sequence based prediction methods for donor splice sites with the 9 bp window.
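The 9 bp window described above can be illustrated with a minimal sketch, which is not the authors' code. It collects every candidate donor site in a sequence by taking 3 bp upstream of each GT and 6 bp starting at the GT itself. The function name `donor_windows`, its parameters and the example sequence are hypothetical and chosen only for illustration.

```python
def donor_windows(seq, exon_bp=3, intron_bp=6):
    """Return (gt_index, window) for every GT that has complete flanks.

    The window covers exon_bp bases before the GT and intron_bp bases
    starting at the GT, i.e. 3 + 6 = 9 bp with the default settings.
    """
    seq = seq.upper()
    windows = []
    start = seq.find("GT")
    while start != -1:
        left = start - exon_bp
        right = start + intron_bp
        # Skip GT occurrences too close to either end of the sequence.
        if left >= 0 and right <= len(seq):
            windows.append((start, seq[left:right]))
        start = seq.find("GT", start + 1)
    return windows


# Hypothetical toy sequence, not taken from the rice, maize or barley data.
example = "CCAAGGTAAGTCCGTT"
for gt_index, window in donor_windows(example):
    print(gt_index, window)
```

Every window returned this way is only a candidate: whether it is a real or a pseudo donor site still has to be decided by a scoring method such as those compared in the study.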
Keywords: Splice sites; Prediction accuracy; Window size; Short sequence motif
Machine Learning Approaches
Maximum Entropy Modeling
Maximal Dependency Decomposition
Markov Model of 1st order
Weighted Matrix Method
This study is part of the Ph.D. thesis of P. K. Meher, PG School, IARI, New Delhi. The authors acknowledge the INSPIRE fellowship of the Department of Science and Technology, New Delhi, and the IARI Fellowship. The authors also acknowledge the computational facilities of SCGL, developed under NAIP grant NAIP/Comp-4/C4/C-30033/2008-09.
Supplementary Fig. 1 Confusion matrix. TP is the number of true splice sites (TSS) predicted as TSS, TN is the number of false splice sites (FSS) predicted as FSS, FN is the number of TSS incorrectly predicted as FSS, and FP is the number of FSS incorrectly predicted as TSS. (TIFF 252 kb)
Supplementary Fig. 2 Bar diagram of the Pearson chi-square values calculated from the TSS and FSS sequence data for the three species. The X-axis represents the positions of the motif, and the height of each bar corresponds to the chi-square value at that position. (TIFF 748 kb)
Supplementary Fig. 3 Graphical representation of the Kullback-Leibler divergence at different positions of the splice site motifs. The height of each bar represents the distance between the true and false splice sites at the corresponding position. (TIFF 187 kb)
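As a rough illustration of the per-position Kullback-Leibler divergence plotted in Supplementary Fig. 3, the sketch below compares nucleotide frequencies in true and false windows one position at a time. It is an assumption-laden example rather than the authors' procedure: the function names, the add-one pseudocount, the log base 2, the direction of the divergence (true relative to false) and the toy windows are all hypothetical, and the windows are assumed to contain only A, C, G and T.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_frequencies(windows, pseudocount=1.0):
    """Per-position nucleotide frequencies, smoothed with a pseudocount."""
    length = len(windows[0])
    total = len(windows) + pseudocount * len(BASES)
    freqs = []
    for i in range(length):
        counts = Counter(w[i] for w in windows)
        freqs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return freqs


def kl_per_position(true_windows, false_windows, pseudocount=1.0):
    """KL divergence (in bits) of true vs. false frequencies at each position."""
    p = position_frequencies(true_windows, pseudocount)
    q = position_frequencies(false_windows, pseudocount)
    return [
        sum(pi[b] * math.log2(pi[b] / qi[b]) for b in BASES)
        for pi, qi in zip(p, q)
    ]


# Hypothetical toy windows (9 bp, GT at offsets 3-4), not real genome data.
true_sites = ["AAGGTAAGT", "CAGGTGAGT"]
false_sites = ["TTCGTCCAT", "GCAGTTTCA"]
for position, value in enumerate(kl_per_position(true_sites, false_sites)):
    print(position, round(value, 3))
```

Because both classes carry GT at the same offsets, the divergence at those two positions is zero, so any discriminating signal has to come from the flanking positions.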