Journal of Plant Biochemistry and Biotechnology, Volume 24, Issue 4, pp 385–392

Determination of window size and identification of suitable method for prediction of donor splice sites in rice (Oryza sativa) genome

  • Prabina Kumar Meher
  • Tanmaya Kumar Sahu
  • A. R. Rao (corresponding author)
  • S. D. Wahi
Original Article


Accurate prediction of gene structure depends on accurate prediction of splice sites. The conserved features of splice junctions have been used successfully to predict eukaryotic splice sites. Although the di-nucleotide GT is conserved at 5′ (donor) splice sites in eukaryotes, the pattern surrounding it varies from species to species. Most work on splice site analysis has been done in Homo sapiens and Arabidopsis thaliana; such analyses are yet to be fully explored in Oryza sativa and other species of the grass family. In this study, statistical techniques were applied to discriminate real splice sites from pseudo splice sites in the rice, maize and barley genomes, and a suitable window size for predicting donor splice sites was determined from the results. Using this window size, candidate methods for predicting donor splice sites in rice were compared in terms of prediction accuracy. The results show that a window of 9 base pairs (3 bp at the exon end and 6 bp at the intron start, including the conserved di-nucleotide GT at the beginning of the intron) is effective for predicting donor splice sites in all three grass species. Further, the Maximum Entropy Model based method was found to be the best of the short sequence based prediction methods for donor splice sites with the 9 bp window.
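The 9 bp window described above (3 exonic bases followed by 6 intronic bases beginning with GT) can be illustrated with a minimal sketch. The function name `donor_windows` and the example sequence are hypothetical and are not taken from the article; the code only shows how candidate windows could be cut around every GT in a sequence.

```python
def donor_windows(seq, exon_len=3, intron_len=6):
    """Return (gt_index, window) for every GT whose flanking bases fit in seq.

    Each window holds exon_len bases before the GT and intron_len bases
    starting at the GT, so the default window is 9 bases long.
    """
    seq = seq.upper()
    windows = []
    # Start searching at exon_len so the exonic flank is always available.
    start = seq.find("GT", exon_len)
    while start != -1:
        end = start + intron_len
        if end <= len(seq):
            windows.append((start, seq[start - exon_len:end]))
        start = seq.find("GT", start + 1)
    return windows


# Hypothetical toy sequence, used only to exercise the function.
candidates = donor_windows("CCAAGGTAAGTTTCAGGTCCGA")
for position, window in candidates:
    print(position, window)
```

Every GT yields a candidate window; separating true donor sites from pseudo sites among these candidates is the job of the prediction methods compared in the study.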


Keywords: Splice sites; Prediction accuracy; Window size; Short sequence motif



  • Machine Learning Approaches
  • Maximum Entropy Modeling
  • Maximal Dependency Decomposition
  • Markov Model of 1st order
  • Weighted Matrix Method
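As an illustration of the simplest method listed above, the following is a minimal sketch of a weighted matrix (position weight matrix) scorer. It assumes positions are independent and scores a window by the log-odds of true-site versus false-site base frequencies. The training sequences, the pseudocount, and the function names are hypothetical; the article's actual implementation is not shown in this excerpt.

```python
import math

BASES = "ACGT"


def position_frequencies(seqs, pseudocount=1.0):
    """Per-position base frequencies, smoothed with a pseudocount."""
    freqs = []
    for i in range(len(seqs[0])):
        counts = {b: pseudocount for b in BASES}
        for s in seqs:
            counts[s[i]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs


def wmm_score(window, true_freqs, false_freqs):
    """Sum of per-position log2 odds (true model vs. false model)."""
    return sum(
        math.log2(true_freqs[i][b] / false_freqs[i][b])
        for i, b in enumerate(window)
    )


# Hypothetical 9-base training windows with GT at positions 4-5 (1-based).
true_sites = ["AAGGTAAGT", "CAGGTAAGA", "AAGGTGAGT", "CAGGTAAGT"]
false_sites = ["TCTGTCCAT", "GATGTTTCC", "ACCGTCGAA", "TTAGTACTG"]

true_freqs = position_frequencies(true_sites)
false_freqs = position_frequencies(false_sites)

score_true = wmm_score("AAGGTAAGT", true_freqs, false_freqs)
score_false = wmm_score("TCTGTCCAT", true_freqs, false_freqs)
print(score_true, score_false)
```

A positive score favours the true-site model and a negative score the false-site model. A first-order Markov model would replace the per-position frequencies with frequencies conditioned on the preceding base.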



This study is part of the Ph.D. thesis of P. K. Meher, PG School, IARI, New Delhi. The authors acknowledge the INSPIRE fellowship of the Department of Science and Technology, New Delhi, and the IARI Fellowship. The authors also acknowledge the computational facilities of SCGL, developed under NAIP grant NAIP/Comp-4/C4/C-30033/2008-09.

Supplementary material

13562_2014_286_MOESM1_ESM.tif (252 kb)
Supplementary Fig. 1 Confusion matrix. TP is the number of true splice sites (TSS) predicted as TSS, TN is the number of false splice sites (FSS) predicted as FSS, FN is the number of TSS incorrectly predicted as FSS, and FP is the number of FSS incorrectly predicted as TSS. (TIFF 252 kb)
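The four confusion-matrix counts defined above can be turned into common summary measures. The sketch below computes sensitivity, specificity, accuracy and the Matthews correlation coefficient. Which measures the article actually reports is not stated in this excerpt, and the counts used here are hypothetical.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Summary measures from confusion-matrix counts.

    Assumes at least one true and one false site were evaluated, so the
    sensitivity and specificity denominators are non-zero.
    """
    sensitivity = tp / (tp + fn)   # fraction of TSS recovered
    specificity = tn / (tn + fp)   # fraction of FSS rejected
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = confusion_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
```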
13562_2014_286_MOESM2_ESM.tif (749 kb)
Supplementary Fig. 2 Bar diagram of the Pearson chi-square values calculated from the TSS and FSS sequence data for the three species. The X-axis represents positions in the motif, and the height of each bar corresponds to the chi-square value at that position. (TIFF 748 kb)
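A per-position Pearson chi-square of this kind can be sketched as a test on a 2 × 4 contingency table (site class by nucleotide) at each motif position. The code below is an assumption about how such values might be computed, not the article's procedure, and the toy sequences are hypothetical.

```python
BASES = "ACGT"


def chi_square_by_position(true_seqs, false_seqs):
    """Pearson chi-square statistic of class vs. base at each position."""
    stats = []
    for i in range(len(true_seqs[0])):
        # Observed counts: one row per class, one column per base.
        rows = [
            [sum(s[i] == b for s in group) for b in BASES]
            for group in (true_seqs, false_seqs)
        ]
        row_tot = [sum(r) for r in rows]
        col_tot = [rows[0][j] + rows[1][j] for j in range(len(BASES))]
        n = sum(row_tot)
        chi2 = 0.0
        for r in range(2):
            for j in range(len(BASES)):
                expected = row_tot[r] * col_tot[j] / n
                if expected > 0:  # skip bases absent from both classes
                    chi2 += (rows[r][j] - expected) ** 2 / expected
        stats.append(chi2)
    return stats


# Hypothetical 9-base windows with GT at positions 4-5 (1-based).
true_sites = ["AAGGTAAGT", "CAGGTAAGA", "AAGGTGAGT", "CAGGTAAGT"]
false_sites = ["TCTGTCCAT", "GATGTTTCC", "ACCGTCGAA", "TTAGTACTG"]

chi2_values = chi_square_by_position(true_sites, false_sites)
print(chi2_values)
```

Positions where both classes share the same base, such as the conserved GT, give a statistic of zero, so they carry no discriminating information.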
13562_2014_286_MOESM3_ESM.tif (188 kb)
Supplementary Fig. 3 Graphical representation of the Kullback–Leibler divergence at different positions of the splice site motifs. The height of each bar represents the distance between the true and false splice sites at the corresponding position. (TIFF 187 kb)
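The position-wise Kullback–Leibler divergence can be sketched as follows. This is an illustration under stated assumptions: the divergence is taken from the true-site base distribution to the false-site one (the measure is asymmetric, and the direction used in the article is not given in this excerpt), a pseudocount avoids zero probabilities, and the sequences are hypothetical.

```python
import math

BASES = "ACGT"


def base_probs(seqs, i, pseudocount=0.5):
    """Smoothed base distribution at position i."""
    counts = {b: pseudocount for b in BASES}
    for s in seqs:
        counts[s[i]] += 1
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}


def kl_by_position(true_seqs, false_seqs):
    """D(P_true || P_false) in bits at each motif position."""
    divergences = []
    for i in range(len(true_seqs[0])):
        p = base_probs(true_seqs, i)
        q = base_probs(false_seqs, i)
        divergences.append(sum(p[b] * math.log2(p[b] / q[b]) for b in BASES))
    return divergences


# Hypothetical 9-base windows with GT at positions 4-5 (1-based).
true_sites = ["AAGGTAAGT", "CAGGTAAGA", "AAGGTGAGT", "CAGGTAAGT"]
false_sites = ["TCTGTCCAT", "GATGTTTCC", "ACCGTCGAA", "TTAGTACTG"]

kl_values = kl_by_position(true_sites, false_sites)
print(kl_values)
```

Positions with larger divergence separate true from false sites more strongly. Comparing such values across positions is one way to judge how far a prediction window needs to extend.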



Copyright information

© Society for Plant Biochemistry and Biotechnology 2014

Authors and Affiliations

  • Prabina Kumar Meher (1)
  • Tanmaya Kumar Sahu (2)
  • A. R. Rao (2, corresponding author)
  • S. D. Wahi (1)
  1. Division of Statistical Genetics, Indian Agricultural Statistics Research Institute, New Delhi, India
  2. Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, India
