Biological Sequence Data Preprocessing for Classification: A Case Study in Splice Site Identification

  • A. K. M. A. Baten
  • S. K. Halgamuge
  • Bill Chang
  • Nalin Wickramarachchi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4492)


The rapid growth of biological sequence data demands better and more efficient analysis methods. Detecting regulatory signals in these sequences effectively requires knowledge of the characteristics, dependencies, and relationships of the nucleotides in the region surrounding each signal. A higher order Markov model is generally regarded as a useful technique for modeling such higher order dependencies. However, implementing it requires estimating a large number of parameters, which is computationally expensive. In this paper, we propose a hybrid method that combines a first order Markov model for sequence data preprocessing with a multilayer perceptron neural network for classification. The Markov model captures the compositional features and dependencies of nucleotides as probabilistic parameters, which are used as inputs to the classifier. The classifier then combines these Markov probabilities nonlinearly for signal detection. When applied to the splice site detection problem on three widely used data sets, the proposed hybrid method is observed to model higher order dependencies and to achieve better classification accuracy.
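The preprocessing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes fixed-length, aligned sequences over the alphabet A, C, G, T, position-specific transition probabilities, and additive (pseudocount) smoothing, none of which are specified in the abstract. The names `fit_first_order_markov`, `encode`, and `mlp_forward` are hypothetical, and the perceptron is shown only as a forward pass with weights supplied by the caller.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, no ambiguity codes


def fit_first_order_markov(sequences, pseudocount=1.0):
    """Estimate position-specific first order Markov parameters.

    Returns (initial, transitions): initial[b] approximates P(s_0 = b) and
    transitions[i][(a, b)] approximates P(s_{i+1} = b | s_i = a).
    The pseudocount smoothing is an assumption of this sketch.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")

    first = defaultdict(float)
    pair = [defaultdict(float) for _ in range(length - 1)]
    for seq in sequences:
        first[seq[0]] += 1
        for i in range(length - 1):
            pair[i][(seq[i], seq[i + 1])] += 1

    k = len(ALPHABET)
    total = len(sequences) + pseudocount * k
    initial = {b: (first[b] + pseudocount) / total for b in ALPHABET}

    transitions = []
    for i in range(length - 1):
        table = {}
        for a in ALPHABET:
            row_total = sum(pair[i][(a, b)] for b in ALPHABET) + pseudocount * k
            for b in ALPHABET:
                table[(a, b)] = (pair[i][(a, b)] + pseudocount) / row_total
        transitions.append(table)
    return initial, transitions


def encode(seq, model):
    """Map a sequence to one Markov probability per position."""
    initial, transitions = model
    features = [initial[seq[0]]]
    for i, table in enumerate(transitions):
        features.append(table[(seq[i], seq[i + 1])])
    return features


def mlp_forward(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer perceptron (tanh, then sigmoid)."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-z))
```

In this sketch, `encode(seq, model)` yields a vector of probabilities with one entry per sequence position, which would then be passed to a trained classifier; training the perceptron weights is left out.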


Keywords: Splice Site, Radial Basis Function Network, Donor Splice Site, Acceptor Splice Site, Splice Site Prediction



Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • A. K. M. A. Baten (1)
  • S. K. Halgamuge (1)
  • Bill Chang (1)
  • Nalin Wickramarachchi (2)
  1. Dynamic Systems and Control Research Group, DoMME, Faculty of Engineering, The University of Melbourne, Parkville 3010, Australia
  2. Department of Electrical Engineering, The University of Moratuwa, Sri Lanka
