Biological Sequence Data Preprocessing for Classification: A Case Study in Splice Site Identification
The rapid growth of biological sequence data demands better and more efficient analysis methods. Effective detection of regulatory signals in these sequences requires knowledge of the characteristics, dependencies, and relationships of the nucleotides in the region surrounding each signal. A higher-order Markov model is generally regarded as a useful technique for modeling higher-order dependencies among nucleotides. However, implementing one requires estimating a large number of parameters, which is computationally expensive. In this paper, we propose a hybrid method consisting of a first-order Markov model for sequence data preprocessing and a multilayer perceptron neural network for classification. The Markov model captures the compositional features and dependencies of nucleotides as probabilistic parameters, which are used as inputs to the classifier. The classifier then combines the Markov probabilities nonlinearly for signal detection. When applied to the splice site detection problem on three widely used data sets, the proposed hybrid method is observed to model higher-order dependencies and to achieve better classification accuracy.
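The preprocessing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the position-specific form of the model, the add-one pseudocount, the toy training sequences, and the function names `fit_first_order_markov` and `markov_features` are all assumptions made for illustration. The sketch estimates first-order transition probabilities from aligned sequences and then encodes a sequence as the vector of those probabilities, which is the kind of input a multilayer perceptron could combine nonlinearly.

```python
ALPHABET = "ACGT"


def fit_first_order_markov(sequences, pseudocount=1.0):
    """Estimate position-specific P(x_i | x_{i-1}) from aligned, equal-length sequences.

    Returns a list with one table per transition (positions 1 .. L-1), where
    table[prev][cur] is the smoothed conditional probability.
    """
    length = len(sequences[0])
    # Start every count at the pseudocount so unseen transitions get nonzero probability.
    counts = [
        {prev: {cur: pseudocount for cur in ALPHABET} for prev in ALPHABET}
        for _ in range(length - 1)
    ]
    for seq in sequences:
        for i in range(1, length):
            counts[i - 1][seq[i - 1]][seq[i]] += 1

    probs = []
    for table in counts:
        normalized = {}
        for prev, row in table.items():
            total = sum(row.values())
            normalized[prev] = {cur: row[cur] / total for cur in ALPHABET}
        probs.append(normalized)
    return probs


def markov_features(seq, probs):
    """Encode a sequence as its per-position transition probabilities."""
    return [probs[i - 1][seq[i - 1]][seq[i]] for i in range(1, len(seq))]


# Hypothetical toy data: four aligned windows around a candidate site.
train = ["AGGT", "AGGT", "CGGT", "AGGA"]
model = fit_first_order_markov(train)
features = markov_features("AGGT", model)
print(features)
```

In a full pipeline one would typically fit separate models on true and false sites and pass the resulting probability vectors to the classifier; that choice, like the smoothing, is an assumption here rather than something the abstract specifies.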