Detection of Splice Sites Using Support Vector Machine

  • Pritish Varadwaj
  • Neetesh Purohit
  • Bhumika Arora
Part of the Communications in Computer and Information Science book series (CCIS, volume 40)


Automatic identification and annotation of the exon and intron regions of a gene from DNA sequences has been an important research area in computational biology. Several approaches, namely Hidden Markov Models (HMM), Artificial Intelligence (AI) based machine learning, and Digital Signal Processing (DSP) techniques, have been used extensively and independently by various researchers to address this challenging task. In this work, we propose a Support Vector Machine (SVM) based kernel learning approach for detecting splice sites (the exon-intron boundaries) in a gene. Electron-Ion Interaction Potential (EIIP) values of nucleotides are used to map character sequences to the corresponding numeric sequences. An SVM with a Radial Basis Function (RBF) kernel is trained on the EIIP numeric sequences. The trained classifier was then evaluated on a test gene dataset, detecting splice sites by shifting a window of 12 residues along the sequence. The window size and several important parameters of the SVM kernel were optimized for better accuracy. Receiver Operating Characteristic (ROC) curves were used to display the sensitivity of the classifier, and the results showed 94.82% accuracy for splice site detection on the test dataset.
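
The encoding and windowing steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The EIIP values are the ones commonly cited in the genomic signal processing literature (the abstract does not list them), and the function names, the toy sequence, and the `gamma` default are assumptions made for illustration. The sketch maps nucleotides to EIIP values, cuts the numeric sequence into shifted windows of 12 residues, and evaluates the standard RBF kernel between two windows.

```python
import math

# Commonly cited EIIP values per nucleotide (assumption: the abstract
# does not state the exact values used by the authors).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_encode(seq):
    """Map a DNA string to its EIIP numeric sequence."""
    return [EIIP[base] for base in seq.upper()]


def sliding_windows(values, size=12, step=1):
    """Yield (start_index, window) for every full window of `size` values."""
    for start in range(0, len(values) - size + 1, step):
        yield start, values[start:start + size]


def rbf_kernel(x, y, gamma=1.0):
    """Standard RBF kernel exp(-gamma * ||x - y||^2) for equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


if __name__ == "__main__":
    seq = "ATGGTAAGTCCTAGGCA"  # toy sequence, not taken from the paper's dataset
    windows = [w for _, w in sliding_windows(eiip_encode(seq))]
    print(len(windows), "windows of 12 EIIP values")
    print("kernel between first two windows:", rbf_kernel(windows[0], windows[1]))
```

In a full pipeline each window would be passed as a 12-dimensional feature vector to an SVM library such as LIBSVM, with the window size and kernel parameters tuned on held-out data as the abstract describes.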


Splice site; Support vector machine; Electron-ion interaction potential





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Pritish Varadwaj (1)
  • Neetesh Purohit (1)
  • Bhumika Arora (1)
  1. Indian Institute of Information Technology, Allahabad
