Intelligent Data Recognition of DNA Sequences Using Statistical Models

  • Jitimon Keinduangjun
  • Punpiti Piamsa-nga
  • Yong Poovorawan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3776)


Intelligent data acquisition from biological sequences is a hard and challenging problem because most biological sequences contain large amounts of diverse, poorly understood data. Intelligent data acquisition nevertheless reduces the demand for computationally intensive methods, because the acquired data are more compact and more precise. We propose a novel approach for discovering sequence signatures, that is, information distinctive enough to identify the sequences. The signatures are derived from the best combination of n-grams and statistical scoring models. In experiments applying them to identify the Influenza virus, we found that identifiers built from n-gram signatures that are too short, or from inappropriate scoring models, perform poorly, because such combinations produce unbalanced class and pattern score distributions. Identifiers that use an appropriate combination achieve accuracy from over 80% up to 100%. Besides succeeding at signature recognition, the proposed approach also requires little computation time for biological sequence identification.
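The idea of pairing n-grams with a statistical scoring model can be illustrated with a small sketch. The abstract does not specify which scoring models the authors used, so the code below assumes information gain over n-gram presence as the scoring model, and the function names (`ngrams`, `select_signatures`, `match_score`), the toy sequences, and the "fraction of signatures matched" identification rule are all hypothetical choices made for illustration, not the paper's method.

```python
from math import log2


def ngrams(seq, n):
    """Return the set of distinct overlapping substrings of length n."""
    seq = seq.upper()
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def entropy(pos, neg):
    """Binary entropy (in bits) of a split with pos/neg sequence counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h


def select_signatures(target, others, n, top_k):
    """Rank candidate n-grams by information gain and keep the top_k.

    Assumption: candidates come from the target class only, and a
    candidate is kept only if it is relatively more frequent in the
    target class than in the other sequences.
    """
    target_sets = [ngrams(s, n) for s in target]
    other_sets = [ngrams(s, n) for s in others]
    total = len(target_sets) + len(other_sets)
    base = entropy(len(target_sets), len(other_sets))

    scores = {}
    for gram in set().union(*target_sets):
        t_with = sum(gram in s for s in target_sets)
        o_with = sum(gram in s for s in other_sets)
        if t_with / len(target_sets) <= o_with / len(other_sets):
            continue  # not indicative of the target class
        t_without = len(target_sets) - t_with
        o_without = len(other_sets) - o_with
        n_with = t_with + o_with
        conditional = (
            (n_with / total) * entropy(t_with, o_with)
            + ((total - n_with) / total) * entropy(t_without, o_without)
        )
        scores[gram] = base - conditional

    # Highest gain first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return ranked[:top_k]


def match_score(query, signatures, n):
    """Fraction of the signatures that occur in the query sequence."""
    if not signatures:
        return 0.0
    present = ngrams(query, n)
    return sum(sig in present for sig in signatures) / len(signatures)


# Toy data, invented for this sketch (not real Influenza sequences).
target = ["ATGCGTACGT", "TTGCGTAAAC"]
others = ["ATATATATAT", "CCCCAAAACC"]
signatures = select_signatures(target, others, n=3, top_k=4)
```

A query would then be assigned to the target class when `match_score(query, signatures, 3)` exceeds some threshold. Both the threshold and the n-gram length are tuning choices; the abstract's observation that overly short n-grams perform poorly suggests that `n` matters in practice, since short n-grams occur in nearly every sequence and so discriminate little.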


Keywords: Influenza Virus; Information Gain; Target Class; Biological Sequence; Candidate Signature



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Jitimon Keinduangjun (1)
  • Punpiti Piamsa-nga (1)
  • Yong Poovorawan (2)
  1. Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand
  2. Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
