Signature Recognition Methods for Identifying Influenza Sequences

  • Jitimon Keinduangjun
  • Punpiti Piamsa-nga
  • Yong Poovorawan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3581)


Accuracy is one of the most important issues in identifying biological sequences; however, because biological data are growing exponentially and are highly diverse, the need to compute within a reasonable time usually compromises accuracy. We propose novel approaches for accurately identifying DNA sequences in less time by discovering sequence patterns, called signatures, that carry enough distinctive information for sequence identification. The approaches find the best combination of n-gram patterns and six statistical scoring algorithms that are regularly used in Information Retrieval research, and then employ the signatures to create a similarity scoring model for identifying the DNA. We develop two approaches to discover the signatures. The first uses only statistical information extracted directly from the sequences. The second uses prior knowledge of the DNA in the signature discovery process. From our experiments on influenza virus, we found that: 1) our technique can identify the influenza virus with an accuracy of up to 99.69% when 11-grams are used and prior knowledge is applied; 2) signatures that are too short or too long produce lower efficiency; and 3) most scoring algorithms are good for identification, except the “Rocchio algorithm”, whose results are approximately 9% lower than the others. Moreover, this technique can be applied to identify other organisms.
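
The pipeline described in the abstract (extract n-grams, score them, keep the best as signatures, then score a query against them) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `ngrams`, `discover_signatures` and `similarity` are hypothetical, and the scoring rule (difference in document frequency between target and non-target sequences) is a simple placeholder, not one of the six scoring algorithms evaluated in the paper.

```python
from collections import Counter


def ngrams(seq, n):
    """Return all overlapping n-grams of a DNA string."""
    seq = seq.upper()
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def discover_signatures(target_seqs, other_seqs, n, top_k):
    """Pick the top_k n-grams that are most specific to the target set.

    ASSUMPTION: the score is the fraction of target sequences containing
    the n-gram minus the fraction of non-target sequences containing it.
    This stands in for the paper's statistical scoring algorithms.
    """
    def doc_freq(seqs):
        # Count, for each n-gram, how many sequences contain it.
        df = Counter()
        for s in seqs:
            df.update(set(ngrams(s, n)))
        return df

    df_target = doc_freq(target_seqs)
    df_other = doc_freq(other_seqs)
    n_target = max(len(target_seqs), 1)
    n_other = max(len(other_seqs), 1)

    scores = {}
    for gram, count in df_target.items():
        score = count / n_target - df_other[gram] / n_other
        if score > 0:  # keep only n-grams that favour the target set
            scores[gram] = score

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_k])


def similarity(query, signatures, n):
    """Weighted fraction of signatures found in the query, in [0, 1]."""
    total = sum(signatures.values())
    if total <= 0:
        return 0.0
    present = set(ngrams(query, n))
    matched = sum(w for gram, w in signatures.items() if gram in present)
    return matched / total
```

A query would then be assigned to the target organism when `similarity` exceeds a chosen threshold; the threshold, like the scoring rule, would need to be tuned on real data. The abstract reports the best accuracy with n = 11, which is a reasonable starting point for `n`.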


Influenza Virus; Query Sequence; Biological Sequence; Cross Entropy; Target Dataset
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Jitimon Keinduangjun (1)
  • Punpiti Piamsa-nga (1)
  • Yong Poovorawan (2)
  1. Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand
  2. Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
