Signature Recognition Methods for Identifying Influenza Sequences
Accuracy is one of the most important issues in identifying biological sequences; however, because biological data are growing exponentially and are highly diverse, the need to compute within a reasonable time usually compromises accuracy. We propose novel approaches for accurately identifying DNA sequences in less time by discovering sequence patterns, called signatures, that carry enough distinctive information to identify a sequence. The approaches search for the best combination of n-gram patterns and six statistical scoring algorithms that are regularly used in Information Retrieval research, and then employ the resulting signatures to build a similarity scoring model for identifying the DNA. We develop two approaches to signature discovery. The first uses only statistical information extracted directly from the sequences. The second uses prior knowledge of the DNA in the signature discovery process. From our experiments on influenza virus, we found that: 1) our technique can identify the influenza virus with an accuracy of up to 99.69% when 11-grams are used and prior knowledge is applied; 2) signatures that are too short or too long produce lower efficiency; and 3) most scoring algorithms perform well for identification, except the Rocchio algorithm, whose results are approximately 9% lower than those of the others. Moreover, this technique can be applied to identifying other organisms.
Keywords: Influenza Virus; Query Sequence; Biological Sequence; Cross Entropy; Target Dataset
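The pipeline outlined in the abstract (extract n-grams, score them statistically, keep the best as signatures, then score a query against those signatures) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `ngrams`, `discover_signatures`, and `similarity` are hypothetical, the score (difference in document frequency between target and non-target sequences) is a simple stand-in and not one of the six scoring algorithms evaluated in the paper, and the similarity measure (fraction of signatures present in the query) is likewise only one plausible choice.

```python
from collections import Counter


def ngrams(seq, n):
    """Return the set of distinct n-grams (substrings of length n) in seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def discover_signatures(target_seqs, other_seqs, n, top_k):
    """Keep the top_k n-grams that best separate target from other sequences.

    Assumption: the score is the share of target sequences containing the
    n-gram minus the share of non-target sequences containing it. This is a
    placeholder statistic, not a scoring algorithm taken from the paper.
    """
    # Document frequency: in how many sequences does each n-gram appear?
    target_df = Counter(g for s in target_seqs for g in ngrams(s, n))
    other_df = Counter(g for s in other_seqs for g in ngrams(s, n))
    n_target = max(len(target_seqs), 1)
    n_other = max(len(other_seqs), 1)
    scores = {
        g: target_df[g] / n_target - other_df[g] / n_other
        for g in target_df
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return ranked[:top_k]


def similarity(query, signatures):
    """Return the fraction of signatures that occur in the query sequence."""
    if not signatures:
        return 0.0
    n = len(signatures[0])
    query_grams = ngrams(query, n)
    return sum(g in query_grams for g in signatures) / len(signatures)
```

In use, one would call `discover_signatures` once on a labelled training set and then apply `similarity` to each query sequence, classifying it as the target organism when the score exceeds a chosen threshold. The threshold, the value of `n`, and `top_k` would all need tuning; the abstract reports only that 11-grams gave the best accuracy in the authors' experiments.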