Intelligent Data Recognition of DNA Sequences Using Statistical Models
Intelligent data acquisition from biological sequences is a hard and challenging problem, since most biological sequences contain data that are poorly understood, diverse, and voluminous. However, intelligent data acquisition reduces the demand for computationally intensive methods because the resulting data are more compact and more precise. We propose a novel approach for discovering sequence signatures, which are pieces of information sufficiently distinctive to identify the sequences. The signatures are derived from the best combination of n-grams and statistical scoring models. In experiments applying them to identify the Influenza virus, we found that identifiers constructed from n-gram signatures that are too short, or from inappropriate scoring models, perform poorly, because inappropriate combinations of n-gram signatures and scoring models lead to unbalanced class and pattern score distributions. The other identifiers, which use an appropriate combination, achieve accuracy above 80% and up to 100%. In addition to succeeding at signature recognition, the proposed approach requires little computation time for biological sequence identification.
Keywords: Influenza Virus, Information Gain, Target Class, Biological Sequence, Candidate Signature
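The abstract describes signatures built from n-grams and a statistical scoring model without giving the details here. As a rough illustration only, the Python sketch below scores candidate n-grams by information gain (one of the listed keywords) with respect to a target class, treating each n-gram as a presence/absence feature. The function names, the toy sequences, and the choice of presence/absence features are assumptions made for this sketch, not the authors' implementation.

```python
from collections import Counter
from math import log2


def ngrams(seq, n):
    """Return the set of distinct n-grams (substrings of length n) in seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def entropy(pos, neg):
    """Entropy in bits of a two-class split with the given counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h


def information_gain(sequences, labels, n):
    """Score every n-gram by information gain about target-class membership.

    labels[i] is True when sequences[i] belongs to the target class.
    (Hypothetical helper; the scoring model is an assumption of this sketch.)
    """
    total = len(sequences)
    n_pos = sum(labels)
    base = entropy(n_pos, total - n_pos)

    present_all, present_pos = Counter(), Counter()
    for seq, is_target in zip(sequences, labels):
        for gram in ngrams(seq, n):
            present_all[gram] += 1
            if is_target:
                present_pos[gram] += 1

    scores = {}
    for gram, with_g in present_all.items():
        pos_with = present_pos[gram]
        without = total - with_g
        pos_without = n_pos - pos_with
        # Conditional entropy of the class given presence/absence of the n-gram.
        cond = (with_g / total) * entropy(pos_with, with_g - pos_with)
        cond += (without / total) * entropy(pos_without, without - pos_without)
        scores[gram] = base - cond
    return scores


def top_signatures(sequences, labels, n, k):
    """Return the k highest-scoring n-grams, ties broken alphabetically."""
    scores = information_gain(sequences, labels, n)
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


if __name__ == "__main__":
    # Toy data, invented for illustration; not real viral sequences.
    seqs = ["ACGTACGT", "ACGTTTGA", "GGCCGGCC", "TTGGCCAA"]
    labels = [True, True, False, False]
    print(top_signatures(seqs, labels, n=3, k=2))
```

On this toy data, "ACG" occurs in both target sequences and in neither of the others, so it receives the maximum gain of 1 bit, while "TTG" occurs once in each class and receives a gain of 0. Very short n-grams tend to occur in nearly every sequence and so carry little gain, which is in line with the abstract's remark that signatures that are too short perform poorly.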