Recent Experiences in Parameter-Free Data Mining
Recent results supporting the usefulness of the normalized compression distance for classifying genome sequences of virus data are reported. Specifically, two problems are studied: clustering the hemagglutinin (HA) sequences of influenza virus data according to the host and subtype of the virus, and classifying dengue virus genome data with respect to their four serotypes. Hierarchical clustering is compared with spectral clustering via the kLines algorithm by Fischer and Poland (2004), and the standard compressors bzlib, ppmd, and zlib are compared with one another. Our results are very promising and show that an (almost) perfect clustering can be obtained for all the problems studied.
Keywords: Dengue Virus; Spectral Cluster; Dengue Hemorrhagic Fever; Kolmogorov Complexity; Kernel Width
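For readers unfamiliar with the normalized compression distance (NCD) named in the abstract, the following is a minimal sketch of how it is commonly computed: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length in bytes. It is not the authors' pipeline. The function names (`compressed_len`, `ncd`, `distance_matrix`) and the toy nucleotide strings are hypothetical, and Python's standard `zlib` and `bz2` modules stand in for the zlib and bzlib compressors; ppmd is omitted because the standard library has no equivalent.

```python
import bz2
import random
import zlib


def compressed_len(data: bytes, compressor: str = "zlib") -> int:
    """Length in bytes of `data` after compression (an approximation of C(data))."""
    if compressor == "zlib":
        return len(zlib.compress(data, 9))
    if compressor == "bz2":
        return len(bz2.compress(data, 9))
    raise ValueError(f"unknown compressor: {compressor}")


def ncd(x: bytes, y: bytes, compressor: str = "zlib") -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_len(x, compressor)
    cy = compressed_len(y, compressor)
    cxy = compressed_len(x + y, compressor)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs, compressor: str = "zlib"):
    """Pairwise NCD values, usable as input to a clustering method."""
    return [[ncd(a, b, compressor) for b in seqs] for a in seqs]


def random_sequence(rng: random.Random, length: int) -> bytes:
    """Toy nucleotide string (illustrative data, not real virus data)."""
    return "".join(rng.choice("ACGT") for _ in range(length)).encode()


def mutate(rng: random.Random, seq: bytes, n_changes: int) -> bytes:
    """Copy of `seq` with a few randomly chosen positions overwritten."""
    out = bytearray(seq)
    for _ in range(n_changes):
        out[rng.randrange(len(out))] = ord(rng.choice("ACGT"))
    return bytes(out)


if __name__ == "__main__":
    rng = random.Random(0)
    a = random_sequence(rng, 1000)
    a_variant = mutate(rng, a, 10)   # closely related to a
    b = random_sequence(rng, 1000)   # unrelated to a
    for row in distance_matrix([a, a_variant, b]):
        print(["%.3f" % d for d in row])
```

Smaller values indicate more shared structure, so the related pair should score lower than the unrelated pair. The resulting matrix is what a hierarchical or spectral clustering step would take as input.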
- 3. D. S. Burke, G. Kuno, and T. P. Monath. Flaviviruses. In D. M. Knipe and P. M. Howley et al., editors, Fields’ Virology, pages 1153–1252. Lippincott Williams & Wilkins, Philadelphia, fifth edition, 2007.
- 4. R. Cilibrasi. The CompLearn Toolkit, 2003-. http://www.complearn.org/.
- 5. R. Cilibrasi and P. Vitányi. Automatic meaning discovery using Google. Manuscript, CWI, Amsterdam, 2006.
- 6. R. Cilibrasi and P. Vitányi. Similarity of objects and the meaning of words. In Theory and Applications of Models of Computation, Third International Conference, TAMC 2006, Beijing, China, May 2006, Proceedings, volume 3959 of Lecture Notes in Computer Science, pages 21–45, Berlin, 2006. Springer.
- 7. R. Cilibrasi and P. M. Vitányi. A new quartet tree heuristic for hierarchical clustering. In D. V. Arnold, T. Jansen, M. D. Vose, and J. E. Rowe, editors, Theory of Evolutionary Algorithms, number 06061 in Dagstuhl Seminar Proceedings. Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany, 2006.
- 9. I. Fischer and J. Poland. New methods for spectral clustering. Technical Report IDSIA-12-04, IDSIA/USI-SUPSI, Manno, Switzerland, 2004.
- 10. S. B. Halstead. Pathogenesis of dengue: Challenges to molecular biology. Science, 239(4839):476–481, 1988.
- 11. K. Ito, T. Zeugmann, and Y. Zhu. Clustering the normalized compression distance for influenza virus data. In T. Elomaa, H. Mannila, and P. Orponen, editors, Algorithms and Applications, Essays Dedicated to Esko Ukkonen on the Occasion of His 60th Birthday, volume 6060 of Lecture Notes in Computer Science, pages 130–146. Springer, Heidelberg, 2010.
- 12. E. Keogh, S. Lonardi, and C. A. Ratanamahatana. Towards parameter-free data mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 206–215. ACM Press, 2004.
- 14. M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer, 3rd edition, 2008.
- 15. National Center for Biotechnology Information. Influenza Virus Resource, information, search and analysis. http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html.
- 16. P. Palese and M. L. Shaw. Orthomyxoviridae: The viruses and their replication. In D. M. Knipe and P. M. Howley et al., editors, Fields’ Virology, pages 1647–1689. Lippincott Williams & Wilkins, Philadelphia, fifth edition, 2007.
- 17. P. M. B. Vitányi, F. J. Balbach, R. L. Cilibrasi, and M. Li. Normalized information distance. In Information Theory and Statistical Learning, pages 45–82. Springer, New York, 2008.
- 18. P. F. Wright, G. Neumann, and Y. Kawaoka. Orthomyxoviruses. In D. M. Knipe and P. M. Howley et al., editors, Fields’ Virology, pages 1691–1740. Lippincott Williams & Wilkins, Philadelphia, fifth edition, 2007.