Recent Experiences in Parameter-Free Data Mining

  • Kimihito Ito
  • Thomas Zeugmann
  • Yu Zhu
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 62)

Abstract

Recent results supporting the usefulness of the normalized compression distance for classifying virus genome sequences are reported. Specifically, two problems are studied: clustering the hemagglutinin (HA) sequences of influenza virus data according to the host and subtype of the virus, and classifying dengue virus genome data with respect to their four serotypes. Hierarchical clustering is compared with spectral clustering via the kLine algorithm by Fischer and Poland (2004), and the standard compressors bzlib, ppmd, and zlib are compared with one another. Our results are very promising and show that one can obtain an (almost) perfect clustering for all the problems studied.
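
As a rough illustration of the quantity the abstract refers to, the normalized compression distance of Cilibrasi and Vitányi (reference 8) is NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, where C gives the compressed length. The sketch below is not the authors' pipeline. It assumes Python's standard zlib and bz2 modules as stand-ins for the zlib and bzlib compressors (ppmd is omitted because the standard library has no binding for it), and the function name ncd and the toy sequences are hypothetical.

```python
import bz2
import random
import zlib


def ncd(x: bytes, y: bytes, compress=zlib.compress) -> float:
    """Normalized compression distance under the given compressor."""
    cx = len(compress(x))        # C(x)
    cy = len(compress(y))        # C(y)
    cxy = len(compress(x + y))   # C(xy), the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy nucleotide strings (assumption: made up for illustration, not virus data).
seq_a = b"ATGAAGGCAATACTAGTAGTTCTGCTATATACATTTGCAACCGCAAATGCA" * 20
rng = random.Random(0)
seq_c = bytes(rng.choice(b"ACGT") for _ in range(len(seq_a)))

# Distance of a sequence to itself versus to an unrelated random sequence.
d_same_zlib = ncd(seq_a, seq_a)
d_diff_zlib = ncd(seq_a, seq_c)
d_same_bz2 = ncd(seq_a, seq_a, compress=bz2.compress)
d_diff_bz2 = ncd(seq_a, seq_c, compress=bz2.compress)

print(f"zlib: same={d_same_zlib:.3f} different={d_diff_zlib:.3f}")
print(f"bz2:  same={d_same_bz2:.3f} different={d_diff_bz2:.3f}")
```

Computing ncd for every pair of sequences yields a distance matrix, which is the kind of input that hierarchical or spectral clustering would then operate on.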

Keywords

Dengue Virus; Spectral Cluster; Dengue Hemorrhagic Fever; Kolmogorov Complexity; Kernel Width
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. Phys. Rev. Lett., 88(4):048702-1-048702-4, 2002.
  2. C. H. Bennett, P. Gács, M. Li, P. M. B. Vitányi, and W. H. Zurek. Information distance. IEEE Transactions on Information Theory, 44(4):1407–1423, 1998.
  3. D. S. Burke, G. Kuno, and T. P. Monath. Flaviviruses. In D. M. Knipe and P. M. Howley et al., editors, Fields’ Virology, pages 1153–1252. Lippincott Williams & Wilkins, Philadelphia, fifth edition, 2007.
  4. R. Cilibrasi. The CompLearn Toolkit, 2003-. http://www.complearn.org/.
  5. R. Cilibrasi and P. Vitányi. Automatic meaning discovery using Google. Manuscript, CWI, Amsterdam, 2006.
  6. R. Cilibrasi and P. Vitányi. Similarity of objects and the meaning of words. In Theory and Applications of Models of Computation, Third International Conference, TAMC 2006, Beijing, China, May 2006, Proceedings, volume 3959 of Lecture Notes in Computer Science, pages 21–45, Berlin, 2006. Springer.
  7. R. Cilibrasi and P. M. Vitányi. A new quartet tree heuristic for hierarchical clustering. In D. V. Arnold, T. Jansen, M. D. Vose, and J. E. Rowe, editors, Theory of Evolutionary Algorithms, number 06061 in Dagstuhl Seminar Proceedings. Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany, 2006.
  8. R. Cilibrasi and P. M. B. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005.
  9. I. Fischer and J. Poland. New methods for spectral clustering. Technical Report IDSIA-12-04, IDSIA/USI-SUPSI, Manno, Switzerland, 2004.
  10. S. B. Halstead. Pathogenesis of dengue: Challenges to molecular biology. Science, 239(4839):476–481, 1988.
  11. K. Ito, T. Zeugmann, and Y. Zhu. Clustering the normalized compression distance for influenza virus data. In T. Elomaa, H. Mannila, and P. Orponen, editors, Algorithms and Applications, Essays Dedicated to Esko Ukkonen on the Occasion of His 60th Birthday, volume 6060 of Lecture Notes in Computer Science, pages 130–146. Springer, Heidelberg, 2010.
  12. E. Keogh, S. Lonardi, and C. A. Ratanamahatana. Towards parameter-free data mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 206–215. ACM Press, 2004.
  13. M. Li, X. Chen, X. Li, B. Ma, and P. M. Vitányi. The similarity metric. IEEE Transactions on Information Theory, 50(12):3250–3264, 2004.
  14. M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer, 3rd edition, 2008.
  15. National Center for Biotechnology Information. Influenza Virus Resource, information, search and analysis. http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html.
  16. P. Palese and M. L. Shaw. Orthomyxoviridae: The viruses and their replication. In D. M. Knipe and P. M. Howley et al., editors, Fields’ Virology, pages 1647–1689. Lippincott Williams & Wilkins, Philadelphia, fifth edition, 2007.
  17. P. M. B. Vitányi, F. J. Balbach, R. L. Cilibrasi, and M. Li. Normalized information distance. In Information Theory and Statistical Learning, pages 45–82. Springer, New York, 2008.
  18. P. F. Wright, G. Neumann, and Y. Kawaoka. Orthomyxoviruses. In D. M. Knipe and P. M. Howley et al., editors, Fields’ Virology, pages 1691–1740. Lippincott Williams & Wilkins, Philadelphia, fifth edition, 2007.

Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. Research Center for Zoonosis Control, Hokkaido University, Sapporo, Japan
  2. Division of Computer Science, Hokkaido University, Sapporo, Japan
