Clustering the Normalized Compression Distance for Influenza Virus Data

  • Kimihito Ito
  • Thomas Zeugmann
  • Yu Zhu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6060)


The present paper analyzes the usefulness of the normalized compression distance for clustering the hemagglutinin (HA) gene sequences of influenza viruses, and how the results depend on the compressor used. Using the CompLearn Toolkit, the built-in compressors zlib and bzip2 are compared.
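As a rough illustration of the quantity being compared, the normalized compression distance of two strings x and y is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed length. The sketch below uses Python's standard zlib and bz2 modules rather than the CompLearn Toolkit, so it is not the authors' pipeline; the helper names and the randomly generated toy sequences are assumptions made purely for illustration and are not influenza data.

```python
import bz2
import random
import zlib


def compressed_size(data: bytes, compressor: str = "zlib") -> int:
    """Return the length in bytes of `data` after compression."""
    if compressor == "zlib":
        return len(zlib.compress(data, 9))
    if compressor == "bzip2":
        return len(bz2.compress(data, 9))
    raise ValueError(f"unknown compressor: {compressor}")


def ncd(x: bytes, y: bytes, compressor: str = "zlib") -> float:
    """Normalized compression distance between x and y."""
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs, compressor="zlib"):
    """Pairwise NCD values as a list of lists."""
    n = len(seqs)
    return [[ncd(seqs[i], seqs[j], compressor) for j in range(n)]
            for i in range(n)]


# Hypothetical toy sequences (NOT real HA data): `a` is random, `b` is `a`
# with roughly 2% of positions resampled, `c` is an unrelated random sequence.
rng = random.Random(0)
a = "".join(rng.choice("ACGT") for _ in range(1000)).encode()
b = bytes(rng.choice(b"ACGT") if rng.random() < 0.02 else ch for ch in a)
c = "".join(rng.choice("ACGT") for _ in range(1000)).encode()

matrix = distance_matrix([a, b, c], compressor="zlib")
for row in matrix:
    print(["%.3f" % value for value in row])
```

Passing `compressor="bzip2"` to `distance_matrix` gives the corresponding matrix for the second compressor, which is the kind of side-by-side comparison the abstract describes.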

Moreover, hierarchical and spectral clustering are compared. For the hierarchical clustering, the hclust function of the R statistical computing environment is used, and the spectral clustering is done via the kLines algorithm proposed by Fischer and Poland (2004).
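To make the hierarchical side of this comparison concrete, the following is a minimal pure-Python sketch of agglomerative clustering on a precomputed distance matrix. It assumes complete linkage, which is the default method of R's hclust; the abstract does not say which linkage the authors used. The function name and the small hand-made matrix are hypothetical, and the sketch does not cover the kLines spectral method.

```python
def complete_linkage(dist, k):
    """Merge clusters bottom-up until k remain; return lists of item indices."""
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of items")
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is the
                # largest pairwise distance between their members.
                d = max(dist[i][j] for i in clusters[p] for j in clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]  # q > p, so index p is unaffected by the deletion
    return clusters


# Hypothetical symmetric distance matrix: items 0 and 1 are close to each
# other, as are items 2 and 3, while the two pairs are far apart.
D = [
    [0.00, 0.20, 0.90, 0.80],
    [0.20, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.10],
    [0.80, 0.90, 0.10, 0.00],
]
groups = complete_linkage(D, 2)
print(groups)
```

The naive search over all cluster pairs is cubic or worse in the number of items, which is acceptable for a sketch but not for large sequence collections; a library routine such as hclust would be the practical choice.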

Our results are very promising and show that one can obtain an (almost) perfect clustering. The zlib compressor gave better results than the bzip2 compressor and, when all data are considered, hierarchical clustering is slightly better than spectral clustering via kLines.


Keywords (added by machine, not by the authors): Hierarchical Cluster, Distance Matrix, Spectral Cluster, Kolmogorov Complexity, Language Tree


References

  1.
  2. The R project for statistical computing,
  3. Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Rev. Lett. 88(4), 048702–1–048702–4 (2002)
  4. Bennett, C.H., Gács, P., Li, M., Vitányi, P.M.B., Zurek, W.H.: Information distance. IEEE Transactions on Information Theory 44(4), 1407–1423 (1998)
  5. Cilibrasi, R.: The CompLearn Toolkit (2003),
  6. Cilibrasi, R., Vitányi, P.M.B.: Automatic meaning discovery using Google. CWI, Amsterdam (2006)
  7. Cilibrasi, R., Vitányi, P.M.B.: Similarity of objects and the meaning of words. In: Cai, J.-Y., Cooper, S.B., Li, A. (eds.) TAMC 2006. LNCS, vol. 3959, pp. 21–45. Springer, Heidelberg (2006)
  8. Cilibrasi, R., Vitányi, P.M.B.: A new quartet tree heuristic for hierarchical clustering. In: Arnold, D.V., Jansen, T., Vose, M.D., Rowe, J.E. (eds.) Theory of Evolutionary Algorithms. Dagstuhl Seminar Proceedings, Schloss Dagstuhl, Germany. Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), vol. (06061) (2006)
  9. Cilibrasi, R., Vitányi, P.M.B.: Clustering by compression. IEEE Transactions on Information Theory 51(4), 1523–1545 (2005)
  10. Fischer, I., Poland, J.: New methods for spectral clustering. Technical Report IDSIA-12-04, IDSIA/USI-SUPSI, Manno, Switzerland (2004)
  11. Keogh, E., Lonardi, S., Ratanamahatana, C.A.: Towards parameter-free data mining. In: KDD 2004: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 206–215. ACM Press, New York (2004)
  12. Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Transactions on Information Theory 50(12), 3250–3264 (2004)
  13. Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and its Applications, 3rd edn. Springer, Heidelberg (2008)
  14. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
  15. National Center for Biotechnology Information. Influenza Virus Resource, information, search and analysis,
  16. Palese, P., Shaw, M.L.: Orthomyxoviridae: The viruses and their replication. In: Knipe, D.M., Howley, P.M., et al. (eds.) Fields Virology, 5th edn., pp. 1647–1689. Lippincott Williams & Wilkins, Philadelphia (2007)
  17. Perona, P., Freeman, W.: A factorization approach to grouping. In: Burkhardt, H., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 655–670. Springer, Heidelberg (1998)
  18. Poland, J., Zeugmann, T.: Clustering pairwise distances with missing data: Maximum cuts versus normalized cuts. In: Todorovski, L., Lavrač, N., Jantke, K.P. (eds.) DS 2006. LNCS (LNAI), vol. 4265, pp. 197–208. Springer, Heidelberg (2006)
  19. Poland, J., Zeugmann, T.: Clustering the Google distance with eigenvectors and semidefinite programming. In: Knowledge Media Technologies, First International Core-to-Core Workshop. Diskussionsbeiträge, Institut für Medien und Kommunikationswissenschaft, vol. 21, pp. 61–69. Technische Universität Ilmenau (2006)
  20. Spielman, D.A., Teng, S.-H.: Spectral partitioning works: Planar graphs and finite element meshes. In: Proceedings of the 37th Annual IEEE Conference on Foundations of Computer Science, pp. 96–105. IEEE Computer Society, Los Alamitos (1996)
  21. Vitányi, P.M.B., Balbach, F.J., Cilibrasi, R.L., Li, M.: Normalized information distance. In: Information Theory and Statistical Learning, pp. 45–82. Springer, New York (2008)
  22. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416 (2007)
  23. Wright, P.F., Neumann, G., Kawaoka, Y.: Orthomyxoviruses. In: Knipe, D.M., Howley, P.M., et al. (eds.) Fields Virology, 5th edn., pp. 1691–1740. Lippincott Williams & Wilkins, Philadelphia (2007)
  24. Yu, S.X., Shi, J.: Multiclass spectral clustering. In: Proceedings of the Ninth IEEE International Conference on Computer Vision, vol. 2, pp. 313–319. IEEE Computer Society, Los Alamitos (2003)

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Kimihito Ito (1)
  • Thomas Zeugmann (2)
  • Yu Zhu (2)

  1. Research Center for Zoonosis Control, Hokkaido University, Sapporo, Japan
  2. Division of Computer Science, Hokkaido University, Japan
