Abstract
This paper analyzes how useful the normalized compression distance is for clustering hemagglutinin (HA) gene sequences of influenza viruses, and how the outcome depends on the compressor used. Using the CompLearn Toolkit, its built-in compressors zlib and bzip2 are compared.
Moreover, hierarchical clustering is compared with spectral clustering. Hierarchical clustering is performed with hclust from R, and spectral clustering with the kLines algorithm proposed by Fischer and Poland (2004).
Our results are very promising and show that an (almost) perfect clustering can be obtained. The zlib compressor gave better results than the bzip2 compressor and, when all data are considered, hierarchical clustering is slightly better than spectral clustering via kLines.
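As an illustration of the distance the abstract refers to, the following is a minimal sketch of the normalized compression distance as defined by Cilibrasi and Vitányi (2005), NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}, where C denotes compressed length. It uses Python's standard zlib and bz2 modules as stand-ins for the two compressors; it is not the CompLearn Toolkit's implementation, and the helper names and the randomly generated toy sequences are assumptions made for illustration only, not real HA data.

```python
import bz2
import random
import zlib


def compressed_size(data: bytes, compressor: str = "zlib") -> int:
    """Length in bytes of `data` after compression with the named compressor."""
    if compressor == "zlib":
        return len(zlib.compress(data, 9))
    if compressor == "bzip2":
        return len(bz2.compress(data, 9))
    raise ValueError(f"unknown compressor: {compressor}")


def ncd(x: bytes, y: bytes, compressor: str = "zlib") -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs, compressor="zlib"):
    """Pairwise NCD values; real compressors make this only approximately symmetric."""
    n = len(seqs)
    return [[ncd(seqs[i], seqs[j], compressor) for j in range(n)] for i in range(n)]


def toy_sequence(length: int, seed: int) -> bytes:
    """Hypothetical stand-in for a nucleotide sequence (random, not real HA data)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")


def mutate(seq: bytes, n_changes: int, seed: int) -> bytes:
    """Copy of `seq` with `n_changes` random positions overwritten by a random base."""
    rng = random.Random(seed)
    out = bytearray(seq)
    for pos in rng.sample(range(len(out)), n_changes):
        out[pos] = rng.choice(b"ACGT")
    return bytes(out)


# Toy data: b is a lightly mutated copy of a, c is unrelated to both.
a = toy_sequence(2000, seed=0)
b = mutate(a, 20, seed=1)
c = toy_sequence(2000, seed=2)

for name in ("zlib", "bzip2"):
    d = distance_matrix([a, b, c], name)
    print(name, [[round(v, 3) for v in row] for row in d])
```

A matrix of this kind is what a hierarchical or spectral clustering routine would take as input. Because practical compressors only approximate the ideal compressed length, the diagonal is generally not exactly zero and the values can differ noticeably between compressors.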
References
GNU Octave, http://www.gnu.org/software/octave/
The R project for statistical computing, http://www.r-project.org/
Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Rev. Lett. 88(4), 048702–1–048702–4 (2002)
Bennett, C.H., Gács, P., Li, M., Vitányi, P.M.B., Zurek, W.H.: Information distance. IEEE Transactions on Information Theory 44(4), 1407–1423 (1998)
Cilibrasi, R.: The CompLearn Toolkit (2003), http://www.complearn.org/
Cilibrasi, R., Vitányi, P.M.B.: Automatic meaning discovery using Google. CWI, Amsterdam (2006)
Cilibrasi, R., Vitányi, P.M.B.: Similarity of objects and the meaning of words. In: Cai, J.-Y., Cooper, S.B., Li, A. (eds.) TAMC 2006. LNCS, vol. 3959, pp. 21–45. Springer, Heidelberg (2006)
Cilibrasi, R., Vitányi, P.M.B.: A new quartet tree heuristic for hierarchical clustering. In: Arnold, D.V., Jansen, T., Vose, M.D., Rowe, J.E. (eds.) Theory of Evolutionary Algorithms. Dagstuhl Seminar Proceedings, Schloss Dagstuhl, Germany. Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), vol. (06061) (2006)
Cilibrasi, R., Vitányi, P.M.B.: Clustering by compression. IEEE Transactions on Information Theory 51(4), 1523–1545 (2005)
Fischer, I., Poland, J.: New methods for spectral clustering. Technical Report IDSIA-12-04, IDSIA/USI-SUPSI, Manno, Switzerland (2004)
Keogh, E., Lonardi, S., Ratanamahatana, C.A.: Towards parameter-free data mining. In: KDD 2004: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 206–215. ACM Press, New York (2004)
Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Transactions on Information Theory 50(12), 3250–3264 (2004)
Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and its Applications, 3rd edn. Springer, Heidelberg (2008)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
National Center for Biotechnology Information. Influenza Virus Resource, information, search and analysis, http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html
Palese, P., Shaw, M.L.: Orthomyxoviridae: The viruses and their replication. In: Knipe, D.M., Howley, P.M., et al. (eds.) Fields Virology, 5th edn., pp. 1647–1689. Lippincott Williams & Wilkins, Philadelphia (2007)
Perona, P., Freeman, W.: A factorization approach to grouping. In: Burkhardt, H., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 655–670. Springer, Heidelberg (1998)
Poland, J., Zeugmann, T.: Clustering pairwise distances with missing data: Maximum cuts versus normalized cuts. In: Todorovski, L., Lavrač, N., Jantke, K.P. (eds.) DS 2006. LNCS (LNAI), vol. 4265, pp. 197–208. Springer, Heidelberg (2006)
Poland, J., Zeugmann, T.: Clustering the Google distance with eigenvectors and semidefinite programming. In: Knowledge Media Technologies, First International Core-to-Core Workshop. Diskussionsbeiträge, Institut für Medien und Kommunikationswissenschaft, vol. 21, pp. 61–69. Technische Universität Ilmenau (2006)
Spielman, D.A., Teng, S.-H.: Spectral partitioning works: Planar graphs and finite element meshes. In: Proceedings of the 37th Annual IEEE Conference on Foundations of Computer Science, pp. 96–105. IEEE Computer Society, Los Alamitos (1996)
Vitányi, P.M.B., Balbach, F.J., Cilibrasi, R.L., Li, M.: Normalized information distance. In: Information Theory and Statistical Learning, pp. 45–82. Springer, New York (2008)
von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416 (2007)
Wright, P.F., Neumann, G., Kawaoka, Y.: Orthomyxoviruses. In: Knipe, D.M., Howley, P.M., et al. (eds.) Fields Virology, 5th edn., pp. 1691–1740. Lippincott Williams & Wilkins, Philadelphia (2007)
Yu, S.X., Shi, J.: Multiclass spectral clustering. In: Proceedings of the Ninth IEEE International Conference on Computer Vision, vol. 2, pp. 313–319. IEEE Computer Society, Los Alamitos (2003)
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Ito, K., Zeugmann, T., Zhu, Y. (2010). Clustering the Normalized Compression Distance for Influenza Virus Data. In: Elomaa, T., Mannila, H., Orponen, P. (eds) Algorithms and Applications. Lecture Notes in Computer Science, vol 6060. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12476-1_9
DOI: https://doi.org/10.1007/978-3-642-12476-1_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12475-4
Online ISBN: 978-3-642-12476-1
eBook Packages: Computer Science, Computer Science (R0)