Information Distance and Its Extensions

Part of the Lecture Notes in Computer Science book series (LNCS, volume 6926)

Abstract

Consider, in the most general sense, the space of all information carrying objects: a book, an article, a name, a definition, a genome, a letter, an image, an email, a webpage, a Google query, an answer, a movie, a music score, a Facebook blog, a short message, or even an abstract concept. Over the past 20 years, we have been developing a general theory of information distance in this space and applications of this theory. The theory is object-independent and application-independent. The theory is also unique, in the sense that no other theory is “better”. During the past 10 years, such a theory has found many applications. Recently we have introduced two extensions to this theory concerning multiple objects and irrelevant information. This expository article will focus on explaining the main ideas behind this theory, especially these recent extensions, and their applications. We will also discuss some very preliminary applications.

Keywords

Triangle Inequality Irrelevant Information Information Distance Question Answering Kolmogorov Complexity 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Ané, C., Sanderson, M.J.: Missing the forest for the trees: Phylogenetic compression and its implications for inferring complex evolutionary histories. Systematic Biology 54(1), 146–157 (2005)CrossRefGoogle Scholar
  2. 2.
    Arbuckle, T., Balaban, A., Peters, D.K., Lawford, M.: Software documents: comparison and measurement. In: Proc. 18 Int’l Conf. on Software Engineering and Knowledge Engineering 2007 (SEKE 2007), pp. 740–745 (2007)Google Scholar
  3. 3.
    Arbuckle, T.: Studying software evolution using artefacts’ shared information content. Sci. of Comput. Programming 76(2), 1078–1097 (2011)CrossRefGoogle Scholar
  4. 4.
    Bennett, C.H., Gács, P., Li, M., Vitányi, P., Zurek, W.: Information Distance. IEEE Trans. Inform. Theory 44(4), 1407–1423 (1993) (STOC 1993)MathSciNetCrossRefMATHGoogle Scholar
  5. 5.
    Bennett, C.H., Li, M., Ma, B.: Chain letters and evolutionary histories. Scientific American 288(6), 76–81 (2003) (feature article)CrossRefGoogle Scholar
  6. 6.
    Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Rev. Lett. 88(4), 048702 (2002)CrossRefGoogle Scholar
  7. 7.
    Bu, F., Zhu, X., Li, M.: A new multiword expression metric and its applications. J. Comput. Sci. Tech. 26(1), 3–13 (2011); also in COLING 2010CrossRefMATHGoogle Scholar
  8. 8.
    Chen, X., Francia, B., Li, M., Mckinnon, B., Seker, A.: Shared information and program plagiarism detection. IEEE Trans. Information Theory 50(7), 1545–1550 (2004)MathSciNetCrossRefMATHGoogle Scholar
  9. 9.
    Cilibrasi, R., Vitányi, P., de Wolf Algorithmic, R.: clustring of music based on string compression. Comput. Music J. 28(4), 49–67 (2004)CrossRefGoogle Scholar
  10. 10.
    Cilibrasi, R., Vitányi, P.: Automatic semantics using Google (2005) (manuscript), http://arxiv.org/abs/cs.CL/0412098 (2004)
  11. 11.
    Cilibrasi, R., Vitányi, P.: Clustering by compression. IEEE Trans. Inform. Theory 51(4), 1523–1545 (2005)MathSciNetCrossRefMATHGoogle Scholar
  12. 12.
    Cuturi, M., Vert, J.P.: The context-tree kernel for strings. Neural Networks 18(4), 1111–1123 (2005)CrossRefGoogle Scholar
  13. 13.
    Emanuel, K., Ravela, S., Vivant, E., Risi, C.: A combined statistical-deterministic approach of hurricane risk assessment. In: Program in Atmospheres, Oceans, and Climate. MIT, Cambridge (2005) (manuscript)Google Scholar
  14. 14.
    Fagin, R., Stockmeyer, L.: Relaxing the triangle inequality in pattern matching. Int’l J. Comput. Vision 28(3), 219–231 (1998)CrossRefGoogle Scholar
  15. 15.
    Kirk, S.R., Jenkins, S.: Information theory-baed software metrics and obfuscation. J. Systems and Software 72, 179–186 (2004)CrossRefGoogle Scholar
  16. 16.
    Keogh, E., Lonardi, S., Ratanamahatana, C.A.: Towards parameter-free data mining. In: KDD 2004, pp. 206–215 (2004)Google Scholar
  17. 17.
    Kocsor, A., Kertesz-Farkas, A., Kajan, L., Pongor, S.: Application of compression-based distance measures to protein sequence classification: a methodology study. Bioinformatics 22(4), 407–412 (2006)CrossRefGoogle Scholar
  18. 18.
    Krasnogor, N., Pelta, D.A.: Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics 20(7), 1015–1021 (2004)CrossRefGoogle Scholar
  19. 19.
    Li, M., Badger, J., Chen, X., Kwong, S., Kearney, P., Zhang, H.: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17(2), 149–154 (2001)CrossRefGoogle Scholar
  20. 20.
    Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.: The similarity metric. IEEE Trans. Information Theory 50(12), 3250–3264 (2004)MathSciNetCrossRefMATHGoogle Scholar
  21. 21.
    Li, M.: Information distance and its applications. Int’l J. Found. Comput. Sci. 18(4), 669–681 (2007)MathSciNetCrossRefMATHGoogle Scholar
  22. 22.
    Li, M., Ma, B.: Notes on information distance among many entities, March 23 (2008) (unpublished notes)Google Scholar
  23. 23.
    Li, M., Tang, Y., Wang, D.: Information distance between what I said and what it heard (manuscript, 2011)Google Scholar
  24. 24.
    Li, M., Vitányi, P.: An introduction to Kolmogorov complexity and its applications, 3rd edn. Springer, Heidelberg (2008)CrossRefMATHGoogle Scholar
  25. 25.
    Long, C., Zhu, X.Y., Li, M., Ma, B.: Information shared by many objects. In: ACM 17th Conf. Info. and Knowledge Management (CIKM 2008), Napa Valley, California, October 26-30 (2008)Google Scholar
  26. 26.
    Long, C., Huang, M., Zhu, X., Li, M.: Multi-document summarization by information distance. In: IEEE Int’l Conf. Data Mining, 2009 (ICDM 2009), Miami, Florida, December 6-9 (2009)Google Scholar
  27. 27.
    Nikvand, N., Wang, Z.: Generic image similarity based on Kolmogorov complexity. In: IEEE Int’l Conf. Image Processing, Hong Kong, China, September 26-29 (2010)Google Scholar
  28. 28.
    Nykter, M., Price, N.D., Larjo, A., Aho, T., Kauffman, S.A., Yli-Harja, O., Shmulevich, I.: Critical networks exhibit maximal information diversity in structure-dynamics relationships. Phy. Rev. Lett. 100, 058702(4) (2008)Google Scholar
  29. 29.
    Nykter, M., Price, N.D., Aldana, M., Ramsey, S.A., Kauffman, S.A., Hood, L.E., Yli-Harja, O., Shmulevich, I.: Gene expression dynamics in the macrophage exhibit criticality. Proc. Nat. Acad. Sci. USA 105(6), 1897–1900 (2008)CrossRefGoogle Scholar
  30. 30.
    Otu, H.H., Sayood, K.: A new sequence distance measure for phylogenetic tree construction. Bioinformatics 19(6), 2122–2130 (2003)CrossRefGoogle Scholar
  31. 31.
    Pao, H.K., Case, J.: Computing entropy for ortholog detection. In: Int’l Conf. Comput. Intell., Istanbul, Turkey, December 17-19 (2004)Google Scholar
  32. 32.
    Parry, D.: Use of Kolmogorov distance identification of web page authorship, topic and domain. In: Workshop on Open Source Web Inf. Retrieval (2005), http://www.emse.fr/OSWIR05/
  33. 33.
    Costa Santos, C., Bernardes, J., Vitányi, P., Antunes, L.: Clustering fetal heart rate tracings by compression. In: Proc. 19th IEEE Intn’l Symp. Computer-Based Medical Systems, Salt Lake City, Utah, June 22-23 (2006)Google Scholar
  34. 34.
    Taha, W., Crosby, S., Swadi, K.: A new approach to data mining for software design, Rice Univ. (2006) (manuscript)Google Scholar
  35. 35.
    Varre, J.S., Delahaye, J.P., Rivals, E.: Transformation distances: a family of dissimilarity measures based on movements of segments. Bioinformatics 15(3), 194–202 (1999)CrossRefGoogle Scholar
  36. 36.
    Veltkamp, R.C.: Shape Matching: Similarity Measures and Algorithms. In: Proc. Int ’l Conf. Shape Modeling Applications, Italy, pp. 188–197 (2001) (invited talk)Google Scholar
  37. 37.
    Vitanyi, P.M.B.: Information distance in multiples. IEEE Trans. Inform. Theory 57(4), 2451–2456 (2011)MathSciNetCrossRefGoogle Scholar
  38. 38.
    Wehner, S.: Analyzing worms and network traffice using compression. J. Comput. Security 15(3), 303–320 (2007)CrossRefGoogle Scholar
  39. 39.
    Zhang, X., Hao, Y., Zhu, X., Li, M.: Information distance from a question to an answer. In: 13th ACM SIGKDD Int’l Conf. Knowledge Discovery Data Mining, San Jose, CA, August 12-15 (2007)Google Scholar
  40. 40.
    Zhang, X., Hao, Y., Zhu, X.Y., Li, M.: New information measure and its application in question answering system. J. Comput. Sci. Tech. 23(4), 557–572 (2008); This is the final version of [39]MathSciNetCrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Ming Li
    • 1
  1. 1.School of Computer ScienceUniversity of WaterlooWaterlooCanada

Personalised recommendations