Partial Match Distance

In Memoriam Ray Solomonoff 1926-2009
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7070)

Abstract

In this expository article, we discuss a notion of information distance for partial matches, complementary to the standard one. The information distance D_max(x,y) = max{K(x|y), K(y|x)} measures the distance between two objects as the minimum number of bits needed to convert either one into the other. In many applications, however, one object contains a large amount of irrelevant information that overwhelms the relevant information we are interested in; a distance based on partial information should also not satisfy the triangle inequality. To deal with such problems, we have introduced an information distance for partial matching. The new notion turns out to be precisely D_min(x,y) = min{K(x|y), K(y|x)}, complementary to the D_max distance. We conclude with some recent applications of the D_min distance.
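Kolmogorov complexity is uncomputable, so in practice both distances are approximated with a real compressor, in the spirit of compression-based distances such as the normalized compression distance. The sketch below is an illustration only, not the authors' method: it uses zlib and the common approximation K(x|y) ≈ C(yx) − C(y), where C denotes compressed length.

```python
import zlib

def C(s: bytes) -> int:
    """Approximate Kolmogorov complexity K(s) by compressed length."""
    return len(zlib.compress(s, 9))

def cond(x: bytes, y: bytes) -> int:
    """Approximate conditional complexity K(x|y) as C(yx) - C(y)."""
    return max(C(y + x) - C(y), 0)

def d_max(x: bytes, y: bytes) -> int:
    """Information distance D_max(x,y) = max{K(x|y), K(y|x)}."""
    return max(cond(x, y), cond(y, x))

def d_min(x: bytes, y: bytes) -> int:
    """Partial-match distance D_min(x,y) = min{K(x|y), K(y|x)}."""
    return min(cond(x, y), cond(y, x))

# A short query buried in a long document: D_min stays small (the query
# is easy to describe given the document), while D_max is dominated by
# the document's irrelevant bulk.
query = b"information distance"
document = b"lorem ipsum " * 200 + query + b" dolor sit amet " * 200
print(d_min(query, document) <= d_max(query, document))  # True by definition
```

This mirrors why D_min suits partial matching: a small query that appears inside a large object yields a small K(query|object) even when K(object|query) is huge.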

Keywords

Irrelevant information; partial match; information distance; question answering; Kolmogorov complexity



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Ming Li, School of Computer Science, University of Waterloo, Waterloo, Canada
