Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, pp. 55–64
Partial Match Distance
Abstract
In this expository article, we discuss a complementary notion of information distance for partial matches. The information distance D_max(x, y) = max{K(x|y), K(y|x)} measures the distance between two objects by the minimum number of bits needed to convert one into the other. In many applications, however, some objects contain so much irrelevant information that it overwhelms the relevant information we are interested in. Moreover, an information distance based on partial information should not satisfy the triangle inequality. To deal with such problems, we have introduced an information distance for partial matching. It turns out that this new notion is precisely D_min(x, y) = min{K(x|y), K(y|x)}, complementary to the D_max distance. I give some recent applications of the D_min distance.
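The quantities D_max and D_min involve the Kolmogorov complexity K, which is uncomputable; in practice, compression-based work in this area approximates K(x|y) with a real compressor. A minimal sketch in Python, where C(yx) − C(y) stands in for the conditional complexity K(x|y) (the use of zlib, the concatenation heuristic, and all function names here are illustrative assumptions, not the author's implementation):

```python
import zlib


def C(data: bytes) -> int:
    """Approximate the Kolmogorov complexity K(data) by compressed length."""
    return len(zlib.compress(data, 9))


def cond_complexity(x: bytes, y: bytes) -> int:
    """Approximate K(x|y) as C(y + x) - C(y): the extra bits needed to
    describe x once the compressor has already seen y. Clamped at 0,
    since a crude compressor can occasionally return a negative value."""
    return max(C(y + x) - C(y), 0)


def d_max(x: bytes, y: bytes) -> int:
    """Approximate D_max(x, y) = max{K(x|y), K(y|x)}."""
    return max(cond_complexity(x, y), cond_complexity(y, x))


def d_min(x: bytes, y: bytes) -> int:
    """Approximate D_min(x, y) = min{K(x|y), K(y|x)}."""
    return min(cond_complexity(x, y), cond_complexity(y, x))
```

Note how the two measures differ on a partial match: if y is x plus a large irrelevant tail, K(x|y) is small while K(y|x) is large, so d_min stays small where d_max is dominated by the irrelevant part.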
Keywords
Irrelevant Information · Partial Match · Information Distance · Question Answering · Kolmogorov Complexity