Abstract
The problem of developing a similarity index for different objects is discussed, and the limitations of current metrics are evaluated. The normalized compression distance, based on the non-computable Kolmogorov complexity, is examined and compared with two alternative measures. As a case study, a phylogenetic tree of different mammals is constructed by applying this technique to a mitochondrial DNA database.
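Because Kolmogorov complexity is non-computable, the normalized compression distance is usually approximated with a real compressor: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. The following is a minimal sketch of that idea, not the authors' implementation. It assumes Python's zlib as a stand-in compressor, and the synthetic sequences and helper names (compressed_size, ncd) are hypothetical, chosen only for illustration.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (approximates C)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic, hypothetical data: a random DNA-like string, a lightly
# mutated copy of it, and an unrelated random string.
rng = random.Random(0)
base = "".join(rng.choice("ACGT") for _ in range(2000))

mutated = list(base)
for i in rng.sample(range(len(mutated)), 20):  # 20 point substitutions
    mutated[i] = rng.choice("ACGT")

reference = base.encode()
near = "".join(mutated).encode()
far = "".join(rng.choice("ACGT") for _ in range(2000)).encode()

print("NCD(reference, near):", ncd(reference, near))
print("NCD(reference, far): ", ncd(reference, far))
```

The closely related pair should score a smaller distance than the unrelated pair. A pairwise matrix of such distances is the kind of input a tree-building method would take, although the compressor choice matters: zlib's 32 KB window limits how much of a long sequence it can exploit.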
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Antão, R., Mota, A. & Machado, J.A.T. Kolmogorov complexity as a data similarity metric: application in mitochondrial DNA. Nonlinear Dyn 93, 1059–1071 (2018). https://doi.org/10.1007/s11071-018-4245-7
DOI: https://doi.org/10.1007/s11071-018-4245-7