Kolmogorov complexity as a data similarity metric: application in mitochondrial DNA

  • Original Paper
  • Journal: Nonlinear Dynamics

Abstract

This paper addresses the problem of developing a similarity index for different objects and evaluates the limitations of current metrics. The normalized compression distance, which is based on the non-computable Kolmogorov complexity, is examined and compared with two alternative measures. As a case study, a phylogenetic tree of different mammals is constructed by applying this technique to a mitochondrial DNA database.
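
To make the normalized compression distance concrete, the following is a minimal sketch rather than the authors' implementation. It uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length produced by a real compressor standing in for the non-computable Kolmogorov complexity. The choice of zlib, the helper names, and the synthetic DNA-like strings are all assumptions made for illustration; this preview does not state which compressor or sequences the paper uses.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a stand-in for K(data))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(length: int, rng: random.Random) -> bytes:
    """Hypothetical helper: a synthetic sequence over the alphabet A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")


def mutate(seq: bytes, n_mutations: int, rng: random.Random) -> bytes:
    """Hypothetical helper: overwrite n_mutations random positions with random bases."""
    out = bytearray(seq)
    for pos in rng.sample(range(len(out)), n_mutations):
        out[pos] = rng.choice(b"ACGT")
    return bytes(out)


if __name__ == "__main__":
    rng = random.Random(0)
    ancestor = random_dna(2000, rng)
    relative = mutate(ancestor, 20, rng)   # shares most of its content with ancestor
    unrelated = random_dna(2000, rng)      # independent sequence

    print("NCD(ancestor, relative)  =", round(ncd(ancestor, relative), 3))
    print("NCD(ancestor, unrelated) =", round(ncd(ancestor, unrelated), 3))
```

A pairwise matrix of such distances is the kind of input that tree-building methods operate on. Note that zlib only finds repeats within a 32 KB window, so for longer sequences a different compressor would likely be needed.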

Author information

Corresponding author

Correspondence to J. A. Tenreiro Machado.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Antão, R., Mota, A. & Machado, J.A.T. Kolmogorov complexity as a data similarity metric: application in mitochondrial DNA. Nonlinear Dyn 93, 1059–1071 (2018). https://doi.org/10.1007/s11071-018-4245-7

  • DOI: https://doi.org/10.1007/s11071-018-4245-7
