Interpreting Shared Information Content in Software Engineering: What Does It Mean to Say Two Artefacts Are Similar and Why?

  • Tom Arbuckle
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7425)


Recent work has examined software evolution in terms of shared information content between artefacts representative of releases. These studies employ measurements based on the non-trivial concept of Kolmogorov complexity to calculate the quantity of shared information and have shown some of the promise of this characterisation of software evolution.

At the same time, however, questions and enquiries related to this work have shown that interpreting the results requires some care. What does it mean to say that two software artefacts are “similar” in an information theoretic sense? What does shared information content mean in a software engineering context?

Starting from questions that software engineers might ask about the results of this method, this paper provides answers by reviewing and referring back to the underlying mathematics. It is hoped that this will serve both to dispel myths and to encourage interest, curiosity and critique.
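
As a rough illustration of what "shared information content" can mean in practice: Kolmogorov complexity is not computable, so measurements of this kind have to rely on an approximation. The abstract does not say which estimator is used. The sketch below assumes one widely used stand-in, the normalized compression distance (NCD), in which a real compressor's output length C(x) replaces the ideal description length K(x), giving NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The choice of zlib, the function names and the sample "releases" are all illustrative assumptions, not the author's tooling.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length of zlib's output in bytes: a computable stand-in for K(data)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest that most of the information in one artefact is
    also present in the other; values near 1 suggest little is shared.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # cost of describing both artefacts together
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical artefacts: two "releases" that differ by one added function,
    # and a block of pseudo-random bytes that shares nothing with either.
    release_1 = b"int add(int a, int b) { return a + b; }\n" * 40
    release_2 = release_1 + b"int sub(int a, int b) { return a - b; }\n"
    rng = random.Random(0)
    unrelated = bytes(rng.getrandbits(8) for _ in range(len(release_1)))

    print("release_1 vs release_2:", ncd(release_1, release_2))
    print("release_1 vs unrelated:", ncd(release_1, unrelated))
```

Two caveats apply to any sketch of this kind. The result depends on the compressor: zlib only finds repetition within a 32 KB window, so it is a poor fit for large artefacts. A small distance also says only that one byte string helps to describe the other, not that the two artefacts behave alike.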


Keywords: Mutual Information; Software Evolution; Binary String; Information Distance; Kolmogorov Complexity
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Tom Arbuckle, Computer Science and Information Systems, University of Limerick, Ireland
