Interpreting Shared Information Content in Software Engineering: What Does It Mean to Say Two Artefacts Are Similar and Why?

  • Conference paper
Convergence and Hybrid Information Technology (ICHIT 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7425)

Abstract

Recent work has examined software evolution in terms of shared information content between artefacts representative of releases. These studies employ measurements based on the non-trivial concept of Kolmogorov complexity to calculate the quantity of shared information and have shown some of the promise of this characterisation of software evolution.

At the same time, however, questions and enquiries related to this work have shown that interpreting the results requires some care. What does it mean to say that two software artefacts are “similar” in an information theoretic sense? What does shared information content mean in a software engineering context?

Starting from questions that software engineers could ask about the results of this method, this paper provides answers by reviewing and referring back to the underlying mathematics. It is hoped that this will serve both to dispel myths and to encourage interest, curiosity and critique.
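Kolmogorov complexity is not computable, so practical measurements of shared information have to approximate it. A minimal sketch follows, assuming the widely used normalised compression distance (NCD), in which a real compressor stands in for Kolmogorov complexity. The choice of zlib, the function names and the sample "releases" are illustrative assumptions, not the paper's own tooling or data.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for K)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # size of the two artefacts compressed together
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical artefacts: two "releases" of a small source file, plus noise.
release_1 = b"int add(int a, int b) { return a + b; }\n" * 50
release_2 = release_1 + b"int sub(int a, int b) { return a - b; }\n" * 5
rng = random.Random(0)  # fixed seed so the example is repeatable
unrelated = bytes(rng.getrandbits(8) for _ in range(2000))

print("release_1 vs release_2:", ncd(release_1, release_2))
print("release_1 vs unrelated:", ncd(release_1, unrelated))
```

A smaller value means that compressing the two artefacts together saves more space than compressing them separately, which is the sense in which they "share information". The result depends on the compressor chosen, one reason the interpretation requires care.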

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Arbuckle, T. (2012). Interpreting Shared Information Content in Software Engineering: What Does It Mean to Say Two Artefacts Are Similar and Why? In: Lee, G., Howard, D., Kang, J.J., Ślęzak, D. (eds) Convergence and Hybrid Information Technology. ICHIT 2012. Lecture Notes in Computer Science, vol 7425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32645-5_79

  • DOI: https://doi.org/10.1007/978-3-642-32645-5_79

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32644-8

  • Online ISBN: 978-3-642-32645-5

  • eBook Packages: Computer Science, Computer Science (R0)
