Interpreting Shared Information Content in Software Engineering: What Does It Mean to Say Two Artefacts Are Similar and Why?

Arbuckle, Tom

doi:10.1007/978-3-642-32645-5_79

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7425)

Included in the following conference series:

International Conference on Hybrid Information Technology

Abstract

Recent work has examined software evolution in terms of shared information content between artefacts representative of releases. These studies employ measurements based on the non-trivial concept of Kolmogorov complexity to calculate the quantity of shared information and have shown some of the promise of this characterisation of software evolution.

At the same time, however, questions and enquiries related to this work have shown that interpreting the results requires some care. What does it mean to say that two software artefacts are “similar” in an information theoretic sense? What does shared information content mean in a software engineering context?

Starting from questions that software engineers could ask about the results of this method, this paper provides answers by reviewing and referring back to the underlying mathematics. It is hoped that this will serve both to dispel myths and to encourage interest, curiosity and critique.
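
As a rough illustration of what a compression-based measurement of shared information can look like in practice: Kolmogorov complexity is not computable, so it is commonly approximated by the output length of a real compressor. The sketch below uses the normalised compression distance (NCD) with Python's `zlib`. The choice of measure, the compressor, the function names and the sample byte strings are all assumptions made for illustration and are not taken from this paper.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data: a crude stand-in for K(data)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance between two byte strings.

    Lower values suggest more shared information; values near 1 suggest
    that compressing x and y together saves almost nothing.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical artefacts: two "releases" that differ by one added function,
# and a block of pseudo-random bytes unrelated to either.
release_1 = b"def area(w, h):\n    return w * h\n" * 20
release_2 = release_1 + b"def perimeter(w, h):\n    return 2 * (w + h)\n"
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(2000))

similar = ncd(release_1, release_2)
dissimilar = ncd(release_1, noise)
```

Under these assumptions, `similar` should come out well below `dissimilar`. The absolute values depend heavily on the compressor and on the size of the inputs, which is one reason such results need careful interpretation.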

Author information

Authors and Affiliations

Computer Science and Information Systems, University of Limerick, Ireland
Tom Arbuckle

Editor information

Editors and Affiliations

Dept of Computer Engineering, Hannam University, Korea
Geuk Lee
Computer Science and Information System, University of Limerick, Limerick, Ireland
Daniel Howard
Department of Information and Communication, Dong Seoul University, 423 Bokjeong-Dong, Sujeong-Gu, Seongnam, Gyunggi, Korea
Jeong Jin Kang
Institute of Mathematics, University of Warsaw, ul. Banacha 2, 02-097, Warsaw, Poland
Dominik Ślęzak

Cite this paper

Arbuckle, T. (2012). Interpreting Shared Information Content in Software Engineering: What Does It Mean to Say Two Artefacts Are Similar and Why? In: Lee, G., Howard, D., Kang, J.J., Ślęzak, D. (eds) Convergence and Hybrid Information Technology. ICHIT 2012. Lecture Notes in Computer Science, vol 7425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32645-5_79

DOI: https://doi.org/10.1007/978-3-642-32645-5_79
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32644-8
Online ISBN: 978-3-642-32645-5
eBook Packages: Computer Science, Computer Science (R0)
