Abstract
Recent work has examined software evolution in terms of shared information content between artefacts representative of releases. These studies employ measurements based on the non-trivial concept of Kolmogorov complexity to calculate the quantity of shared information and have shown some of the promise of this characterisation of software evolution.
At the same time, however, questions and enquiries related to this work have shown that interpreting the results requires some care. What does it mean to say that two software artefacts are “similar” in an information theoretic sense? What does shared information content mean in a software engineering context?
Starting from questions that software engineers could ask about the results of this method, this paper provides answers by reviewing and referring back to the underlying mathematics. It is hoped that this will serve not only to dispel myths but also to encourage interest, curiosity and critique.
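The abstract does not spell out how the quantity of shared information is calculated in practice, but the standard approach in this literature (Cilibrasi and Vitányi's normalized compression distance) approximates the uncomputable Kolmogorov complexity with a real compressor. The following is a minimal sketch of that idea, assuming the artefacts are available as byte strings; the sample inputs are hypothetical, not drawn from the paper:

```python
import bz2


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two artefacts.

    Approximates the normalized information distance by substituting
    compressed lengths for the (uncomputable) Kolmogorov complexity:
        NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    Values near 0 indicate high shared information content; values
    near 1 indicate little shared information.
    """
    cx = len(bz2.compress(x))
    cy = len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical "releases": two near-identical artefacts and one unrelated blob.
release_a = b"def f(x): return x + 1\n" * 50
release_b = b"def f(x): return x + 2\n" * 50
unrelated = bytes(range(256)) * 5

print(ncd(release_a, release_b))  # small: the artefacts share most of their information
print(ncd(release_a, unrelated))  # larger: little shared information
```

Note that the choice of compressor matters: the result is only meaningful when the compressor can exploit the redundancy actually present in the artefacts (see Cebrián et al. on common pitfalls, e.g. inputs exceeding the compressor's window size).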
Arbuckle, T. (2012). Interpreting Shared Information Content in Software Engineering: What Does It Mean to Say Two Artefacts Are Similar and Why?. In: Lee, G., Howard, D., Kang, J.J., Ślęzak, D. (eds) Convergence and Hybrid Information Technology. ICHIT 2012. Lecture Notes in Computer Science, vol 7425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32645-5_79