Redundancy-free analysis of multi-revision software artifacts

Alexandru, Carol V.; Panichella, Sebastiano; Proksch, Sebastian; Gall, Harald C.

doi:10.1007/s10664-018-9630-9

Published: 05 July 2018

Empirical Software Engineering, Volume 24, pages 332–380 (2019)

Carol V. Alexandru¹ (ORCID: orcid.org/0000-0002-2995-4954), Sebastiano Panichella¹, Sebastian Proksch¹ & Harald C. Gall¹

Abstract

Researchers often analyze several revisions of a software project to obtain historical data about its evolution. For example, they statically analyze the source code and monitor the evolution of certain metrics over multiple revisions. The time and resource requirements for running these analyses often make it necessary to limit the number of analyzed revisions, e.g., by only selecting major revisions or by using a coarse-grained sampling strategy, which could remove significant details of the evolution. Most existing analysis techniques are not designed for the analysis of multi-revision artifacts and treat each revision individually. However, the actual difference between two subsequent revisions is typically very small. Thus, tools tailored for the analysis of multiple revisions should only analyze these differences, thereby preventing re-computation and storage of redundant data, improving scalability and enabling the study of a larger number of revisions. In this work, we propose the Lean Language-Independent Software Analyzer (LISA), a generic framework for representing and analyzing multi-revision software artifacts. It employs a redundancy-free, multi-revision representation for artifacts and avoids re-computation by only analyzing changed artifact fragments across thousands of revisions. The evaluation of our approach consists of measuring the effect of each individual technique incorporated, an in-depth study of LISA's resource requirements and a large-scale analysis over 7 million program revisions of 4,000 software projects written in four languages. We show that the time and space requirements for multi-revision analyses can be reduced by multiple orders of magnitude compared to traditional, sequential approaches.
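The core idea of the abstract, analyzing only what changed between revisions and storing each result once per range of revisions, can be illustrated with a small sketch. This is not LISA's implementation or API: the names `RangeStore`, `analyze` and `count_lines`, the file-level granularity, and the in-memory revision format (one dict of path to content per revision) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RangeStore:
    """Stores one metric value per file per range of revisions (hypothetical layout)."""
    # path -> list of [first_rev, last_rev, value]
    ranges: dict = field(default_factory=dict)

    def record(self, path, rev, value):
        entries = self.ranges.setdefault(path, [])
        if entries and entries[-1][1] == rev - 1 and entries[-1][2] == value:
            entries[-1][1] = rev              # same value: extend the open range
        else:
            entries.append([rev, rev, value])  # new value: start a new range

    def lookup(self, path, rev):
        for first, last, value in self.ranges.get(path, []):
            if first <= rev <= last:
                return value
        return None


def count_lines(source):
    """Stand-in analysis: number of non-blank lines."""
    return sum(1 for line in source.splitlines() if line.strip())


def analyze(revisions, metric=count_lines):
    """Analyze revisions in order, recomputing the metric only for changed files.

    `revisions` is a list of dicts mapping file path to file content.
    Returns the range store and the number of metric computations performed.
    """
    store = RangeStore()
    computed = 0
    previous = {}   # files of the preceding revision
    cache = {}      # last computed value per path
    for rev, files in enumerate(revisions):
        for path, content in files.items():
            if path in previous and previous[path] == content:
                value = cache[path]           # unchanged: reuse earlier result
            else:
                value = metric(content)       # new or changed: analyze
                computed += 1
                cache[path] = value
            store.record(path, rev, value)
        previous = files
    return store, computed


# Three revisions of a two-file project; only b.py changes, once.
revisions = [
    {"a.py": "x = 1\n", "b.py": "y = 2\nz = 3\n"},
    {"a.py": "x = 1\n", "b.py": "y = 2\nz = 3\nw = 4\n"},
    {"a.py": "x = 1\n", "b.py": "y = 2\nz = 3\nw = 4\n"},
]
store, computed = analyze(revisions)
print(computed)       # metric computations actually performed
print(store.ranges)   # one entry per range of revisions, not per revision
```

A per-revision approach would run the metric six times here (two files in three revisions) and store six values. The sketch runs it three times and stores three ranges, and the gap widens as the number of revisions grows while the per-revision change stays small.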

Notes

http://www.eclipse.org/
https://doi.org/10.5281/zenodo.1211549
https://bitbucket.org/sealuzh/lisa
The top 500 sites on the web. http://web.archive.org/web/20170626103223/http://www.alexa.com/topsites. Accessed 26 June 2017
For example, https://github.com/antlr/grammars-v4 contains grammar files for over 60 structured file formats.
See http://goo.gl/7grx8L for more information on the performance of Scala collections.
SuperMicro SuperServer 4048B-TR4FT, 64 Intel Xeon E7-4850 CPUs with 128 threads, 3TB memory in total
The example projects for each language, ending in ‘-example-repository‘, can be found online at https://bitbucket.org/account/user/sealuzh/projects/LISA
Awesome python. https://github.com/vinta/awesome-python. Accessed 20 June 2017
We analyzed 30,000 GitHub projects – here are the top 100 libraries in Java, JS and Ruby. http://blog.takipi.com/we-analyzed-30000-github-projects-here-are-the-top-100-libraries-in-java-js-and-ruby/. Accessed 20 March 2016
The JSON representation can be found in the repository, here: https://goo.gl/oMDxzv
All parser integrations and mappings can be found here: https://goo.gl/6pT7sG
Infusion by Intooitus s.r.l. http://www.intooitus.com/products/infusion. Accessed 30 March 2014
These are the handicap-* branches visible in the LISA open source repository
This demonstration is part of the lisa-quickstart repository: https://bitbucket.org/sealuzh/lisa-quickstart/
https://neo4j.com/, http://rdf4j.org/
http://goo.gl/aWWCkN
http://goo.gl/VftJUW
OMG: KDM. http://www.omg.org/spec/KDM/1.3/. Accessed 6 October 2016
OMG: ASTM. http://www.omg.org/spec/ASTM/1.0/. Accessed 6 October 2016

Acknowledgements

We thank the reviewers for their valuable feedback. This research is partially supported by the Swiss National Science Foundation (Projects No. 149450 – “Whiteboard” and No. 166275 – “SURF-MobileAppsData”) and the Swiss Group for Original and Outside-the-box Software Engineering (CHOOSE).

Author information

Authors and Affiliations

Software Evolution and Architecture Lab - s.e.a.l., Binzmühlestrasse 14, CH-8050, Zürich, Switzerland
Carol V. Alexandru, Sebastiano Panichella, Sebastian Proksch & Harald C. Gall

Corresponding author

Correspondence to Carol V. Alexandru.

Additional information

Communicated by: Gabriele Bavota and Andrian Marcus

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Alexandru, C.V., Panichella, S., Proksch, S. et al. Redundancy-free analysis of multi-revision software artifacts. Empir Software Eng 24, 332–380 (2019). https://doi.org/10.1007/s10664-018-9630-9

Published: 05 July 2018
Issue Date: 15 February 2019
DOI: https://doi.org/10.1007/s10664-018-9630-9
