Labeling source code with information retrieval methods: an empirical study

De Lucia, Andrea; Di Penta, Massimiliano; Oliveto, Rocco; Panichella, Annibale; Panichella, Sebastiano

doi:10.1007/s10664-013-9285-5

Published: 13 November 2013

Empirical Software Engineering, volume 19, pages 1383–1420 (2014)

Abstract

To support program comprehension, software artifacts can be labeled (for example, within software visualization tools) with a set of representative words, hereafter referred to as labels. Such labels can be obtained using various approaches, including Information Retrieval (IR) methods or other simple heuristics. They provide a bird's-eye view of the source code, allowing developers to skim software components quickly and make more informed decisions about which parts of the source code they need to analyze in detail. However, few empirical studies have been conducted to verify whether the extracted labels make sense to software developers. This paper investigates (i) to what extent various IR techniques and other simple heuristics overlap with (and differ from) labeling performed by humans; (ii) what kinds of source code terms humans use when labeling software artifacts; and (iii) what factors, in particular what characteristics of the artifacts to be labeled, influence the performance of automatic labeling techniques. We conducted two experiments in which we asked a group of students (38 in total) to label 20 classes from two Java software systems, JHotDraw and eXVantage. Then, we analyzed to what extent the words identified with an automated technique (including Vector Space Models, Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), as well as customized heuristics extracting words from specific source code elements) overlap with those identified by humans. Results indicate that, in most cases, simpler automatic labeling techniques, based on the use of words extracted from class and method names as well as from class comments, better reflect human-based labeling. Clustering-based approaches (LSI and LDA) are instead more worthwhile for source code artifacts with high verbosity, as well as for artifacts that require more effort to label manually.
The obtained results help to define guidelines on how to build effective automatic labeling techniques, and provide some insights on the actual usefulness of automatic labeling techniques during program comprehension tasks.
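
As a rough illustration of the simple heuristics mentioned above (words extracted from class and method names), the following Python sketch splits identifiers into words and ranks them by frequency. It is not the procedure used in the study: the function names, the stop-word list, and the example class are all hypothetical, and the study's actual term extraction and weighting are not described in this excerpt.

```python
import re
from collections import Counter

# Hypothetical stop list (an assumption, not taken from the study).
STOP_WORDS = {"get", "set", "is", "the", "a", "an", "of", "to"}


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Acronym runs (XML), capitalized or lowercase words, digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def label_from_names(class_name, method_names, k=5):
    """Return the k most frequent words found in class and method names."""
    counts = Counter()
    for name in [class_name, *method_names]:
        counts.update(w for w in split_identifier(name) if w not in STOP_WORDS)
    # Rank by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical class, loosely in the style of a drawing framework.
label = label_from_names(
    "RectangleFigure",
    ["getFigureBounds", "setBounds", "drawFigure", "moveBy"],
    k=2,
)
```

Here `label` holds the two most frequent name words for the made-up class. A fuller treatment would also draw on class comments, which the abstract reports as a useful source of label words.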

Notes

http://www.jhotdraw.org/
http://www.research.avayalabs.com/
http://distat.unimol.it/reports/labeling/
The number of unique terms ranges from 26 to 186, while the number of documents, i.e., methods, from 4 to 37.
Note that both LSI and LDA have been used in the same way by other authors to support different software engineering tasks. For instance, both techniques have been applied at class level to compute class cohesion and coupling, with good results (Liu et al. 2009; Marcus and Poshyvanyk 2005; Poshyvanyk and Marcus 2006).
Note that in our case the asymmetric Jaccard overlap coincides with the precision measure (Baeza-Yates and Ribeiro-Neto 1999). Assuming that K(C_i) represents the set of "correct" keywords, the overlap measures the proportion of identified keywords that are actually correct, i.e., precision.
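
The note above can be made concrete with a small Python sketch. It assumes the overlap is normalized by the number of automatically identified keywords, which is what the stated equivalence with precision implies; the exact formula is not given in this excerpt. The function name and the example keyword sets are hypothetical.

```python
def asymmetric_overlap(correct, identified):
    """Share of identified keywords that also appear in the correct set.

    Assumption: the denominator is the size of the identified set, so the
    value equals the standard precision measure.
    """
    correct, identified = set(correct), set(identified)
    if not identified:
        return 0.0  # convention chosen here for an empty label
    return len(correct & identified) / len(identified)


# Hypothetical keyword sets for one class C_i.
human_keywords = {"figure", "draw", "bounds"}                # plays the role of K(C_i)
automatic_keywords = {"figure", "bounds", "handle", "color"}

overlap = asymmetric_overlap(human_keywords, automatic_keywords)  # 2 of 4 -> 0.5
```

The measure is asymmetric because swapping the two arguments changes the denominator: with the sets above, the reversed call yields 2/3 rather than 1/2.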

References

Antoniol G, Canfora G, Casazza G, De Lucia A, Merlo E (2002) Recovering traceability links between code and documentation. IEEE Trans Softw Eng 28(10):970–983
Asuncion HU, Asuncion A, Taylor RN (2010) Software traceability with topic modeling. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering. ACM Press, Cape Town, South Africa, pp 95–104
Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. Addison-Wesley
Baker RD (1995) Modern permutation test software. In: Edgington E (ed) Randomization tests. Marcel Dekker
Baldi P, Lopes CV, Linstead E, Bajracharya SK (2008) A theory of aspects as latent topics. In: Proceedings of the 23rd annual ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications. ACM Press, Nashville, TN, USA, pp 543–562
Binkley D, Feild H, Lawrie D, Pighin M (2007) Software fault prediction using language processing. In: Proceedings of the testing: academic and industrial conference practice and research techniques. IEEE Computer Society, pp 99–110
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Buse RPL, Weimer W (2010) Automatically documenting program changes. In: Proceedings of the 25th IEEE/ACM international conference on automated software engineering. ACM Press, Antwerp, Belgium, pp 33–42
Canfora G, Cerulo L (2005) Impact analysis by mining software and change request repositories. In: Proceedings of 11th IEEE international symposium on software metrics. IEEE CS Press, Como, Italy, pp 20–29
Chang J, Blei DM (2010) Hierarchical relational models for document networks. Ann Appl Stat 4(1):124–150
Cleland-Huang J, Czauderna A, Gibiec M, Emenecker J (2010) A machine learning approach for tracing regulatory codes to product specific requirements. In: Proc. of ICSE, pp 155–164
Cullum JK, Willoughby RA (1998) Lanczos algorithms for large symmetric eigenvalue computations, vol 1, chapter Real rectangular matrices. Birkhäuser, Boston
De Lucia A, Di Penta M, Oliveto R (2011) Improving source code lexicon via traceability and information retrieval. IEEE Trans Softw Eng 37(2):205–227
De Lucia A, Fasano F, Oliveto R, Tortora G (2007) Recovering traceability links in software artefact management systems using information retrieval methods. ACM Trans Soft Eng Methodol 16(4), article no. 13
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Detienne F (2002) Software design: cognitive aspects. Springer Verlag
Gethers M, Oliveto R, Poshyvanyk D, De Lucia A (2011) On integrating orthogonal information retrieval methods to improve traceability recovery. In: Proceedings of the 27th international conference on software maintenance. IEEE Press, Williamsburg, USA, pp 133–142
Gethers M, Savage T, Di Penta M, Oliveto R, Poshyvanyk D, De Lucia A (2011) CodeTopics: which topic am I coding now? In: Proceedings of the 33rd International conference on software engineering, ICSE 2011, Waikiki, Honolulu, HI, USA, 21–28 May 2011. ACM, pp 1034–1036
Grissom RJ, Kim JJ (2005) Effect sizes for research: a broad practical approach, 2nd edn. Lawrence Erlbaum Associates
Guerrouj L, Di Penta M, Antoniol G, Guéhéneuc Y-G (2011) TIDIER: an identifier splitting approach using speech recognition techniques. J Softw Evol Process 25(6):575–599
Haiduc S, Aponte J, Marcus A (2010) Supporting program comprehension with source code summarization. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering. ACM Press, Cape Town, South Africa, pp 223–226
Haiduc S, Aponte J, Moreno L, Marcus A (2010) On the use of automated text summarization techniques for summarizing source code. In: Proceedings of the 17th working conference on reverse engineering. IEEE Computer Society, Beverly, MA, USA, pp 35–44
Hayes JH, Dekhtyar A, Sundaram SK (2006) Advancing candidate link generation for requirements tracing: The study of methods. IEEE Trans Softw Eng 32(1):4–19
Hindle A, Bird C, Zimmermann T, Nagappan N (2012) Relating requirements to implementation via topic analysis: Do topics extracted from requirements make sense to managers and developers? In: Proceedings of the 28th international conference on software maintenance. IEEE CS Press, Riva del Garda, Italy
Hindle A, Ernst NA, Godfrey MW, Mylopoulos J (2011) Automated topic naming to support cross-project analysis of software maintenance activities. In: Proceedings of the 8th international working conference on mining software repositories. IEEE CS Press, Waikiki, Honolulu, USA, pp 163–172
Holm S (1979) A simple sequentially rejective Bonferroni test procedure. Scand J Stat 6:65–70
Ko AJ, Myers BA, Coblenz MJ, Aung HH (2006) An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Trans Softw Eng 32(12):971–987
Kuhn A, Ducasse S, Gîrba T (2007) Semantic clustering: Identifying topics in source code. Inf Softw Technol 49(3):230–243
LaToza TD, Venolia G, DeLine R (2006) Maintaining mental models: a study of developer work habits. In: Proceedings of the 28th international conference on software engineering. ACM Press, Shanghai, China, pp 492–501
Lavrenko V (2009) A generative theory of relevance, vol 26. Springer
Lawrie D, Feild H, Binkley D (2007) An empirical study of rules for well-formed identifiers. J Softw Maint 19(4):205–229
Liblit B, Begel A, Sweetser E (2006) Cognitive perspectives on the role of naming in computer programs. In: Proceedings of the 18th annual workshop on psychology of programming. University of Sussex, Brighton, UK
Linstead E, Lopes CV, Baldi P (2008) An application of latent Dirichlet allocation to analyzing software evolution. In: Proceedings of the 7th international conference on machine learning and applications. IEEE CS Press, San Diego, California, USA, pp 813–818
Liu Y, Poshyvanyk D, Ferenc R, Gyimóthy T, Chrisochoides N (2009) Modeling class cohesion as mixtures of latent topics. In: Proc. of ICSM, pp 233–242
Maletic JI, Marcus A (2001) Supporting program comprehension using semantic and structural information. In: Proceedings of 23rd international conference on software engineering. IEEE CS Press, Toronto, Ontario, Canada, pp 103–112
Marcus A, Maletic JI (2001) Identification of high-level concept clones in source code. In: Proceedings of 16th IEEE international conference on automated software engineering. IEEE CS Press, San Diego, California, USA, pp 107–114
Marcus A, Maletic JI (2003) Recovering documentation-to-source-code traceability links using latent semantic indexing. In: Proceedings of 25th international conference on software engineering. IEEE CS Press, Portland, Oregon, USA, pp 125–135
Marcus A, Poshyvanyk D (2005) The conceptual cohesion of classes. In: Proceedings of 21st IEEE international conference on software maintenance. IEEE CS Press, Budapest, Hungary, pp 133–142
Marcus A, Poshyvanyk D, Ferenc R (2008) Using the conceptual cohesion of classes for fault prediction in object-oriented systems. IEEE Trans Softw Eng 34(2):287–300
Medini S, Antoniol G, Guéhéneuc Y-G, Di Penta M, Tonella P (2012) Scan: an approach to label and relate execution trace segments. In: Proceedings of the 19th working conference on reverse engineering. IEEE Press, Kingston, Ontario, Canada
Murphy G (1996) Lightweight structural summarization as an aid to software evolution. PhD thesis, University of Washington
Porteous I, Newman D, Ihler A, Asuncion A, Smyth P, Welling M (2008) Fast collapsed Gibbs sampling for latent Dirichlet allocation. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY, USA, pp 569–577
Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137
Poshyvanyk D, Guéhéneuc Y-G, Marcus A, Antoniol G, Rajlich V (2007) Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval. IEEE Trans Softw Eng 33(6):420–432
Poshyvanyk D, Marcus A (2006) The conceptual coupling metrics for object-oriented systems. In: Proceedings of 22nd IEEE international conference on software maintenance. IEEE CS Press, Philadelphia, PA, USA, pp 469–478
Rajlich V, Wilde N (2002) The role of concepts in program comprehension. In: Proceedings of the 10th international workshop on program comprehension. IEEE Computer Society, Paris, France, pp 271–280
Rastkar S (2010) Summarizing software concerns. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering – student research competition. ACM Press, Cape Town, South Africa, pp 527–528
Rastkar S, Murphy GC, Murray G (2010) Summarizing software artifacts: a case study of bug reports. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering. ACM Press, Cape Town, South Africa, pp 505–514
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656
Sridhara G, Hill E, Muppaneni D, Pollock LL, Vijay-Shanker K (2010) Towards automatically generating summary comments for Java methods. In: Proceedings of the 25th IEEE/ACM international conference on automated software engineering. ACM Press, Antwerp, Belgium, pp 43–52
Sridhara G, Pollock LL, Vijay-Shanker K (2011) Automatically detecting and describing high level actions within methods. In: Proceedings of the 33rd International conference on software engineering. ACM Press, Honolulu, HI, USA, pp 101–110
Srivastava A, Sahami M (2009) Text mining: classification, clustering, and applications. Chapman & Hall/CRC
Storey M-AD (2006) Theories, tools and research methods in program comprehension: past, present and future. SQJ 14(3):187–208
Takang A, Grubb P, Macredie R (1996) The effects of comments and identifier names on program comprehensibility: an experiential study. J Program Lang 4(3):143–167
Teh YW, Newman D, Welling M (2006) A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In: NIPS, pp 1353–1360
Thomas SW, Adams B, Hassan AE, Blostein D (2010) Validating the use of topic models for software evolution. In: Tenth IEEE international working conference on source code analysis and manipulation, SCAM 2010. IEEE Computer Society, Timisoara, Romania, 12–13 Sept 2010, pp 55–64
Thomas SW, Adams B, Hassan AE, Blostein D (2011) Modeling the evolution of topics in source code histories. In: Proceedings of the 8th international working conference on mining software repositories. IEEE Press, Honolulu, HI, USA, pp 173–182

Acknowledgements

We would like to thank all the students who participated in our study. We would also like to thank the anonymous reviewers for their careful reading of our manuscript and their high-quality feedback. Their detailed comments helped us to improve the original version of this paper.

Author information

Authors and Affiliations

Software Engineering Lab, University of Salerno, Via Ponte don Melillo, 84084, Fisciano (SA), Italy
Andrea De Lucia & Annibale Panichella
RCOST, University of Sannio, Palazzo ex Poste, Viale Traiano, 82100, Benevento, Italy
Massimiliano Di Penta & Sebastiano Panichella
Department of Bioscience and Territory, University of Molise, C.da Fonte Lappone, 86090, Pesche (IS), Italy
Rocco Oliveto

Corresponding author

Correspondence to Rocco Oliveto.

Additional information

Communicated by: Michael Godfrey and Arie van Deursen

This paper is an extension of the work “Using IR Methods for Labeling Source Code Artifacts: Is It Worthwhile?”, which appeared in the Proceedings of the 20th IEEE International Conference on Program Comprehension, Passau, Bavaria, Germany, pp. 193–202, 2012. IEEE Press.

About this article

Cite this article

De Lucia, A., Di Penta, M., Oliveto, R. et al. Labeling source code with information retrieval methods: an empirical study. Empir Software Eng 19, 1383–1420 (2014). https://doi.org/10.1007/s10664-013-9285-5

Issue Date: October 2014
