
Empirical Software Engineering, Volume 19, Issue 5, pp 1383–1420

Labeling source code with information retrieval methods: an empirical study

  • Andrea De Lucia
  • Massimiliano Di Penta
  • Rocco Oliveto
  • Annibale Panichella
  • Sebastiano Panichella

Abstract

To support program comprehension, software artifacts can be labeled, for example within software visualization tools, with a set of representative words, hereafter referred to as labels. Such labels can be obtained using various approaches, including Information Retrieval (IR) methods or other simple heuristics. They provide a bird's-eye view of the source code, allowing developers to quickly look over software components and make more informed decisions on which parts of the source code they need to analyze in detail. However, few empirical studies have been conducted to verify whether the extracted labels make sense to software developers. This paper investigates (i) to what extent various IR techniques and other simple heuristics overlap with (and differ from) labeling performed by humans; (ii) what kinds of source code terms humans use when labeling software artifacts; and (iii) what factors, in particular what characteristics of the artifacts to be labeled, influence the performance of automatic labeling techniques. We conducted two experiments in which we asked a group of students (38 in total) to label 20 classes from two Java software systems, JHotDraw and eXVantage. Then, we analyzed to what extent the words identified by automated techniques, including Vector Space Models (VSM), Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and customized heuristics that extract words from specific source code elements, overlap with those identified by humans. Results indicate that, in most cases, simpler automatic labeling techniques, based on words extracted from class and method names as well as from class comments, better reflect human-based labeling. By contrast, clustering-based approaches (LSI and LDA) are more worthwhile for source code artifacts with high verbosity, as well as for artifacts that require more effort to label manually. The obtained results help to define guidelines on how to build effective automatic labeling techniques, and provide some insights on the actual usefulness of automatic labeling during program comprehension tasks.
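
To illustrate the kind of simple heuristic labeling the study compares against IR methods, the sketch below extracts candidate labels for a Java class from its identifiers and comments. It is not the authors' implementation: the camel-case splitting, the stop-word list, and the frequency-based ranking are assumptions made only to keep the example self-contained; function names such as label_class are hypothetical.

```python
import re
from collections import Counter

# Minimal stop-word/keyword list (an assumption; the paper's actual
# preprocessing, e.g. stemming, is not reproduced here).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "for", "is", "this",
              "public", "private", "protected", "class", "void", "return",
              "int", "string", "new", "static", "final", "param"}

def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.split(r"[_\W]+", identifier)
    words = []
    for part in parts:
        words.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [w.lower() for w in words if w]

def label_class(source, top_k=10):
    """Rank candidate labels for a Java class by term frequency,
    using only words taken from identifiers and comments."""
    comments = re.findall(r"//[^\n]*|/\*.*?\*/", source, flags=re.DOTALL)
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    terms = []
    for chunk in comments + identifiers:
        terms.extend(split_identifier(chunk))
    counts = Counter(t for t in terms if t not in STOP_WORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(top_k)]

if __name__ == "__main__":
    example = """
    /** Draws a rectangle figure on the canvas. */
    public class RectangleFigure {
        private int canvasWidth;
        public void drawRectangle() { /* paint the rectangle */ }
    }
    """
    print(label_class(example))  # e.g. ['rectangle', 'figure', 'canvas', ...]
```

In contrast to such term-extraction heuristics, the clustering-based techniques studied in the paper (LSI and LDA) derive labels from latent topics computed over the whole corpus, which is why their benefit shows up mainly for verbose artifacts.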

Keywords

Program comprehension · Software artifact labeling · Information retrieval · Empirical studies


Acknowledgements

We would like to thank all the students who participated in our study. We would also like to thank the anonymous reviewers for their careful reading of our manuscript and their high-quality feedback. Their detailed comments helped us improve the original version of this paper.


Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Andrea De Lucia (1)
  • Massimiliano Di Penta (2)
  • Rocco Oliveto (3)
  • Annibale Panichella (1)
  • Sebastiano Panichella (2)

  1. Software Engineering Lab, University of Salerno, Fisciano (SA), Italy
  2. RCOST, University of Sannio, Benevento, Italy
  3. Department of Bioscience and Territory, University of Molise, Pesche (IS), Italy
