Empirical Software Engineering, Volume 20, Issue 6, pp 1666–1720

Link analysis algorithms for static concept location: an empirical assessment

  • Giuseppe Scanniello
  • Andrian Marcus
  • Daniele Pascale

Abstract

During software evolution, one of the most important comprehension activities is concept location in source code, as it identifies the places in the code where changes are to be made in response to a modification request. Change requests (such as bug fixes or new feature requests) are usually formulated in natural language, while the source code also includes large amounts of text. Consequently, many of the existing concept location techniques are based on text search or text retrieval. Such approaches reformulate concept location as a document retrieval problem. We refine and improve such solutions by leveraging dependencies between source code elements. Dependency information is used by a link analysis algorithm to rank the document space and to improve concept location based on text retrieval. We implemented our solution using the PageRank algorithm, which is used in web document retrieval applications. The results of an empirical evaluation indicate that the new approach leads to better retrieval performance than baseline approaches that use text retrieval and clustering. In addition, we present the results of a controlled experiment and of a differentiated replication to assess whether the new technique supports users in identifying the places in the code where changes are to be made. The results of these experiments reveal that users of our technique were significantly better supported in identifying the code to be changed in response to a bug-fixing request than users who did not use it.
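
As a rough illustration of the idea of ranking the document space with link analysis, the following minimal sketch runs PageRank by power iteration over a small dependency graph and blends the result with text-retrieval scores. It is not the implementation evaluated in the article: the method names, the example scores, the edge direction (caller to callee), the damping factor of 0.85, and the linear blend in `rerank` are all assumptions made for illustration, since the abstract does not state how the two signals are combined.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a dependency graph (source -> targets)."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for source in nodes:
            targets = graph.get(source, [])
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


def rerank(text_scores: Dict[str, float], graph: Dict[str, List[str]],
           weight: float = 0.5) -> List[str]:
    """Blend text scores with normalized PageRank (hypothetical linear blend)."""
    ranks = pagerank(graph)
    top = max(ranks.values())
    combined = {doc: (1.0 - weight) * score + weight * ranks.get(doc, 0.0) / top
                for doc, score in text_scores.items()}
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical methods and dependencies (caller -> callees), for illustration only.
graph = {
    "Cli.main": ["Parser.parse"],
    "Parser.parse": ["Lexer.next", "Ast.build"],
    "Ast.build": ["Lexer.next"],
    "Lexer.next": [],
}
# Hypothetical text-retrieval similarities between a change request and each method.
text_scores = {"Parser.parse": 0.60, "Cli.main": 0.58,
               "Lexer.next": 0.55, "Ast.build": 0.20}

print(rerank(text_scores, graph, weight=0.0))  # ordering by text retrieval alone
print(rerank(text_scores, graph, weight=0.5))  # ordering after blending in PageRank
```

With `weight=0.0` the ordering depends only on the text scores; raising `weight` lets methods that many others depend on move up the list, which is the intuition behind using dependency information to refine a text-retrieval ranking.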

Keywords

  • Concept location
  • Controlled experiments
  • Information retrieval
  • Experiments
  • Empirical study

Notes

Acknowledgments

We would like to thank Michele Brescia, who developed some of the software modules of the prototype used in the experimentation presented here, and Pasquale Ricciardi for helping us in the execution of the replication. We also thank the participants in the controlled experiments. Andrian Marcus was supported in part by grants from the US National Science Foundation: CCF-1017263 and CCF-0845706.

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Giuseppe Scanniello (1)
  • Andrian Marcus (2)
  • Daniele Pascale (1)

  1. Dipartimento di Matematica, Informatica e Economia, University of Basilicata, Potenza, Italy
  2. Department of Computer Science, University of Texas at Dallas, Richardson, USA
