Empirical Software Engineering, Volume 18, Issue 2, pp 277–309

Integrating information retrieval, execution and link analysis algorithms to improve feature location in software

Article

Abstract

Data fusion is the process of integrating multiple sources of information so that their combination yields better results than when the data sources are used individually. This paper applies the idea of data fusion to feature location, the process of identifying the source code that implements specific functionality in software. A data fusion model for feature location is presented that defines new feature location techniques by combining information from textual, dynamic, and web mining (link analysis) algorithms applied to software. A novel contribution of the proposed model is the use of advanced web mining algorithms to analyze execution information during feature location. The results of an extensive evaluation on three Java systems indicate that the new feature location techniques based on web mining improve the effectiveness of existing approaches by as much as 87%.

Keywords

Concept location; Feature identification; Information retrieval; Web mining; Program comprehension; Software evolution and maintenance

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. The College of William and Mary, Williamsburg, USA
