Empirical Software Engineering, Volume 19, Issue 3, pp 465–500

Configuring latent Dirichlet allocation based feature location

  • Lauren R. Biggers
  • Cecylia Bocovich
  • Riley Capshaw
  • Brian P. Eddy
  • Letha H. Etzkorn
  • Nicholas A. Kraft

Abstract

Feature location is a program comprehension activity, the goal of which is to identify source code entities that implement a functionality. Recent feature location techniques apply text retrieval models such as latent Dirichlet allocation (LDA) to corpora built from text embedded in source code. These techniques are highly configurable, and the literature offers little insight into how different configurations affect their performance. In this paper we present a study of an LDA based feature location technique (FLT) in which we measure the performance effects of using different configurations to index corpora and to retrieve 618 features from 6 open source Java systems. In particular, we measure the effects of the query, the text extractor configuration, and the LDA parameter values on the accuracy of the LDA based FLT. Our key findings are that exclusion of comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context. Based on the results of our case study, we offer specific recommendations for configuring the LDA based FLT.
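As a rough illustration of what a text extractor configuration can mean here, the sketch below builds the term list for one Java method with identifiers, comments, and string literals each switched on or off. It is a minimal sketch under stated assumptions, not the extractor used in the study: the function name `extract_terms`, its three flags, the abbreviated stop list, and the sample method are all hypothetical, and the regular expressions ignore corner cases such as comment markers inside string literals.

```python
import re

# Hypothetical, simplified patterns; a real extractor would use a Java parser.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"')
WORD_RE = re.compile(r"[A-Za-z]+")
# Splits camelCase words and keeps acronyms together (XMLParser -> XML, Parser).
CAMEL_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")

# Abbreviated stop list of Java keywords and common type names (an assumption).
STOP = {"public", "private", "protected", "static", "final", "void", "int",
        "string", "return", "new", "class", "if", "else", "for", "while"}


def extract_terms(source, identifiers=True, comments=True, literals=True):
    """Return lowercase terms from the selected parts of a Java snippet."""
    comment_text = " ".join(COMMENT_RE.findall(source))
    code = COMMENT_RE.sub(" ", source)        # code without comments
    literal_text = " ".join(STRING_RE.findall(code))
    code = STRING_RE.sub(" ", code)           # code without string literals

    parts = []
    if identifiers:
        parts.append(code)
    if comments:
        parts.append(comment_text)
    if literals:
        parts.append(literal_text)

    terms = []
    for word in WORD_RE.findall(" ".join(parts)):
        for piece in CAMEL_RE.findall(word):
            term = piece.lower()
            if term not in STOP:              # stop list applies to every part
                terms.append(term)
    return terms


# Hypothetical sample method, used only to exercise the sketch.
SAMPLE = '''
// Save the open document
public void saveFile(String path) {
    log("writing file");
    writer.write(path);
}
'''

if __name__ == "__main__":
    print(extract_terms(SAMPLE))
    print(extract_terms(SAMPLE, comments=False, literals=False))
```

With comments and literals switched off, terms such as "document" and "writing" never reach the corpus, which is the kind of vocabulary loss the abstract associates with lower accuracy.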

Keywords

Software evolution, Program comprehension, Feature location, Static analysis, Text retrieval

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Lauren R. Biggers (1)
  • Cecylia Bocovich (2, 5)
  • Riley Capshaw (3)
  • Brian P. Eddy (1)
  • Letha H. Etzkorn (4)
  • Nicholas A. Kraft (1)

  1. Department of Computer Science, The University of Alabama, Tuscaloosa, USA
  2. Department of Mathematics, Statistics, and Computer Science, Macalester College, Saint Paul, USA
  3. Department of Mathematics & Computer Science, Hendrix College, Conway, USA
  4. Department of Computer Science, The University of Alabama in Huntsville, Huntsville, USA
  5. David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
