Configuring latent Dirichlet allocation based feature location
- Cite this article as:
- Biggers, L.R., Bocovich, C., Capshaw, R. et al. Empir Software Eng (2014) 19: 465. doi:10.1007/s10664-012-9224-x
Feature location is a program comprehension activity, the goal of which is to identify source code entities that implement a functionality. Recent feature location techniques apply text retrieval models such as latent Dirichlet allocation (LDA) to corpora built from text embedded in source code. These techniques are highly configurable, and the literature offers little insight into how different configurations affect their performance. In this paper, we present a study of an LDA-based feature location technique (FLT) in which we measure the performance effects of using different configurations to index corpora and to retrieve 618 features from 6 open-source Java systems. In particular, we measure the effects of the query, the text extractor configuration, and the LDA parameter values on the accuracy of the LDA-based FLT. Our key findings are that excluding comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context. Based on the results of our case study, we offer specific recommendations for configuring the LDA-based FLT.