
Empirical Software Engineering, Volume 19, Issue 3, pp 465–500

Configuring latent Dirichlet allocation based feature location

  • Lauren R. Biggers
  • Cecylia Bocovich
  • Riley Capshaw
  • Brian P. Eddy
  • Letha H. Etzkorn
  • Nicholas A. Kraft

Abstract

Feature location is a program comprehension activity, the goal of which is to identify source code entities that implement a functionality. Recent feature location techniques apply text retrieval models such as latent Dirichlet allocation (LDA) to corpora built from text embedded in source code. These techniques are highly configurable, and the literature offers little insight into how different configurations affect their performance. In this paper we present a study of an LDA based feature location technique (FLT) in which we measure the performance effects of using different configurations to index corpora and to retrieve 618 features from 6 open source Java systems. In particular, we measure the effects of the query, the text extractor configuration, and the LDA parameter values on the accuracy of the LDA based FLT. Our key findings are that exclusion of comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context. Based on the results of our case study, we offer specific recommendations for configuring the LDA based FLT.

Keywords

Software evolution · Program comprehension · Feature location · Static analysis · Text retrieval

1 Introduction

Software systems continuously undergo incremental change to add new functionalities or to remove defects in existing functionalities. A software developer who is tasked with changing an unfamiliar system must spend effort on program comprehension activities to gain the knowledge needed to make the change. Thus, to curb total software cost, techniques that target these activities are necessary. Similarly, automation of software change tasks that require program comprehension, such as concept location (Biggerstaff et al. 1993) and impact analysis (De Lucia et al. 2007), is key to reducing total software cost.

Feature location is a program comprehension activity in which a developer locates the source code entities that implement a functionality (i.e., a feature, Rajlich and Wilde 2002). Due to the large size of modern software systems, manual feature location is impractical. Thus, researchers have devoted much effort to developing (partially) automated feature location techniques (FLTs), many of which are based on text retrieval (Zhao et al. 2006; Poshyvanyk et al. 2007; Liu et al. 2007; Lukins et al. 2008). Indeed, Dit et al. (2011b) recently reviewed 89 articles from 25 venues and found that 27 of the 52 FLTs are based (at least in part) on text retrieval. However, these text retrieval based techniques are highly configurable (Marcus and Menzies 2010). For example, when using latent semantic indexing (LSI, Deerwester et al. 1990) we must select k, the number of (reduced) dimensions, or when using latent Dirichlet allocation (LDA, Blei et al. 2003) we must select α, β, and K, the two smoothing hyperparameters and the number of topics, respectively.

Which text to extract from the source code is another important configuration decision. In particular, the text extractor (Section 2.1.2) has seven possible configurations: (1) identifiers only, (2) comments only, (3) literals only, (4) identifiers and comments, (5) identifiers and literals, (6) comments and literals, and (7) identifiers, comments, and literals. Before we can index the source code, we must choose one of these configurations. Researchers (Marcus et al. 2004; Liu et al. 2007; Lukins et al. 2008) choose the seventh configuration most often. Liu et al. (2007), for example, defend this choice by stating that “a significant amount of domain knowledge is embedded in the comments and identifiers present in source code.” Zhao et al. (2006), however, choose the first configuration.

Unfortunately, few studies of text retrieval based FLTs directly address the decisions that a practitioner or researcher must make when configuring the FLT. Indeed, the feature location literature contains no empirical evidence that supports the selection of one configuration over another. Closely related literature provides some empirical evidence, but it is mixed. For example, Marcus and Poshyvanyk (2005) report no performance change when including comments in their study of conceptual cohesion of classes, whereas Abadi et al. (2008) report a performance decrease when excluding comments from their study of traceability.

In this paper we present a case study in which we consider the configuration of an LDA based FLT using 618 features in 6 open source Java systems. Specifically, we consider five configuration parameters, the first of which is the query. In recent work, Scanniello and Marcus (2011) consider the effect of the query on the performance of a hybrid FLT which combines a vector space model and clustering. The results of their study indicate that the query has a varied effect in the context of their approach. The second configuration parameter that we study is the extracted text, and we are aware of no study that considers this parameter. The remaining configuration parameters are the number of topics (K) and the two smoothing hyperparameters (α and β). There is little discussion of selecting K in the software engineering literature (Maskeri et al. 2008; Lukins et al. 2010), though Griffiths and Steyvers (2004) investigate methods for choosing optimal K values in the context of natural language (NL) document clustering. Similarly, there is no work in the software engineering literature that investigates the effects of α and β, though Asuncion et al. (2009) investigate the issue in the context of NL document clustering.

2 Background and Related Work

Software systems comprise many artifacts, including structured documents (e.g., XML configuration files), semi-structured documents (e.g., source code files), and unstructured documents (e.g., bug reports or requirements specifications). The text embedded in these software artifacts captures information such as the application domain and the developers’ knowledge. For the past decade, researchers have sought to automate software maintenance and program comprehension tasks by analyzing this embedded text using text retrieval (TR) methods. For example, TR methods show efficacy in automating concept location (Marcus et al. 2004; Zhao et al. 2006; Poshyvanyk et al. 2007; Liu et al. 2007; Revelle and Poshyvanyk 2009; Lukins et al. 2010), impact analysis (Canfora and Cerulo 2006; Poshyvanyk et al. 2009; Gethers and Poshyvanyk 2010), and traceability link recovery (Antoniol et al. 2002; De Lucia et al. 2007; Asuncion et al. 2010; Oliveto et al. 2010).

2.1 Source Code Indexing and Retrieval

After reviewing terminology, we describe the two-part process of applying a TR method to source code (Fig. 1).
Fig. 1  Source code indexing and retrieval process

2.1.1 Terminology

We adopt terminology similar to that of Abebe et al. (2009b). In particular we use the following definitions:
  • Term: a sequence of letters and the basic unit of discrete data in a lexicon
  • Token: a sequence of non-whitespace characters; contains one or more terms
  • Entity: a source element such as a class or method
  • Identifier: a token representing the name of an entity
  • Comment: a sequence of tokens (delimited by language-specific markers, e.g., /* */)
  • String literal: a sequence of tokens (delimited by quotes)
  • Word: the smallest free form in a language

A term is one of: word, abbreviation of a word, contraction of one or two words, acronym of a series of words.

2.1.2 Indexing

The left side of Fig. 1 illustrates the source code indexing process. A document extractor takes source code as input and produces a corpus as output. Each document in the corpus contains the terms associated with a particular entity, typically a class or method.

Text Extraction

The text extractor is the first part of the document extractor. It parses the source code and produces a token stream. The text extractor may be configured to extract tokens (i.e., index source code text) from any combination of the following sources: identifiers, comments, and string literals. Abebe et al. (2009a) define an identifier to be a class name, attribute name, method name, or parameter name. We extend that definition and also allow an identifier to be a local variable name, enumeration constant name, label name, or a generic/template parameter name. With regard to Java, we consider an interface or enum name to be a class name. Comments generally are used either to map requirements to code or to describe the code (Vinz and Etzkorn 2006). Copyright notices typically are not indexed, as they add no information about program purpose or behavior. We think that literals generally are used either to convey information to the end-user (e.g., an error message) or to the developer (e.g., a debugging message). In the former case, we expect the literals to contain domain information, and in the latter case, we expect them to contain implementation information.

Given the three sources of source code text, there are seven possible text extractor configurations:

ID    Source(s) of extracted text
I     Identifiers
C     Comments
L     Literals
IC    Identifiers and comments
IL    Identifiers and literals
CL    Comments and literals
ICL   Identifiers, comments, and literals

Two of the seven configurations appear in the literature on feature location: I (Zhao et al. 2006) and ICL (Marcus et al. 2004; Poshyvanyk et al. 2007; Liu et al. 2007; Lukins et al. 2008; Revelle et al. 2010). To the best of our knowledge, the remaining five configurations do not appear in the literature on feature location.
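To make this configuration space concrete, the sketch below (in Python, not the ANTLR-based implementation described later in this paper) models the seven configurations as three boolean flags; the ExtractorConfig class and the token-source dictionary are illustrative assumptions rather than part of any published tool.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ExtractorConfig:
        """One of the seven text extractor configurations (I, C, L, IC, IL, CL, ICL)."""
        identifiers: bool
        comments: bool
        literals: bool

        @property
        def label(self):
            return (("I" if self.identifiers else "")
                    + ("C" if self.comments else "")
                    + ("L" if self.literals else ""))

    def extract_tokens(method_tokens, config):
        """Select the raw tokens for one method document according to the configuration.

        `method_tokens` is assumed to be a dict with the keys 'identifiers',
        'comments', and 'literals', each holding the raw tokens that a parser
        (e.g., ANTLR with a Java grammar) produced for the method.
        """
        selected = []
        if config.identifiers:
            selected += method_tokens["identifiers"]
        if config.comments:
            selected += method_tokens["comments"]
        if config.literals:
            selected += method_tokens["literals"]
        return selected

    # Example: the ICL configuration chosen most often in the literature.
    ICL = ExtractorConfig(identifiers=True, comments=True, literals=True)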

Preprocessing

The preprocessor is the second part of the document extractor. It applies a series of transformations to each token and produces one or more terms from each token. Common transformations (Marcus et al. 2004; Marcus and Menzies 2010) include:
  • Splitting: separate tokens into constituent terms based on common coding style conventions (e.g., the use of camel case or underscores) and on the presence of non-letters (e.g., punctuation); optionally retain the original token (and henceforth treat it as a term)

  • Normalizing: replace each upper case letter with the corresponding lower case letter (or vice-versa)

  • Filtering: remove common words such as articles (e.g., ‘an’ or ‘the’), programming language keywords, standard library entity names, or short words

  • Stemming: strip suffixes to reduce words to their stems

Splitting and stemming in particular can impact accuracy (Lawrie and Binkley 2011), but we do not consider those issues in this paper.
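As a rough illustration of these four transformations, the sketch below splits on camel case and non-letters, retains the original token, normalizes case, filters a small stop list and a few Java keywords, and stems with NLTK's Porter stemmer; the stop list, keyword list, and the use of NLTK are assumptions made for the example, not details prescribed by the transformations above.

    import re
    from nltk.stem import PorterStemmer  # assumed available; any Porter implementation works

    STOP_WORDS = {"the", "an", "a", "of", "to"}                    # illustrative stop list
    JAVA_KEYWORDS = {"public", "static", "void", "int", "return"}  # illustrative subset

    _stemmer = PorterStemmer()
    _camel = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

    def preprocess(token, min_length=3):
        """Split, normalize, filter, and stem one token; return the resulting terms."""
        # Splitting: break on non-letters and camel-case boundaries, keep the original token.
        pieces = re.split(r"[^A-Za-z]+", token)
        terms = [token]
        for piece in pieces:
            terms.extend(_camel.split(piece))
        # Normalizing: replace upper case letters with lower case letters.
        terms = [t.lower() for t in terms if t]
        # Filtering: drop stop words, language keywords, and short terms.
        terms = [t for t in terms
                 if t not in STOP_WORDS and t not in JAVA_KEYWORDS and len(t) >= min_length]
        # Stemming: reduce each remaining term to its stem.
        return [_stemmer.stem(t) for t in terms]

    print(preprocess("loadProject"))   # ['loadproject', 'load', 'project']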

Term Weighting

Different term weighting schemes may be used during indexing. When a term is added to a document, it is assigned a weight based on the scheme used. Common term weighting schemes include binary, term count, term frequency, and term frequency-inverse document frequency (tf-idf, Salton and Buckley 1988). In the binary term weighting scheme, each term has weight one (if it does appear in the document) or weight zero (if it does not appear in the document). In the term count weighting scheme, each term’s weight is the number of times it appears in the document, whereas in the term frequency weighting scheme, this count is normalized to prevent bias toward longer documents. Finally, in the tf-idf weighting scheme, each term’s weight increases proportionally to the number of times it appears in the document but is offset by its frequency in the corpus.
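The differences between these schemes can be seen in a few lines of code; the toy corpus below is hypothetical, and the tf-idf variant shown (raw count multiplied by the log of the inverse document frequency) is only one of several common formulations.

    import math
    from collections import Counter

    corpus = [["parse", "token", "token"], ["token", "stem"]]  # toy documents (lists of terms)

    def binary(doc):
        return {t: 1 for t in set(doc)}

    def term_count(doc):
        return dict(Counter(doc))

    def term_frequency(doc):
        counts = Counter(doc)
        return {t: c / len(doc) for t, c in counts.items()}  # normalized by document length

    def tf_idf(doc, corpus):
        n_docs = len(corpus)
        counts = Counter(doc)
        weights = {}
        for t, c in counts.items():
            df = sum(1 for d in corpus if t in d)    # document frequency of term t
            weights[t] = c * math.log(n_docs / df)   # offset by frequency in the corpus
        return weights

    print(term_count(corpus[0]))         # {'parse': 1, 'token': 2}
    print(tf_idf(corpus[0], corpus))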

We do not consider the issue of term weighting in this paper. We use a term count weighting scheme, which is the input expected by R lda, the LDA implementation that we use for the case study.

Model Construction

The center of Fig. 1 illustrates the final step in the source code indexing process. A TR method takes a corpus as input and produces a TR model as output. Commonly used TR methods include the vector space model (VSM, Salton 1989), latent semantic indexing (LSI, Deerwester et al. 1990), and latent Dirichlet allocation (LDA, Blei et al. 2003). The TR model contains a representation of each document in the corpus.

2.1.3 Retrieval

The right side of Fig. 1 illustrates the source code retrieval process. A query engine takes a TR model and a query as input and produces a ranked list of documents as output. Recall that each document represents a particular entity such as a class or method.

Querying

The query may be a string formulated manually or automatically by an end-user or developer. For example, researchers often use text from an issue report (the title, the description, or the combination of the two) to formulate a query string automatically (Dit et al. 2011b; Scanniello and Marcus 2011). In such cases the query preprocessor must mimic the document extractor’s preprocessing steps. Alternatively, the query may be a document from the corpus, in which case the query preprocessor is null.

Ranking

The query engine first applies a classifier (i.e., a similarity measure) pairwise to the preprocessed query and each document in the TR model. It next uses the computed similarity scores to rank the documents in descending order by similarity to the query.

2.2 Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) is a probabilistic generative model developed by Blei et al. (2003) for collections of discrete data. When used for text retrieval, LDA uses the co-occurrence of terms in a text corpus to identify the latent (i.e., hidden) structure of topics in the corpus. Each document in a corpus is modeled as a finite mixture over a set of topics, and each topic is modeled as an infinite mixture over a set of topic probabilities. That is, each document is modeled as a probability distribution indicating the likelihood that it expresses each topic, and each topic identified by LDA is modeled as a probability distribution indicating the likelihood of a term from the corpus being assigned to the topic.

LDA uses a bag-of-words representation in which each document is a vector of counts with V components, where V is the size of the vocabulary. Inputs to LDA include:
  • D, the documents

  • K, the number of topics

  • α, the Dirichlet hyperparameter for topic proportions

  • β, the Dirichlet hyperparameter for topic multinomials

Outputs of LDA are:
  • ϕ, the term-topic probability distribution

  • θ, the topic-document probability distribution

The hyperparameters α and β have a smoothing effect on the model. Specifically, α influences the topic distributions per document, whereas β influences the term distributions per topic. Decreasing the hyperparameter values makes the model more decisive—i.e., it makes ϕ and θ more specific (sparse). In particular, a lower α value results in fewer topics per document. Similarly, a lower β value results in fewer terms per topic, which generally increases the number of topics needed to describe a particular document. Griffiths and Steyvers (2004) attribute the granularity of the model to the value of β, because the more topics that are needed to describe each document, the finer the details that separate the documents in the corpus.

Griffiths and Steyvers (2004) empirically investigate the optimal K value for a natural language corpus but do not offer any general methodology for selecting the K value. Minka (2009) describes complex estimation methods for setting the hyperparameter values, but the influence of different document contents on these methods is not yet understood (Heinrich 2009). Thus, hyperparameter values typically are set according to the de facto standard heuristics: α = 50/K and β = 0.01 (Wei and Croft 2006; Heinrich 2009; Lukins et al. 2010) or β = 0.1 (Griffiths and Steyvers 2004; Savage et al. 2010; Rao and Kak 2011). The preceding methods and heuristics originate from a natural language (NL) context, but the lexicons and syntax of NL documents differ from those of source code documents, so the methods and heuristics may not be optimal for a software engineering context (Maskeri et al. 2008).
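As an illustration of how such a model might be built with these heuristics, the sketch below uses the gensim library rather than the R lda package used in this study; gensim's default inference is variational Bayes rather than Gibbs sampling, eta is gensim's name for β, and the toy corpus and parameter values are placeholders.

    from gensim import corpora, models

    # Placeholder corpus: each document is the list of terms produced by the preprocessor.
    documents = [["load", "project", "dialog"], ["search", "request", "stop"], ["parse", "token"]]

    dictionary = corpora.Dictionary(documents)
    bow_corpus = [dictionary.doc2bow(doc) for doc in documents]  # term-count vectors

    K = 2             # number of topics (kept tiny for the toy corpus)
    alpha = 50.0 / K  # de facto heuristic for the document-topic prior
    beta = 0.01       # de facto heuristic for the topic-term prior

    lda = models.LdaModel(bow_corpus, id2word=dictionary,
                          num_topics=K, alpha=alpha, eta=beta, passes=10)

    theta = [lda.get_document_topics(bow, minimum_probability=0.0) for bow in bow_corpus]
    phi = lda.get_topics()   # K x V matrix of term probabilities per topic
    print(theta[0])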

Exact inference for LDA is intractable (Blei et al. 2003), necessitating approximate inference algorithms. Examples of such algorithms include collapsed Gibbs sampling (CGS) (Andrieu et al. 2003; Heinrich 2009; Wei and Croft 2006), which is a special case of Markov-chain Monte Carlo (MCMC) simulation that directly estimates term assignments to topics based on their distribution in the corpus (Griffiths and Steyvers 2004). Gibbs sampling requires a parameter, σ, the number of sweeps to make over the entire corpus.

Given an LDA model (α, β, ϕ, and θ), we can infer the term-topic and topic-document distributions for a new document (Blei et al. 2003). GibbsLDA++1 is a CGS implementation of LDA that can perform such inference. Alternatively, we can create a term-document matrix in which each cell contains the probability that the document generates the term, and each column in the matrix represents a document. R lda2 is a CGS implementation that can create such a matrix. Given the terms for query Q and the term-document matrix, we can compute the conditional probability (P) of Q given a document D_i:
$$ \mathrm{Sim}(Q, D_i) = P(Q \mid D_i) = \prod_{q_k \in Q} P(q_k \mid D_i) $$
where q_k is the kth word in the query.
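A sketch of this classifier appears below; the term-document probability matrix is assumed to be derived from the fitted model (for example, as (θ·ϕ) transposed, with θ of size D×K and ϕ of size K×V), and the log-space sum is only a numerical safeguard against underflow, since it preserves the ranking induced by the product.

    import numpy as np

    def rank_documents(query_terms, prob, vocab_index, doc_names):
        """Rank documents by Sim(Q, D_i) = prod_k P(q_k | D_i).

        `prob` is a V x D matrix whose cell [v, d] holds P(term v | document d).
        Query terms missing from the vocabulary are ignored here; how to handle
        them is a design choice the text does not prescribe.
        """
        rows = [vocab_index[t] for t in query_terms if t in vocab_index]
        # Work in log space: the product of many small probabilities underflows quickly.
        log_sim = np.log(prob[rows, :] + 1e-12).sum(axis=0)
        order = np.argsort(-log_sim)                  # descending similarity
        return [(doc_names[d], float(log_sim[d])) for d in order]

    # Hypothetical usage, assuming prob, vocab_index, and method_names already exist:
    # ranking = rank_documents(["save", "project", "dialog"], prob, vocab_index, method_names)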

Researchers recently have employed LDA as a text retrieval tool in several areas of software engineering. Baldi et al. (2008) model concerns as latent topics and use LDA to mine aspects. Tian et al. (2009) use LDA to classify software systems into problem domain categories, and Liu et al. (2009) use LDA to model class cohesion as mixtures of latent topics. Asuncion et al. (2010) analyze captured traceability links using LDA, enabling artifact categorization and topical visualization. Oliveto et al. (2010) also apply LDA to traceability, using it to recover links between use cases and source code. Gethers and Poshyvanyk (2010) propose relational topic based coupling among classes (RTC), a coupling measure based on the RTM (Chang and Blei 2010), which is an extension of LDA. Lukins et al. (2008, 2010) apply LDA to feature location, and finally, Thomas et al. (2011) use LDA to model the evolution of topics in source code histories.

2.3 Feature Location

Feature location is a program comprehension activity in which a developer locates the source code entities that implement a functionality (i.e., a feature, Rajlich and Wilde 2002). Bug localization is sometimes used as a synonym for feature location (Lukins et al. 2008, 2010) when the features of interest are unwanted/faulty functionalities (i.e., bugs). Due to the large size of modern software systems, manual feature location is impractical. Static techniques for automatic feature location (e.g., Lukins et al. 2010) operate on source code, whereas dynamic ones (e.g., Eisenberg and Volder 2005) operate on execution traces. Blended techniques, such as PROMESIR (Poshyvanyk et al. 2007) and SITIR (Liu et al. 2007), incorporate both static and dynamic analyses. Unlike dynamic and blended techniques, static techniques require neither a working software system, nor a test case/suite that exercises the feature (or triggers the bug). The main focus of this paper, and thus of the following review, is on static techniques that use TR methods.

Dit et al. (2011b) provide a taxonomy and comprehensive survey of the literature on feature location.

In recent work Scanniello and Marcus (2011) combine VSM with clustering for static feature location. In addition to comparing their work to VSM alone, they investigate the effect of using three different queries on the performance of their approach, finding that the effect varies by subject system. Marcus et al. (2004) present the first LSI-based FLT. A recent extension by Gay et al. (2009) adds relevance feedback to that technique; users mark relevant results, and a new query is formulated automatically using that feedback. SNIAFL (Zhao et al. 2006), Dora (Hill et al. 2007), and LSICG (Shao et al. 2012) augment TR methods with call graph construction and analysis.

Liu et al. (2007) introduce SITIR, a blended interactive FLT in which a single scenario execution trace is filtered using LSI. First, a developer executes a single scenario that exercises the feature of interest and then formulates a query. Next, SITIR ranks each executed method using LSI to measure similarity to the query. Liu et al. compare SITIR to LSI alone (a static technique), to scenario-based probabilistic ranking (SPR) (a dynamic technique), and to PROMESIR (Poshyvanyk et al. 2007) (which combines LSI and SPR). The results indicate that PROMESIR and SITIR perform marginally better than LSI alone but much better than SPR. Though PROMESIR outperforms SITIR, the difference is small, and SITIR requires fewer scenarios. Liu et al. observe that SITIR is less sensitive to poor queries than is LSI alone.

Ratanotayanon et al. (2010) investigate the effects of using diverse sources of static data on feature location performance. In particular, they consider combinations of change sets, issue trackers and dependency graphs. Their results indicate that using diverse data is not always beneficial. In another investigation of the effects of data fusion on feature location, Revelle et al. (2010) extend SITIR by mining dependence information from the Web. Their case study results indicate that by augmenting an existing FLT with a web mining algorithm, a statistically significant performance improvement can be achieved.

Lukins et al. (2008, 2010) present a static FLT that uses LDA to identify methods affected by a bug. They use the results from Poshyvanyk et al. (2007) to directly compare their LDA-based technique to an LSI-based technique. The results indicate that for Eclipse, using the same queries, LDA outperforms LSI. Further, the results indicate that for Mozilla, LDA can outperform LSI, but that LDA is sensitive to poor queries (as is LSI, Liu et al. 2007). Lukins et al. also show that the LDA-based technique scales to 322 bugs across 25 versions of two Java systems and that its performance is not influenced by source code stability.

Dit et al. (2011a) investigate the effects of three identifier splitting algorithms on the accuracy of an LSI based FLT and of an FLT based on the combination of LSI and dynamic analysis. They apply the FLT to two open source Java systems and conclude that FLTs using TR techniques can benefit from better identifier splitting algorithms and that manual splitting outperforms state-of-the-art algorithms.

Like Poshyvanyk et al. (2007) and Liu et al. (2007), Lukins et al. (2008, 2010) tune their queries to obtain acceptable results. Thus, the performance of any of these techniques depends on developer experience in that the selection of appropriate terms for the query requires some knowledge of the feature or bug domain. However, unlike Poshyvanyk et al. and Liu et al., Lukins et al. define a query refinement process and follow this process to formulate at most three queries for each bug. They formulate the first query using only terms from the bug title. If a second query is necessary, they add terms from the bug description. If a third query is necessary, they remove unrelated terms or add any of the following: abbreviations (e.g., mgmt for management), abbreviation expansions (e.g., format for fmt), term variants (e.g., parse for parser), term synonyms (e.g., abort for exit), or sub-terms (e.g., name for rename).

3 Case Study

In this section we describe the design of a case study in which we measure the effects of different configurations on the accuracy of an LDA based FLT. We describe the case study using the Goal–Question–Metric approach (Basili et al. 1994). The data for the case study is available in this paper’s online appendix.3

3.1 Definition and Context

Our primary goals are to understand whether five factors interact and to understand whether and to what extent the five factors affect the performance of an LDA based FLT. The five factors of interest are the query (Query), the source code text to extract (Text), and three LDA parameter values (K, α, β). The quality focus of the study is on establishing the importance of proper configuration to attaining optimal performance from an LDA based FLT and on informing the configurations of five parameters. The perspective of the study is of a software developer performing a change task on a software system and using an LDA based FLT to identify a starting point (i.e., a method) from which to begin the change. The context of the study spans 618 features from 6 open source Java systems (ArgoUML, JabRef, jEdit, muCommander, Mylyn, Rhino).

3.1.1 Overview

We consider six factors in the case study, which has five parts. The design of Part 1 is listed in Table 1, the designs of Parts 2–4 are listed in Table 2, and the design of Part 5 is listed in Table 3. The first factor is Query, which is a categorical variable. The next factor is Text, which is a categorical variable that represents the text extractor configuration. The third factor is K, which is a ratio variable that represents the number of topics. The fourth and fifth factors, α and β, are ratio variables that represent the two smoothing hyperparameters. System is the sixth factor and is a categorical variable. We describe its categories in the next section.
Table 1  Case study design: part 1

Factor   Values
Query    Title, Description, Combined
Text     I, CL, ICL
K        100, 200, 500
α        0.5, 1.0, 50.0
β        0.01, 0.1, 0.5

Table 2  Case study design: parts 2–4

Part   Factor(s) of interest   Controlled factors
2      Query                   Text (ICL), K (100 or 200), α (50/K), β (0.01)
3      Text, K                 Query (Combined), α (50/K), β (0.01)
4      α, β                    Query (Combined), Text (ICL), K (100 or 200)

Table 3  Case study design: part 5

Configuration       Query, Text, K, α, β
Predicted ArgoUML   Combined, ICL, 500, 1.0, 0.1
Predicted jEdit     Combined, ICL, 400, 1.0, 0.25
Predicted JabRef    Combined, ICL, 400, 1.0, 0.25
Predicted Rhino     Combined, ICL, 300, 1.0, 0.5
Heuristic 1         Combined, ICL, 200, 0.25, 0.01
Heuristic 2         Combined, ICL, 200, 0.25, 0.1

In Part 1 of the case study we focus on the interactions among five factors (Query, Text, K, α, β). We use a full factorial design to permit the detection of interaction effects using factorial ANOVA. To ensure the feasibility of this part of the case study, we limit each of the five factors to three possible values (for a total of 243 distinct configurations).
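The full factorial design amounts to enumerating the Cartesian product of the five value sets from Table 1 (3^5 = 243 configurations), as in the illustrative snippet below.

    from itertools import product

    queries = ["Title", "Description", "Combined"]
    texts   = ["I", "CL", "ICL"]
    Ks      = [100, 200, 500]
    alphas  = [0.5, 1.0, 50.0]
    betas   = [0.01, 0.1, 0.5]

    configurations = list(product(queries, texts, Ks, alphas, betas))
    assert len(configurations) == 3 ** 5 == 243
    # Each tuple (query, text, K, alpha, beta) corresponds to one run of the LDA based FLT.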

We focus on a different factor (or pair of factors) in each of Parts 2–4, controlling System throughout. In Part 2 we focus on Query and control the other factors. Query has three categories (Title, Description, Combined). In Part 3 we focus on Text and K. Text has seven categories (I, C, L, IC, IL, CL, ICL), and K has four categories per subject system. Because the value of Text may affect the size of the corpus (i.e., the numbers of documents and terms), and because K should be proportional to the size of the corpus (Griffiths and Steyvers 2004), we vary these factors together. In Part 4 of the case study we focus on α and β. Like Asuncion et al. (2009) we vary these factors together, assigning six values to each factor (0.01, 0.1, 0.25, 0.5, 0.75, 1).

In Part 5 of the case study we apply the lessons learned in Parts 1–4 and compare our predicted best configurations for four systems (ArgoUML, jEdit, JabRef, and Rhino) to two generic configurations informed by heuristics from the literature.

3.1.2 Subject Software Systems

We chose the six subjects of our study—ArgoUML,4 JabRef,5 jEdit,6 muCommander,7 Mylyn,8 and Rhino9—because they vary in size and application domain, because they are similar to systems developed in industry, and because suitable benchmarks (in the context of our study) are available online (Dit et al. 2011b; Eaddy et al. 2008).

Table 4 lists four size metrics for each of the six subject systems: source lines of code (SLOC), comment lines of code (CLOC), Java file count, and method count. The table also lists the number of features that we study for each system. Further, the application domains of the systems are as follows. ArgoUML is a UML modeling tool, and JabRef is a bibliography reference manager. jEdit is a programmer’s text editor, and muCommander is a cross-platform file manager. Mylyn is an Eclipse plug-in that provides a task-focused interface for ALM, and Rhino is a JavaScript engine that provides a compiler, an interpreter, and a debugger.
Table 4  Subject software systems

System        Version   SLOC      CLOC      Files   Methods   Features
ArgoUML       0.22      117,649   104,037   1,407   11,348    91
JabRef        2.6b      74,350    25,927    579     5,323     38
jEdit         4.3       98,460    42,589    483     6,550     149
muCommander   0.8.5     76,649    68,367    1,069   8,811     90
Mylyn         1.0.1     99,310    23,503    936     9,067     93
Rhino         1.6R5     45,225    15,451    129     2,565     157
Total                   511,643   279,874   4,603   43,665    618

Each system is maintained by multiple developers who are obligated to follow coding standards set forth by project organizers. The source code for each system is stored in a change management system (CVS or Subversion), and the developers of each system use descriptive commit messages. Each system is accompanied by a test suite, and developers store bug reports in an issue tracking system (Bugzilla/SourceForge/Tigris).

3.1.3 Benchmarks

We study features that correspond to issues reported via Bugzilla or an equivalent issue tracking system. Most of the issue reports are requests to change an unwanted functionality (i.e., to remove a faulty feature), though some are requests to add a new functionality (i.e., to add a new feature). Two approaches are used to recover the set of methods modified to fix each bug or to add each functionality. Either the patches submitted to Bugzilla are used to recover the set of methods modified to address each issue (Corley et al. 2011), or the diffs stored in Subversion are used to recover the set of methods modified (Dit et al. 2011b; Eaddy et al. 2008). Like Poshyvanyk et al. (2007) and Dit et al. (2011b), we term this set of modified methods the “gold set” because methods modified to change a feature’s implementation are likely relevant to the feature. Consistent with the procedures of prior studies (Poshyvanyk et al. 2007; Liu et al. 2007; Lukins et al. 2008, 2010; Revelle et al. 2010), we use gold sets as the benchmarks by which we evaluate FLTs. The six gold sets were produced by other researchers and made available to the community.10,11

We study 618 features total. Specifically, we study the following numbers of features for each system: 91 for ArgoUML, 38 for JabRef, 149 for jEdit, 90 for muCommander, 93 for Mylyn, and 157 for Rhino. The selected features represent a subset of the available features for each benchmark. We exclude any available feature for which the bug report’s title or description is empty after the application of our preprocessing steps (i.e., splitting, normalizing, filtering, and stemming). The unique identifiers for the selected features are available in this paper’s online appendix.12

Due to the large number of features that we consider, and to eliminate potential bias, we automatically formulate three queries for each feature. Specifically, we use as the query the bug report’s title, its description, or its title and description combined (Dit et al. 2011b). This query formulation process is conservative, in that it does not rely on developer experience, and unbiased, in that it does not allow us to influence the results.
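The three queries for a feature can be formulated mechanically from its issue report, for example as in the sketch below; the issue-report fields and the preprocess function are assumptions standing in for the actual tracker data and the preprocessing chain of Section 2.1.2.

    def formulate_queries(issue, preprocess):
        """Build the Title, Description, and Combined queries for one issue report.

        `issue` is assumed to be a dict with 'title' and 'description' fields;
        `preprocess` applies the same splitting/normalizing/filtering/stemming
        used when indexing the corpus.
        """
        title = [term for tok in issue["title"].split() for term in preprocess(tok)]
        description = [term for tok in issue["description"].split() for term in preprocess(tok)]
        return {
            "Title": title,
            "Description": description,
            "Combined": title + description,
        }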

3.1.4 Effectiveness Measure

Though modifying or removing a functionality requires that the developer identify all entities to be changed, the goal of automatic feature location is to identify a single method from which the developer can begin the change (Rajlich 2006; Poshyvanyk et al. 2007; Lukins et al. 2010). Indeed, the results of a recent exploratory study (Revelle and Poshyvanyk 2009) indicate that finding one relevant method is the strength of automatic FLTs. Other methods associated with the feature can then be identified using impact analysis (Canfora and Cerulo 2006; Poshyvanyk et al. 2009; Gethers and Poshyvanyk 2010; Lukins et al. 2010; Beard et al. 2011).

Because static FLTs rank all methods in a system, recall and precision are not useful accuracy measures in this context. In particular, in the context of static FLTs, recall is always 1.0, and precision is always 1/n (where n is the number of methods). Thus, the rank of the first relevant method is used instead (Poshyvanyk et al. 2007; Liu et al. 2007; Lukins et al. 2008, 2010; Revelle et al. 2010). This measure, which Poshyvanyk et al. (2007) term the “effectiveness measure” for feature location, indicates the number of entities that the developer must examine (if following the ranking) before reaching a method that actually belongs to the feature. That is, the measure quantifies the number of false positives that a developer must examine. Thus, in this study we adopt the effectiveness measure as our accuracy measure.
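Computing the effectiveness measure for a single feature then reduces to finding the best-ranked gold set method, as in this minimal sketch (the method-name representation is an assumption):

    def effectiveness(ranked_methods, gold_set):
        """Return the 1-based rank of the first relevant method, or None if none appears.

        `ranked_methods` is the FLT's ranked list of method names (best first);
        `gold_set` is the set of methods actually modified for the feature.
        """
        for rank, method in enumerate(ranked_methods, start=1):
            if method in gold_set:
                return rank
        return None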

3.1.5 Setting

To conduct the study, we instantiate the process illustrated in Fig. 1.

We implemented our document extractor in Python v2.6 using ANTLR v3 and an open source Java 1.5 grammar.13 We extract documents at the method level of granularity using a term count weighting scheme. We consider every method to be distinct. That is, if method bar is nested within method foo,14 each method is considered separately, and the text for method bar is not considered to be part of the text for method foo. We associate any comment that is contained in a method with that method. Further, like Fluri et al. (2007), we associate any block comment (or series of line comments) that precedes a method with that method.

Our document/query preprocessor implements the four transformations described in Section 2.1.2. We filter java.lang class names before splitting tokens. We split tokens based on camel case, underscores, and non-letters. After splitting tokens we retain the original token. We normalize to lower case before filtering English stop words (Fox 1992), Java keywords, and terms shorter than three characters. We apply a Porter stemmer15 to retained terms.

For the second and fourth parts of the case study, we set K = 100 for Rhino and K = 200 for the other five systems. In previous studies, researchers have set K = 100 (Lukins et al. 2008, 2010) or K = 125 (Gethers and Poshyvanyk 2010) for Rhino, and we adopt the former value because our task and setting are similar to those of the former studies. We heuristically set K = 200 for the other five systems based on their sizes. As our study is concerned primarily with relative performance, not absolute performance, it is not critical that we choose the optimal K for each system.

We use R lda v1.2.3 to compute and query LDA models. Because R lda implements a CGS algorithm, we must set σ, the number of sweeps to make over the entire corpus. We set σ = 500, which provides a balance between execution efficiency and model convergence. Our classifier is conditional probability (P), which we describe in Section 2.2.

3.2 Hypotheses

We describe the hypotheses for each part of the case study.

3.2.1 Hypotheses for Part 1

For the five factors (Query, Text, K, α, β), we test all two-way interactions, all three-way interactions, all four-way interactions, and the five-way interaction for statistical significance.

For the 10 two-way interactions, each null hypothesis is of the form:
  • H 0 : μ + ν  There is no interaction between factors μ and ν.

Further, each alternative hypothesis is of the form:
  • H A : μ * ν  There is interaction between factors μ and ν.

For example:
  • H 0 : Query + Text  There is no interaction between factors Query and Text.

The remaining 9 null hypotheses are analogous. We tested these hypotheses using the effectiveness measure.

If we can reject a null hypothesis with high confidence (α = 0.05), we accept an alternative hypothesis stating that the two factors interact. For example, the alternative hypothesis corresponding to the example null hypothesis is:
  • H A : Query * Text There is interaction between factors Query and Text.

The remaining 9 alternative hypotheses are analogous.

The null and alternative hypotheses for the 10 three-way interactions, the 5 four-way interactions, and the five-way interaction are formulated similarly.

3.2.2 Hypotheses for Parts 2–5

We compare a large number of configurations, and we make no presupposition about the direction of the difference between any two configurations. Thus, all of our hypotheses are two-sided. In particular, each null hypothesis is of the form:
  • H 0 : μ = ν  Configuration μ does not significantly affect the accuracy of the LDA based FLT compared to configuration ν.

Further, each alternative hypothesis is of the form:
  • H A : μ ≠ ν  Configuration μ does significantly affect the accuracy of the LDA based FLT compared to configuration ν.

In the second part of the case study we focus on Query and control the other factors. Query is a categorical variable with three categories (Title, Description, Combined), so we formed three null hypotheses to test whether LDA based feature location produces different results when using different queries. In particular, for each null hypothesis, μ and ν are distinct query types in {Title, Description, Combined}. For example:
  • H 0 : Title = Description  Title does not significantly affect the accuracy of the LDA based FLT compared to Description.

The remaining two null hypotheses are analogous. We tested these hypotheses using the effectiveness measure.
If we can reject a null hypothesis with high confidence (α = 0.05), we accept a two-sided alternative hypothesis stating that a query type has an effect on the ranking of the first relevant method compared to another query type. For example, the alternative hypothesis corresponding to the example null hypothesis is:
  • H A : Title ≠ Description  Title does significantly affect the accuracy of the LDA based FLT compared to Description.

The remaining two alternative hypotheses are analogous.
In the third part of the case study we focus on Text and K and control the other factors. Text is a categorical variable with seven categories (I, C, L, IC, IL, CL, ICL), and K is a ratio variable with four values, so we formed 378 null hypotheses to test whether LDA based feature location produces different results when using different text and different numbers of topics. In particular, for each null hypothesis, μ and ν are distinct pairs in {I, C, L, IC, IL, CL, ICL} ×{75, 100, 150, 200}. For example:
  • H 0 : (I,75) = (C,100)  (I,75) does not significantly affect the accuracy of the LDA based FLT compared to (C,100).

The remaining 377 null hypotheses are analogous. We tested these hypotheses using the effectiveness measure.
If we can reject a null hypothesis with high confidence (α = 0.05), we accept a two-sided alternative hypothesis stating that the text/topics pair has an effect on the ranking of the first relevant method compared to another text/topics pair. For example, the alternative hypothesis corresponding to the example null hypothesis is:
  • H A : (I,75) ≠ (C,100)  (I,75) does significantly affect the accuracy of the LDA based FLT compared to (C,100).

The remaining 377 alternative hypotheses are analogous.
In the fourth part of the case study we focus on α and β and control the other factors. The hyperparameters α and β are ratio variables with six values each (0.01, 0.1, 0.25, 0.5, 0.75, 1), so we formed 630 null hypotheses to test whether LDA based feature location produces different results when using different values of α and β. In particular, for each null hypothesis, μ and ν are distinct pairs in {0.01, 0.1, 0.25, 0.5, 0.75, 1} ×{0.01, 0.1, 0.25, 0.5, 0.75, 1}. For example:
  • H 0 : (0.5,0.01) = (0.25,0.1)  (0.5,0.01) does not significantly affect the accuracy of the LDA based FLT compared to (0.25,0.1).

The remaining 629 null hypotheses are analogous. We tested these hypotheses using the effectiveness measure.
If we can reject a null hypothesis with high confidence (α = 0.05), we accept a two-sided alternative hypothesis stating that the α/β pair has an effect on the ranking of the first relevant method compared to another α/β pair. For example, the alternative hypothesis corresponding to the example null hypothesis is:
  • H A : (0.5,0.01) ≠ (0.25,0.1)  (0.5,0.01) does significantly affect the accuracy of the LDA based FLT compared to (0.25,0.1).

The remaining 629 alternative hypotheses are analogous.
In the fifth part of the case study we compare our predicted best configurations for four systems (ArgoUML, jEdit, JabRef, and Rhino) to two generic configurations informed by heuristics from the literature. In particular, for each null hypothesis, μ and ν are distinct pairs in {PredictedArgoUML, PredictedjEdit, PredictedJabRef, PredictedRhino} × {Heuristic1, Heuristic2}. For example:
  • H 0 : Predicted ArgoUML = Heuristic 1  Predicted ArgoUML does not significantly affect the accuracy of the LDA based FLT compared to Heuristic 1.

The remaining 7 null hypotheses are analogous.
If we can reject a null hypothesis with high confidence (α = 0.05), we accept a two-sided alternative hypothesis stating that the predicted configuration has an effect on the ranking of the first relevant method compared to the heuristic configuration. For example, the alternative hypothesis corresponding to the example null hypothesis is:
  • H A : Predicted ArgoUML ≠ Heuristic 1  Predicted ArgoUML does significantly affect the accuracy of the LDA based FLT compared to Heuristic 1.

The remaining 7 alternative hypotheses are analogous.

3.3 Data Collection and Analysis

We collected two kinds of data for this case study. First, we collected size metrics for the corpora. In particular, we built a corpus using each of the seven text extractor configurations, and we collected three size metrics for each corpus:

  • Terms: the number of unique terms
  • Uses: the total number of term uses (i.e., instances)
  • Docs: the number of non-empty documents

Table 5 lists the size metrics. Note that the Docs values for some corpora are less than the numbers of methods in the system. For example, Rhino’s L corpus contains 522 non-empty documents, whereas Rhino contains 2,565 methods. This is because some Rhino methods contain no string literals—documents for such methods are empty.
Table 5  Corpora size metrics

Metric    I         C         L         IC        IL        CL        ICL

(a) ArgoUML
  Terms   11,065    5,866     1,565     12,735    11,549    6,283     13,033
  Uses    326,417   144,162   13,357    470,579   339,774   157,519   483,936
  Docs    11,348    10,781    2,352     11,348    11,348    10,853    11,348

(b) JabRef
  Terms   8,081     3,725     2,411     9,476     9,426     4,969     10,500
  Uses    223,638   44,780    23,643    268,418   247,281   68,423    292,061
  Docs    5,323     2,151     1,760     5,323     5,323     2,983     5,323

(c) jEdit
  Terms   8,749     4,162     1,714     9,861     9,159     4,788     10,150
  Uses    259,082   59,208    13,842    318,290   272,924   73,050    332,132
  Docs    6,549     4,311     1,519     6,550     6,549     4,737     6,550

(d) muCommander
  Terms   10,491    4,552     1,624     11,943    11,222    5,289     12,556
  Uses    273,507   122,604   7,932     396,111   281,439   130,536   404,043
  Docs    8,811     4,393     923       8,811     8,811     4,624     8,811

(e) Mylyn
  Terms   11,591    3,210     1,309     12,514    11,988    3,724     12,818
  Uses    396,570   33,553    16,810    430,123   413,380   50,363    446,933
  Docs    9,067     2,348     1,217     9,067     9,067     3,117     9,067

(f) Rhino
  Terms   5,448     2,606     1,192     6,521     5,713     3,009     6,705
  Uses    128,472   29,688    6,017     158,160   134,489   35,705    164,177
  Docs    2,565     1,235     522       2,565     2,565     1,434     2,565

We also collected the effectiveness measure, which is the primary data of interest in our case study. For each feature and configuration, we collected one effectiveness measure for each query. We then analyzed the data for each system/configuration pair to determine the minimum, maximum, median, and the percentage of features for which the FLT failed, a phenomenon that we describe in the next paragraph. The FLT fails only in the third part of the case study, in which we study different text extractor configurations, and in the first part of the case study, in which we study factor interactions.

For each feature/configuration pair, the FLT assigns a rank in the range [1,n], where n is the number of methods, to each document in the corpus. That is, the range of the effectiveness measure is [1,n]. However, for some configurations the number of non-empty documents, m, is less than n. In such a case, the rank assigned to each empty document is implementation defined—a valid implementation may assign an empty document (which may represent a method in the gold set) any rank in the range [m + 1,n]. Thus, in such cases, an effectiveness measure in the range [m + 1,n] is meaningless in that it is not based on the similarity between the query (i.e., description of the feature) and the document. So, for a system/configuration pair, if the effectiveness measure for a feature is in the range [m + 1,n] then we consider the FLT to have failed, and we report this.
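The failure check can be expressed directly, as in the sketch below, where m is the number of non-empty documents and n is the number of methods; the numeric example reuses Rhino's L corpus from Table 5.

    def classify_result(first_relevant_rank, non_empty_docs, total_methods):
        """Flag an effectiveness measure as a failure if it falls in the
        implementation-defined range [m + 1, n] described above."""
        m, n = non_empty_docs, total_methods
        assert 1 <= first_relevant_rank <= n
        return "failed" if first_relevant_rank > m else "valid"

    # Example: Rhino's L corpus has m = 522 non-empty documents out of n = 2,565 methods,
    # so any measure above 522 is counted as a failure for that configuration.
    print(classify_result(600, 522, 2565))   # 'failed'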

In our initial data analysis, we use the Kruskal–Wallis test, a non-parametric test similar to the (parametric) one-way ANOVA test. If a Kruskal–Wallis test reveals a significant effect, we conduct a post-hoc test using pairwise Mann–Whitney tests with Holm correction. Further, if a Mann–Whitney test reveals a significant difference in accuracy between two configurations, we compute the effect size (r = Z/√N, where N is the total number of samples). We use the following (standard) interpretations of the effect size, r: negligible for |r| < 0.1, small for 0.1 ≤ |r| < 0.3, medium for 0.3 ≤ |r| < 0.5, and large for |r| ≥ 0.5.
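A sketch of this analysis pipeline using SciPy and statsmodels follows; the normal approximation used to obtain Z from the Mann–Whitney U statistic is not spelled out in the text, so it should be read as one reasonable implementation choice rather than the authors' exact procedure.

    from itertools import combinations
    import math
    from scipy.stats import kruskal, mannwhitneyu
    from statsmodels.stats.multitest import multipletests

    def analyze(groups, alpha=0.05):
        """Kruskal-Wallis across configurations, then pairwise Mann-Whitney tests
        with Holm correction and effect sizes r = Z / sqrt(N).

        `groups` maps a configuration name to its list of effectiveness measures.
        """
        h, p = kruskal(*groups.values())
        if p >= alpha:
            return {"kruskal_p": p, "pairwise": []}

        pairs, pvals, effects = list(combinations(groups, 2)), [], []
        for a, b in pairs:
            x, y = groups[a], groups[b]
            u, p_ab = mannwhitneyu(x, y, alternative="two-sided")
            n1, n2 = len(x), len(y)
            # Normal approximation of U to obtain Z, then r = |Z| / sqrt(N).
            z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
            pvals.append(p_ab)
            effects.append(abs(z) / math.sqrt(n1 + n2))
        reject, p_holm, _, _ = multipletests(pvals, alpha=alpha, method="holm")
        return {"kruskal_p": p,
                "pairwise": [(pair, p_c, r, sig)
                             for pair, p_c, r, sig in zip(pairs, p_holm, effects, reject)]}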

4 Results

In this section we report the results for each part of our case study. We conclude this section with threats to validity.

4.1 Part 1: Testing for Interactions Among Factors

In Table 6 we list the results of a factorial ANOVA applied to the effectiveness measures for the 243 configurations applied to 618 features. We list all main effects but only statistically significant interaction effects. Four of the five factors have significant main effects, which is consistent with the results of Parts 2–4 of the case study. Only α does not have a significant main effect. There are four significant two-way interactions: (1) Text and K, (2) K and α, (3) Text and β, and (4) α and β. We anticipated the first and fourth interactions, and we accounted for them in the designs of Parts 3 and 4 of the case study. In particular, Griffiths and Steyvers (2004) identified the first interaction, stating that K should be proportional to the size of the corpus (which varies with Text). Further, the second and fourth interactions are inherent in LDA (Blei et al. 2003). Though there is an interaction between Text and β, any adjustment must be to the β parameter. This is because ICL is the only value of Text for which there is good performance with no failures (see Part 3 of the case study). There is one significant three-way interaction, between K, α, and β. Again, this interaction is inherent in LDA. There are no significant four-way or five-way interactions.
Table 6  Results of a factorial ANOVA

Factor   F value   p value
Query    79.3594   < 0.001
Text     57.6865   < 0.001
K        22.7080   < 0.001
α        2.8419    0.06
β        16.0175   < 0.001
Text:K   3.2958    0.01
K:α      3.4374    < 0.01
Text:β   26.8548   < 0.001
α:β      26.7096   < 0.001
K:α:β    1.9650    < 0.05

4.2 Part 2: Configuring the Query

We first present a statistical analysis of the results and then provide discussion of the results.

4.2.1 Statistical Analysis

In Fig. 2 we illustrate box plots which represent statistics describing the effectiveness measures for the test data. Recall that a small effectiveness measure is better than a larger one (i.e., rank 1 is better than rank 1,000). Several of the maximum effectiveness measures are beyond the scales of the diagrams. However, we chose the scales to highlight the (small) differences between the medians. Further, we omit outliers for readability.
Fig. 2  The effectiveness measure for three configurations (Title, Description, and Combined) of the LDA-based FLT applied to 91 ArgoUML features, 38 JabRef features, 149 jEdit features, 90 muCommander features, 93 Mylyn features, 157 Rhino features, and all 618 features

The box plots show that there is no consistent pattern across all systems, except that for each system there is relatively little difference between the medians of the three configurations. Description generally has the worst performance, though for ArgoUML and Rhino, the medians for Description are smaller (better) than those of Title. Similarly, Combined generally has the best performance, though for JabRef the median for Combined is larger (worse) than that of Title and for jEdit the median for Combined is equal to that of Title.

For each system we conducted a Kruskal–Wallis test to determine whether the Query factor has a significant effect on the accuracy of the LDA based FLT. The test revealed a significant effect only for ArgoUML (χ 2(2) = 11.74, p < 0.003), and a post-hoc test for ArgoUML using Mann–Whitney tests with Holm correction showed small differences between Description and Title (p < 0.008, |r| = 0.20) and between Combined and Title (p < 0.002, |r| = 0.24). Based on our statistical results, for ArgoUML we can reject the null hypotheses which state that Description and Combined do not significantly affect the accuracy of the LDA based FLT compared to Title, and we can instead accept the corresponding alternative hypotheses. However, the effect sizes are small in practice.

We also conducted a Kruskal–Wallis test on all 618 features, and the test revealed a significant effect (χ 2(2) = 7.67, p < 0.03). The post-hoc test using Mann–Whitney tests with Holm correction showed a negligible difference between Combined and Title (p < 0.006, |r| = 0.08). Based on our statistical results, when the data from all six subject systems are considered together, we can reject the null hypothesis which states that Combined does not significantly affect the accuracy of the LDA based FLT compared to Title, and we can instead accept the corresponding alternative hypothesis. However, the effect size is negligible in practice.

Across all systems, Combined outperforms Title, and though we did not find a statistically significant relationship between them, Title generally outperforms Description. This is an interesting finding, because it indicates that length alone does not explain the effectiveness of a query. That is, if shorter queries provided the best performance, we would expect Title to outperform both Description and Combined. Similarly, if longer queries provided the best performance, we would expect both Description and Combined to outperform Title. Instead, the results indicate that Title and Description complement each other, as their combination outperforms either of them in isolation. Based on the results of this part of our study, we recommend using Combined (i.e., the combination of the title and the description) when automatically formulating a query string for LDA based feature location.

4.2.2 Discussion

Our statistical analysis revealed that the Query factor has a significant effect on the accuracy of the LDA based FLT. We now present qualitative analysis of the results.

Figure 2g shows that the three queries produce similar results overall. However, it does not help us to understand the relative performance of the different queries on the same feature. That is, for a given feature, we cannot determine from the figure whether the three queries provide similar performance. So, we investigated how often all three queries return the same effectiveness measure and found that this happened only 3 % of the time (20 of 618 times). We also found that for 17 of those 20 features, all three queries returned 1, the best possible effectiveness measure. We then investigated how often all three queries return effectiveness measures within 10 ranks of each other (17 % of the time or 102 of 618 times), within 50 ranks of each other (32 % of the time or 198 of 618 times), and within 100 ranks of each other (40 % of the time or 248 of 618 times). Note that at least one of the queries performs noticeably worse than the others 83 % of the time.

In the following paragraphs, we highlight a number of features and discuss the performance of the LDA based FLT given different queries for each of these features.

Consider feature 4019 for ArgoUML. The three queries are shown in Table 7. For each of the three queries, the LDA based FLT returned 1. In particular, each of the queries returned ProjectBrowser.loadProject, a method from the gold set, first in the list of results. Upon inspection of the queries, we note that the Title and the Description are of similar length and that four of the five words in Title also appear in Description. Thus, it is probably not surprising that the queries performed similarly.
Table 7  Queries for ArgoUML, feature 4019

Query         Contents
Title         save project dialog rememb load
Description   save project dialog assum save filenam last project save project load
Combined      Title + Description

Next consider feature 2842444 for jEdit. Table 8 lists the three queries. For each of the three queries, the LDA based FLT returned 1. However, each of the queries returned a different method (from the gold set) first in the list of results. Moreover, the three different methods come from two distinct classes—HyperSearchRequest and HyperSearchResults. Upon inspection of the queries, we note that the Title and the Description have noticeably different lengths. In particular, Title has 7 words, whereas Description has 51 words (which include 5 of the 7 words from Title). This feature is an example of a finding that we described in the previous section. In particular, this feature demonstrates that length alone does not explain the effectiveness of a query.
Table 8  Queries for jEdit, feature 2842444

Query         Contents
Title         manual stop hypersearch hyper search oper request
Description   maximum result option prompt stop continu hypersearch hyper search nice manual stop button purpos patch stop button ad us default set icon hypersearch hyper search pane highlight multi result button disabl search current activ enabl otherwis stop basic mimick max result mechan except temporari properti indic stop button click handl fine
Combined      Title + Description

Consider feature 311 for muCommander. The three queries are shown in Table 9. For the Title query the LDA based FLT returned 5, for the Description query it returned 3, and for the Combined query it returned 1. Next consider feature 352319 for Rhino. Table 10 lists the three queries. For the Title query the LDA based FLT returned 62, for the Description query it returned 3, and for the Combined query it returned 1. These features are examples of another finding that we described in the previous section. In particular, these features demonstrate that Title and Description can complement each other to provide improved performance when used together as Combined. This observation is consistent with the way LDA models documents—unique or rare (within the corpus) word co-occurrences are key to differentiating documents from the corpus.
Table 9  Queries for muCommander, feature 311

Query         Contents
Title         free space indic flip
Description   free space indic look compar what normal normal bar amount space drive fill bar fill drive cool bar gradual chang color space drive orang color red color
Combined      Title + Description

Table 10
Queries for Rhino, feature 352319

Query        Contents
Title        cant restart continu catch block
Description  attempt restart continu captur catch block except thrown function enteractivationfunct enter activ function scriptruntim script runtim java line nativecal nativ call call nativecal nativ call activ invok interpret java line scriptruntim script runtim enteractivationfunct enter activ function frame scope throw frame scope nativewith nativ creat catch block instead nativecal nativ call attach file testcontinu test continu java testcontinu test continu jointli reproduc note wont root caus run itll trigger interpret except handler thatll run npe continu stack wasnt restor properli root caus guess rewrit interpret java line doesnt specifi frame scope instead walk parent chain nativecal nativ call help chang signatur scriptruntim script runtim enteractivationfunct enter activ function accept nativecal nativ call instead gener scriptabl help enforc practic
Combined     Title + Description
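The queries above were formulated automatically from issue report titles and descriptions and then preprocessed (token splitting, stop word removal, stemming). The following is a minimal sketch of that style of query construction; the splitting regex, the stop word list, and the use of the Porter stemmer are assumptions that approximate, rather than reproduce, our preprocessing.

    import re
    from nltk.stem import PorterStemmer  # assumes the NLTK package is available

    STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "that"}  # illustrative subset
    stemmer = PorterStemmer()

    def preprocess(text):
        """Split camelCase tokens, keep the original token plus its parts, drop stop words, and stem."""
        terms = []
        for token in re.findall(r"[A-Za-z]+", text):
            parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", token)
            words = [token.lower()]
            if len(parts) > 1:            # e.g. "HyperSearch" -> hypersearch hyper search
                words += [p.lower() for p in parts]
            terms += [stemmer.stem(w) for w in words if w not in STOP_WORDS]
        return terms

    def build_queries(title, description):
        """Return the Title, Description, and Combined query variants."""
        return {
            "Title": preprocess(title),
            "Description": preprocess(description),
            "Combined": preprocess(title) + preprocess(description),
        }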

We also investigated how often all three queries returned effectiveness measures greater than 100 and found that this happened 28 % of the time (174 of 618 times). The most common causes of such performance include queries that contain misleading words and gold set methods that are short or contain only words that are common in the corpus. For example, consider method BugzillaAttachmentHandler.uploadAttachment, which is listed in Fig. 3. This method is in the gold set for feature 151257 for Mylyn. Indeed, this method is the only member of the gold set for feature 151257. The LDA based FLT performs poorly for this feature, because BugzillaAttachmentHandler.uploadAttachment is relatively short and because its words often co-occur in the corpus.
Fig. 3

The source code for the Mylyn method BugzillaAttachmentHandler.uploadAttachment

4.3 Part 3: Configuring the Text Extractor and K

We first present a statistical analysis of the results and then provide discussion of the results.

4.3.1 Statistical Analysis

In Table 11 we list statistics describing the effectiveness measures for the 28 configurations applied to all 618 features. In particular, for each configuration we list the minimum (best) rank, the maximum (worst) rank, the median rank, and the percentage of times that the configuration failed (see Section 3.3). In Table 11, all of a configuration’s failures were included in the data and were assigned a rank equal to the number of methods in the particular system. For example, a failure for ArgoUML was assigned the rank 11,348 (the number of methods in ArgoUML), whereas a failure for Mylyn was assigned the rank 9,067 (the number of methods in Mylyn). No configuration that includes identifiers fails, whereas configurations that exclude identifiers fail from 5 % to 28 % of the time. Specifically, the CL configurations fail only about 5 % of the time (for 30 of the 618 features), whereas the L configurations fail about 28 % of the time (for 172 of the 618 features).
Table 11
The effectiveness measure for 28 configurations (Text/K pairs) of the LDA based FLT applied to all 618 features

ID  (Text, K)   Min  Max     Median  % Failed
 1  (I,75)        1  10,206    70.5   0
 2  (I,100)       1  10,151    70     0
 3  (I,150)       1  10,152    57.5   0
 4  (I,200)       1  10,141    54     0
 5  (C,75)        1   9,101    58     8
 6  (C,100)       1   8,948    38     8
 7  (C,150)       1  10,093    37     8
 8  (C,200)       1  10,159    31     8
 9  (L,75)        1  11,348   145.5  28
10  (L,100)       1  11,348   122    28
11  (L,150)       1  11,348   133    28
12  (L,200)       1  11,348   138.5  28
13  (IC,75)       1  11,090    57     0
14  (IC,100)      1  11,076    48.5   0
15  (IC,150)      1  11,154    42.5   0
16  (IC,200)      1  10,884    46     0
17  (IL,75)       1  10,537    69     0
18  (IL,100)      1  10,547    60.5   0
19  (IL,150)      1  10,277    43     0
20  (IL,200)      1  10,332    52     0
21  (CL,75)       1   9,469    49     5
22  (CL,100)      1  10,244    34     5
23  (CL,150)      1   9,826    31     5
24  (CL,200)      1  10,481    29.5   5
25  (ICL,75)      1  10,673    48.5   0
26  (ICL,100)     1  10,632    51     0
27  (ICL,150)     1  10,719    37     0
28  (ICL,200)     1  10,772    32.5   0
Table 11 highlights a surprising result. Configurations 8, 23 and 24—(C,200), (CL,150) and (CL,200), respectively—have the lowest median ranks among the 28 configurations, even though these configurations each fail 5 % to 8 % of the time and are penalized harshly for each failure. The configuration with the lowest median rank and no failures is configuration 28—(ICL,200)—with a median rank of 32.5. We conclude that failures may be an acceptable trade-off for the concomitant gain in accuracy, particularly if the LDA based FLT is to be combined with another static or dynamic analysis (to form a hybrid technique).

We conducted a Kruskal–Wallis test on all 618 features. In the test, all of a configuration’s failures were included in the data and were assigned a rank equal to the number of methods in the particular system. Thus, this test skewed the results in favor of the configurations with the fewest failures.
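To make this data preparation concrete, a minimal sketch of the failure penalty is shown below; the function and variable names are ours and purely illustrative.

    def penalized_ranks(ranks, num_methods):
        """Replace failures (recorded here as None) with the number of methods in the system."""
        # A failed feature receives the worst possible rank for its system,
        # e.g. 11,348 for ArgoUML or 9,067 for Mylyn.
        return [num_methods if rank is None else rank for rank in ranks]

    # Example: one failure among four hypothetical ArgoUML features.
    argouml = penalized_ranks([1, 54, None, 230], num_methods=11348)  # -> [1, 54, 11348, 230]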

The Kruskal–Wallis test (failures assigned maximum rank) revealed a significant effect (χ²(27) = 215.37, p < 0.001). Based on the post-hoc test using Mann–Whitney tests with Holm correction, we can reject 89 (of 378) null hypotheses. Effect sizes range from negligible to small. In particular, the largest effect size (between configuration 9 and 24) is |r| = 0.20.
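This analysis can be reproduced with standard statistical libraries; the following is a minimal sketch using SciPy and statsmodels, where ranks_by_config is an assumed dictionary mapping each configuration to its 618 effectiveness measures (with failures already penalized as described above).

    from itertools import combinations
    from scipy.stats import kruskal, mannwhitneyu
    from statsmodels.stats.multitest import multipletests

    def omnibus_and_posthoc(ranks_by_config, alpha=0.05):
        """Kruskal-Wallis omnibus test followed by pairwise Mann-Whitney tests with Holm correction."""
        h_stat, p_omnibus = kruskal(*ranks_by_config.values())

        pairs, raw_p = [], []
        for a, b in combinations(sorted(ranks_by_config), 2):
            _, p = mannwhitneyu(ranks_by_config[a], ranks_by_config[b], alternative="two-sided")
            pairs.append((a, b))
            raw_p.append(p)

        # Holm correction controls the family-wise error rate across all pairwise comparisons.
        reject, corrected_p, _, _ = multipletests(raw_p, alpha=alpha, method="holm")
        return h_stat, p_omnibus, list(zip(pairs, corrected_p, reject))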

4.3.2 Discussion

Our statistical analysis revealed that the Text and K factors have a significant effect on the accuracy of the LDA based FLT. We now present a qualitative analysis of the results.

The results in Table 11 demonstrate that comments and literals play an important role in the performance of the LDA based FLT. That is, using identifiers only (I)—like Zhao et al. (2006)—for the LDA based FLT would result in reduced performance. Comments often provide a rich set of terms related to the problem domain, and literals often map directly to error messages or other aspects of the user interface that are mentioned in issue reports. On the other hand, identifiers often reflect the solution domain.
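For illustration only, the following sketch approximates the seven text extractor configurations with regular expressions; the extractor described in Section 2.1.2 parses Java source, so treat this as a simplification rather than our implementation.

    import re

    COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.S)   # line and block comments
    STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"')           # string literals
    IDENT_RE = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")   # identifiers (and keywords)

    def extract(source, config="ICL"):
        """Extract text from Java source for one of the seven configurations (any mix of I, C, L)."""
        comments = COMMENT_RE.findall(source)
        literals = STRING_RE.findall(source)
        # Strip comments and literals before collecting identifiers so their words
        # are not counted twice.
        code_only = STRING_RE.sub(" ", COMMENT_RE.sub(" ", source))
        identifiers = IDENT_RE.findall(code_only)

        text = []
        if "I" in config:
            text += identifiers
        if "C" in config:
            text += comments
        if "L" in config:
            text += literals
        return text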

The configurations that provide the best performance are those where Text is C, CL, or ICL and where K is 200. Though (CL,200) provides the best absolute performance, it is subject to a 5 % failure rate. Moreover, because the corpora grow substantially when adding identifiers—that is, when switching from CL to ICL—we tested whether increasing K further (e.g., to 300) would permit ICL to provide the best performance. Indeed, when we increased K from 200 to 300, ICL instead provided the best performance. We explore this finding more in Part 5 of the case study.

4.4 Part 4: Configuring α and β

We first present a statistical analysis of the results and then provide discussion of the results.

4.4.1 Statistical Analysis

In Fig. 4 we illustrate box plots which represent statistics describing the effectiveness measures for the 36 configurations applied to all 618 features. We observe two interesting patterns. Among each group of six configurations that share the same α value, the configuration with β = 0.01 (the smallest value for β) performs the worst. That is, configurations 1, 7, 13, 19, 25, and 31 perform worst in their respective groups. This finding is interesting because β = 0.01 is a heuristic commonly used in the literature (Heinrich 2009; Wei and Croft 2006; Lukins et al. 2008; Lukins et al. 2010) and is the default value for β in the Mallet toolkit. Similarly, among each group of six configurations that share the same α value, the configuration with β = 0.1 (the second smallest value for β) performs either second- or third-worst. That is, configurations 2, 8, 14, 20, 26, and 32 perform second- or third-worst in their respective groups. Again, β = 0.1 is a heuristic commonly used in the literature (Griffiths and Steyvers 2004; Savage et al. 2010; Rao and Kak 2011) and is the default value for β in GibbsLDA++.
Fig. 4

The effectiveness measure for 36 configurations (α/β pairs) of the LDA based FLT applied to all 618 features

Table 12
Key for Section 4.4

 1 (0.01,0.01)    2 (0.01,0.10)    3 (0.01,0.25)    4 (0.01,0.50)    5 (0.01,0.75)    6 (0.01,1.00)
 7 (0.10,0.01)    8 (0.10,0.10)    9 (0.10,0.25)   10 (0.10,0.50)   11 (0.10,0.75)   12 (0.10,1.00)
13 (0.25,0.01)   14 (0.25,0.10)   15 (0.25,0.25)   16 (0.25,0.50)   17 (0.25,0.75)   18 (0.25,1.00)
19 (0.50,0.01)   20 (0.50,0.10)   21 (0.50,0.25)   22 (0.50,0.50)   23 (0.50,0.75)   24 (0.50,1.00)
25 (0.75,0.01)   26 (0.75,0.10)   27 (0.75,0.25)   28 (0.75,0.50)   29 (0.75,0.75)   30 (0.75,1.00)
31 (1.00,0.01)   32 (1.00,0.10)   33 (1.00,0.25)   34 (1.00,0.50)   35 (1.00,0.75)   36 (1.00,1.00)

Each table entry provides an index for an α/β pair
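For readers who want to reproduce this grid, the following is a minimal sketch that trains one model per α/β pair using the gensim library, whose eta parameter corresponds to β; the corpus construction shown here is an assumption, and our own models were not built with gensim, so this is only an approximation of the setup.

    from itertools import product
    from gensim import corpora, models

    ALPHAS = [0.01, 0.10, 0.25, 0.50, 0.75, 1.00]
    BETAS = [0.01, 0.10, 0.25, 0.50, 0.75, 1.00]

    def sweep_hyperparameters(documents, num_topics=200):
        """Train one LDA model per (alpha, beta) pair, indexed 1-36 as in Table 12."""
        dictionary = corpora.Dictionary(documents)        # documents: list of token lists
        bow = [dictionary.doc2bow(doc) for doc in documents]

        lda_by_index = {}
        for index, (alpha, beta) in enumerate(product(ALPHAS, BETAS), start=1):
            lda_by_index[index] = models.LdaModel(
                corpus=bow, id2word=dictionary, num_topics=num_topics,
                alpha=alpha, eta=beta, random_state=0)
        return lda_by_index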

For each system we conducted a Kruskal–Wallis test to determine whether the α and β factors together have a significant effect on the accuracy of the LDA based FLT. The test revealed significant effects for JabRef (χ²(35) = 54.06, p < 0.03), jEdit (χ²(35) = 76.13, p < 0.001), muCommander (χ²(35) = 67.46, p < 0.001), and Mylyn (χ²(35) = 128.18, p < 0.001). For JabRef and muCommander, post-hoc tests using Mann–Whitney tests with Holm correction showed no significant differences between any two configurations (α/β pairs). Based on an analogous post-hoc test for jEdit, we can reject only 3 of the 630 null hypotheses, and similarly, based on the post-hoc test for Mylyn, we can reject only 30 null hypotheses (or about 5 % of the 630 null hypotheses). Of the 30 rejected null hypotheses for Mylyn, all pertain to configurations 1, 7, and 13, which share the common value 0.01 for β.

We also conducted a Kruskal–Wallis test on all 618 features, and the test revealed a significant effect (χ²(35) = 250.21, p < 0.001). Based on the post-hoc test using Mann–Whitney tests with Holm correction, we can reject 95 (of 630) null hypotheses. However, effect sizes range from negligible to small. In particular, the largest effect size (between configurations 7 and 34) is |r| = 0.21. The stability of the medians across the 36 configurations reflects the small effect sizes.

4.4.2 Discussion

In the statistical sense, the α and β factors have little influence on the accuracy of the LDA based FLT. However, like Griffiths and Steyvers (2004), we observe that β has more influence than does α. In addition, in the context of our case study, 0.25 and 0.50 are the β values which provide the best accuracy. Even though the observed effect sizes are relatively small, we find it interesting that the de facto standard heuristics from the natural language document clustering community are not optimal in this (source code) context.

4.5 Part 5: Applying the Lessons Learned

We first describe the results of applying the lessons learned, then describe recommendations for configuring the LDA based FLT.

4.5.1 Results

In this section we apply the lessons learned from Parts 1–4 of the case study. In particular, we use the lessons learned to predict an improved configuration for each of four subject systems—ArgoUML, JabRef, jEdit, and Rhino—when compared to two generic configurations informed by heuristics from the literature. We chose ArgoUML because it is the largest of our subject systems, Rhino because it is the smallest, and JabRef and jEdit because they are of similar size and have similar source line to comment line ratios.

Table 13 repeats the predicted configurations and the heuristic configurations. We first review the heuristic configurations. Both Heuristic 1 and Heuristic 2 have the value Combined for the Query parameter. To the best of our knowledge, Lukins et al. (2010) specify the only (manual) query formulation process for text retrieval based feature location. However, to avoid bias in our study, we automatically formulate queries from issue reports. Thus, we select Combined, because it performed the best in Part 2 of our study. Both Heuristic 1 and Heuristic 2 have the value ICL for the Text parameter. We select ICL, because most of the previous studies of text retrieval based feature location use ICL (e.g., Marcus et al. 2004; Poshyvanyk et al. 2007; Liu et al. 2007; Lukins et al. 2008, 2010; Revelle et al. 2010). Both Heuristic 1 and Heuristic 2 have the value 200 for the K parameter. Wei and Croft (2006) suggest 50 to 300 topics as good general purpose values for LDA, and we select 200 because our systems are of medium size. We use the heuristic 50/K (Griffiths and Steyvers 2004) and set α to 0.25 for both Heuristic 1 and Heuristic 2. Finally, Heuristic 1 has the value 0.01 for β (Wei and Croft 2006; Heinrich 2009; Lukins et al. 2010), and Heuristic 2 has the value 0.1 for β (Griffiths and Steyvers 2004; Savage et al. 2010; Rao and Kak 2011).
Table 13
Case study design: part 5

Configuration       Query, Text, K, α, β
Predicted ArgoUML   Combined, ICL, 500, 1.0, 0.1
Predicted jEdit     Combined, ICL, 400, 1.0, 0.25
Predicted JabRef    Combined, ICL, 400, 1.0, 0.25
Predicted Rhino     Combined, ICL, 300, 1.0, 0.5
Heuristic 1         Combined, ICL, 200, 0.25, 0.01
Heuristic 2         Combined, ICL, 200, 0.25, 0.1

All predicted configurations have the value Combined for Query, because Combined performed the best in Part 2 of our study. Further, all predicted configurations have the value ICL for Text. We select ICL, because in Part 3 of our study, ICL performed the best of the configurations that did not have any failures. Similarly, we select the K values based on the results of Part 3. Specifically, in Part 3 we observed that 200 was too small a value for K when paired with ICL for Text. Indeed, we found that performance increased when setting K to 300 instead. Thus, we set K = 300 for Rhino, the smallest of our systems. We conjecture that still larger values of K can further increase performance for our larger systems. Thus, we set K = 400 for jEdit and JabRef, and we set K = 500 for ArgoUML, the largest of our systems. We set α = 1.0 for all systems, because that value resulted in the best performance in Part 4 of our study. Finally, we set β for each system based on the value of K. We select β = 0.5 for Rhino (K = 300), because that is the β value that resulted in the best performance in Part 4. For the remaining systems, we use a β value that is inversely proportional to the value of K.

In Fig. 5 we illustrate box plots which represent statistics describing the effectiveness measures for the test data. The box plots show that our predicted configurations outperform the heuristic configurations in all cases. Moreover, the difference in accuracy between the predicted configurations and the heuristic configurations is statistically significant.
Fig. 5

The effectiveness measure for three configurations (Predicted, Heuristic 1, and Heuristic 2) of the LDA-based FLT applied to 91 ArgoUML features, 38 JabRef features, 149 jEdit features, and 157 Rhino features

4.5.2 Recommendations

Based on our results, we offer the following recommendations for configuring the LDA based FLT. First, we define system sizes: a small system has less than 100 KLOC, fewer than 10 K unique terms, and fewer than 200 K term usages; a small–medium system has less than 200 KLOC, approximately 10 K unique terms, and fewer than 500 K term usages; a medium–large system has more than 200 KLOC, more than 10 K unique terms, and approximately 500 K term usages; and a large system has more than 1,000 KLOC. Next, the recommendations (a sketch that encodes them follows the list):
  1. Set K = 300 for small systems, set K = 400 for small–medium systems, and set K = 500 for medium–large and large systems. Our recommendation regarding large systems is based on prior experience with Eclipse and Mozilla (Lukins et al. 2008, 2010).
  2. Set α = 1.0 for all systems.
  3. Set β for each system based on the value of K. Set β = 0.5 for small systems, and for larger systems, set β to be inversely proportional to K. For example, set β = 0.25 for small–medium systems, and set β = 0.1 for medium–large and large systems.
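The sketch below encodes these recommendations as a configuration helper; the thresholds come from the size definitions above, while the function name and the example call are hypothetical.

    def recommended_lda_config(kloc, unique_terms, term_usages):
        """Return (K, alpha, beta) following recommendations 1-3."""
        alpha = 1.0                                        # recommendation 2
        if kloc < 100 and unique_terms < 10_000 and term_usages < 200_000:
            return 300, alpha, 0.5                         # small system
        if kloc < 200 and term_usages < 500_000:
            return 400, alpha, 0.25                        # small-medium system
        return 500, alpha, 0.1                             # medium-large or large system

    # Example: a Rhino-sized system.
    K, a, b = recommended_lda_config(kloc=50, unique_terms=8_000, term_usages=150_000)  # (300, 1.0, 0.5)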

     

4.6 Threats to Validity

Our study has limitations that impact the validity of our findings, as well as our ability to generalize them. We describe some of these limitations and their impacts.

Threats to conclusion validity concern the degree to which conclusions we reach about relationships in our data are reasonable. We made no assumption about the distribution of the effectiveness measures, so we used a non-parametric statistical hypothesis test. Moreover, we used an adjustment method to control the family-wise error rate of our hypothesis tests. The test results are consistent with our observations.

Threats to construct validity concern the adequacy of the study procedure with regard to measurement of the concepts of interest and can arise due to poor measurement design. One such threat relates to our benchmarks. Specifically, errors in our gold sets would affect the correctness of our effectiveness measures. To mitigate this threat, we used previously used gold sets (Dit et al. 2011b; Eaddy et al. 2008; Gethers and Poshyvanyk 2010; Revelle and Poshyvanyk 2009; Revelle et al. 2010). The six gold sets were produced by other researchers and made available to the community.

Threats to internal validity include possible defects in our tool chain and possible errors in our execution of the study procedure, the presence of which might affect the accuracy of our results and the conclusions we draw from them. We controlled for these threats by testing our tool chain and by assessing the quality of our data. Because we applied the same tool chain to all subject systems, any errors are systematic and are unlikely to affect our results substantially.

Additional threats to internal validity are due to the preprocessing steps we applied to textual information extracted from source code. For example, applying a stemmer can cause terms with different meanings to be mapped to the same stem, thus causing overfitting of the data. Applying a more advanced stemmer may mitigate this issue. However, we believe that such a change is unlikely to affect our results substantially. Other decisions that potentially affect our results are the choices to split tokens, to retain the original (unsplit) tokens, and to filter stop words.

Another threat to internal validity pertains to our queries, which we took directly from issue report titles and descriptions. It is possible that these titles and descriptions do not accurately reflect the features of interest. However, four of the subject systems in our study are used primarily by software developers, who are likely to be better able to accurately describe the faulty features than are end-users. Though we certainly could have obtained better results by tuning the queries (Poshyvanyk et al. 2007; Lukins et al. 2010), our query formulation process prevented us from introducing bias.

Threats to external validity concern the extent to which we can generalize our results. The subjects of our study comprise 618 features in 6 open source Java systems, so we cannot generalize our results to systems implemented in other languages. However, the systems are from different domains and have characteristics which are similar to those of systems developed in industry.

5 Conclusions

In this paper we presented a case study in which we considered the configuration of an LDA based FLT using 618 features from 6 open source Java systems. Specifically, we considered five configuration parameters, the first of which was the query. The second configuration parameter that we studied was the extracted text, and we are aware of no other study that has considered this parameter. The remaining configuration parameters were the number of topics (K) and the two smoothing hyperparameters (α and β). There is little discussion of selecting K in the software engineering literature (Maskeri et al. 2008; Lukins et al. 2010), though Griffiths and Steyvers (2004) investigated methods for choosing optimal K values in the context of natural language (NL) document clustering. Similarly, there is no other work in the software engineering literature that has investigated the effects of α and β, though Asuncion et al. (2009) investigated the issue in the context of NL document clustering.

The results of our case study suggest that the LDA based FLT is robust with regard to its configuration. Though we observed statistically significant effects in each of the three parts of the case study, effect sizes were negligible to small. Nevertheless, we did find that certain configurations outperform others, and some of our findings contradict conventional wisdom. For example, common heuristics for selecting LDA hyperparameter values in the natural language context are not optimal in the source code context. Indeed, these heuristics (β = 0.01 or β = 0.1) provided the worst performance among the tested configurations when setting K using common heuristics from the natural language context. Finally, we offer specific recommendations for configuring an LDA based FLT. Our recommendations are based on the results of the case study and on prior experience (Lukins et al. 2008, 2010).


Acknowledgements

We thank the anonymous reviewers for their insightful comments and helpful suggestions. This material is based upon work supported by the National Science Foundation under Grant Nos. 0851824, 0915559, and 1156563 and by the U.S. Department of Education under Grant No. P200A100182.

References

  1. Abadi A, Nisenson M, Simionovici Y (2008) A traceability technique for specifications. In: Proc of the 16th IEEE int’l conf on program comprehension, pp 103–112. doi: 10.1109/ICPC.2008.30
  2. Abebe S, Haiduc S, Marcus A, Tonella P, Antoniol G (2009a) Analyzing the evolution of the source code vocabulary. In: Proc of the 13th European conf on software maintenance and reengineering, pp 189–198. doi: 10.1109/CSMR.2009.61
  3. Abebe S, Haiduc S, Tonella P, Marcus A (2009b) Lexicon bad smells in software. In: Proc of the 16th working conf on reverse engineering, pp 95–99. doi: 10.1109/WCRE.2009.26
  4. Andrieu C, Freitas N, Doucet A, Jordan M (2003) An introduction to MCMC for machine learning. Mach Learn 50(1–2):5–43
  5. Antoniol G, Canfora G, Casazza G, Lucia AD, Merlo E (2002) Recovering traceability links between code and documentation. IEEE Trans Softw Eng 28(10):970–983
  6. Asuncion A, Welling M, Smyth P, Teh Y (2009) On smoothing and inference for topic models. In: Proc of the 25th conf on uncertainty in artificial intelligence, pp 27–34
  7. Asuncion H, Asuncion A, Taylor R (2010) Software traceability with topic modeling. In: Proc of the 32nd int’l conf on software engineering, pp 95–104. doi: 10.1145/1806799.1806817
  8. Baldi P, Linstead E, Lopes C, Bajracharya S (2008) A theory of aspects as latent topics. In: Proc of the ACM SIGPLAN conf on object-oriented programming, systems, languages, and applications, pp 543–562. doi: 10.1145/1449955.1449807
  9. Basili V, Caldiera G, Rombach H (1994) The goal question metric approach. ftp://ftp.cs.umd.edu/pub/sel/papers/gqm.pdf. Accessed 15 Feb 2011
  10. Beard M, Kraft N, Etzkorn L, Lukins S (2011) Measuring the accuracy of information retrieval based bug localization techniques. In: Proc of the 18th working conf on reverse engineering, pp 124–128. doi: 10.1109/WCRE.2011.23
  11. Biggerstaff T, Mitbander B, Webster D (1993) The concept assignment problem in program understanding. In: Proc of the int’l conf on software engineering, pp 482–498
  12. Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
  13. Canfora G, Cerulo L (2006) Fine grained indexing of software repositories to support impact analysis. In: Proc of the 3rd int’l wksp on mining software repositories, pp 105–111. doi: 10.1145/1137983.1138009
  14. Chang J, Blei D (2010) Hierarchical relational models for document networks. Ann Appl Stat 4(1):124–150
  15. Corley C, Kraft N, Etzkorn L, Lukins S (2011) Recovering traceability links between source code and fixed bugs via patch analysis. In: Proc of the 6th int’l wksp on traceability in emerging forms of software engineering, pp 31–37. doi: 10.1145/1987856.1987863
  16. De Lucia A, Fasano F, Oliveto R, Tortora G (2007) Recovering traceability links in software artifact management systems using information retrieval methods. ACM Trans Softw Eng Methodol 16(4). doi: 10.1145/1276933.1276934
  17. Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41:391–407
  18. Dit B, Guerrouj L, Poshyvanyk D, Antoniol G (2011a) Can better identifier splitting techniques help feature location? In: Proc of the 19th IEEE int’l conf on program comprehension, pp 11–20. doi: 10.1109/ICPC.2011.47
  19. Dit B, Revelle M, Gethers M, Poshyvanyk D (2011b) Feature location in source code: a taxonomy and survey. J Softw Maint Evol: Res Pract. doi: 10.1002/smr.567
  20. Eaddy M, Zimmermann T, Sherwood K, Garg V, Murphy G, Nagappan N, Aho A (2008) Do crosscutting concerns cause defects? IEEE Trans Softw Eng 34(4):497–515
  21. Eisenberg A, Volder KD (2005) Dynamic feature traces: finding features in unfamiliar code. In: Proc of the 21st IEEE int’l conf on software maintenance, pp 337–346. doi: 10.1109/ICSM.2005.42
  22. Fluri B, Wursch M, Gall H (2007) Do code and comments co-evolve? On the relation between source code and comment changes. In: Proc of the 14th working conf on reverse engineering, pp 70–79. doi: 10.1109/WCRE.2007.21
  23. Fox C (1992) Lexical analysis and stoplists. In: Frakes W, Baeza-Yates R (eds) Information retrieval: data structures and algorithms. Prentice-Hall, Englewood Cliffs, NJ
  24. Gay G, Haiduc S, Marcus A, Menzies T (2009) On the use of relevance feedback in IR-based concept location. In: Proc of the IEEE int’l conf on software maintenance, pp 351–360. doi: 10.1109/ICSM.2009.5306315
  25. Gethers M, Poshyvanyk D (2010) Using relational topic models to capture coupling among classes in object-oriented software systems. In: Proc of the int’l conf on software maintenance, pp 1–10. doi: 10.1109/ICSM.2010.5609687
  26. Griffiths T, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci 101(Suppl 1):5228–5235. doi: 10.1073/pnas.0307752101
  27. Heinrich G (2009) Parameter estimation for text analysis. Tech Rep, Fraunhofer IGD, Darmstadt, Germany. http://www.arbylon.net/publications/text-est2.pdf. Version 2.9. Accessed 15 Feb 2011
  28. Hill E, Pollock L, Vijay-Shanker K (2007) Exploring the neighborhood with Dora to expedite software maintenance. In: Proc of the 22nd int’l conf on automated software engineering, pp 14–23. doi: 10.1145/1321631.1321637
  29. Lawrie D, Binkley D (2011) Expanding identifiers to normalize source code vocabulary. In: Proc of the 27th IEEE int’l conf on software maintenance, pp 113–122. doi: 10.1109/ICSM.2011.6080778
  30. Liu D, Marcus A, Poshyvanyk D, Rajlich V (2007) Feature location via information retrieval based filtering of a single scenario execution trace. In: Proc of the 22nd int’l conf on automated software engineering, pp 234–243. doi: 10.1145/1321631.1321667
  31. Liu Y, Poshyvanyk D, Ferenc R, Gyimothy T, Chrisochoides N (2009) Modeling class cohesion as mixtures of latent topics. In: Proc of the 25th IEEE int’l conf on software maintenance, pp 233–242. doi: 10.1109/ICSM.2009.5306318
  32. Lukins S, Kraft N, Etzkorn L (2008) Source code retrieval for bug localization using latent Dirichlet allocation. In: Proc of the 15th working conf on reverse engineering. doi: 10.1109/WCRE.2008.33
  33. Lukins S, Kraft N, Etzkorn L (2010) Bug localization using latent Dirichlet allocation. Inf Softw Technol 52(9):972–990
  34. Marcus A, Menzies T (2010) Software is data too. In: Proc of the FSE/SDP wksp on future of software engineering research, pp 229–232. doi: 10.1145/1882362.1882410
  35. Marcus A, Poshyvanyk D (2005) The conceptual cohesion of classes. In: Proc of the 21st IEEE int’l conf on software maintenance, pp 133–142. doi: 10.1109/ICSM.2005.89
  36. Marcus A, Sergeyev A, Rajlich V, Maletic J (2004) An information retrieval approach to concept location in source code. In: Proc of the 11th working conf on reverse engineering, pp 214–223. doi: 10.1109/WCRE.2004.10
  37. Maskeri G, Sarkar S, Heafield K (2008) Mining business topics in source code using latent Dirichlet allocation. In: Proc of the 1st India software engineering conf. doi: 10.1145/1342211.1342234
  38. Minka T (2009) Estimating a Dirichlet distribution. Tech Rep. http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf. Accessed 20 Jun 2011
  39. Oliveto R, Gethers M, Poshyvanyk D, De Lucia A (2010) On the equivalence of information retrieval methods for automated traceability link recovery. In: Proc of the IEEE int’l conf on program comprehension, pp 68–71. doi: 10.1109/ICPC.2010.20
  40. Poshyvanyk D, Gueheneuc Y, Marcus A, Antoniol G, Rajlich V (2007) Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval. IEEE Trans Softw Eng 33(6):420–432
  41. Poshyvanyk D, Marcus A, Ferenc R, Gyimóthy T (2009) Using information retrieval based coupling measures for impact analysis. Empir Software Eng 14(1):5–32
  42. Rajlich V (2006) Changing the paradigm of software engineering. Commun ACM 49(8):67–70
  43. Rajlich V, Wilde N (2002) The role of concepts in program comprehension. In: Proc of the 10th IEEE int’l wksp on program comprehension, pp 271–278. doi: 10.1109/WPC.2002.1021348
  44. Rao S, Kak A (2011) Retrieval from software libraries for bug localization: a comparative study with generic and composite text models. In: Proc of the 8th working conf on mining software repositories, pp 43–52. doi: 10.1145/1985441.1985451
  45. Ratanotayanon S, Choi H, Sim S (2010) My repository runneth over: an empirical study on diversifying data sources to improve feature search. In: Proc of the 18th IEEE int’l conf on program comprehension, pp 206–215. doi: 10.1109/ICPC.2010.33
  46. Revelle M, Poshyvanyk D (2009) An exploratory study on assessing feature location techniques. In: Proc of the 17th int’l conf on program comprehension, pp 218–222. doi: 10.1109/ICPC.2009.5090045
  47. Revelle M, Dit B, Poshyvanyk D (2010) Using data fusion and web mining to support feature location in software. In: Proc of the 18th IEEE int’l conf on program comprehension, pp 14–23. doi: 10.1109/ICPC.2010.10
  48. Salton G (1989) Automatic text processing: the transformation, analysis and retrieval of information by computer. Addison-Wesley, Reading, MA
  49. Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manag 24(5):513–523
  50. Savage T, Dit B, Gethers M, Poshyvanyk D (2010) TopicXP: exploring topics in source code using latent Dirichlet allocation. In: Proc of the 26th IEEE int’l conf on software maintenance, pp 1–6. doi: 10.1109/ICSM.2010.5609654
  51. Scanniello G, Marcus A (2011) Clustering support for static concept location in source code. In: Proc of the 19th IEEE int’l conf on program comprehension, pp 1–10. doi: 10.1109/ICPC.2011.13
  52. Shao P, Atkison T, Kraft N, Smith R (2012) Combining lexical and structural information for static bug localization. Int J Comput Appl Technol 44(1):61–71
  53. Thomas S, Adams B, Hassan A, Blostein D (2011) Modeling the evolution of topics in source code histories. In: Proc of the 8th IEEE working conf on mining software repositories, pp 173–182. doi: 10.1145/1985441.1985467
  54. Tian K, Revelle M, Poshyvanyk D (2009) Using latent Dirichlet allocation for automatic categorization of software. In: Proc of the 6th IEEE working conf on mining software repositories, pp 163–166. doi: 10.1109/MSR.2009.5069496
  55. Vinz B, Etzkorn L (2006) A synergistic approach to program comprehension. In: Proc of the 14th IEEE int’l conf on program comprehension, pp 69–73. doi: 10.1109/ICPC.2006.7
  56. Wei X, Croft W (2006) LDA-based document models for ad-hoc retrieval. In: Proc of ACM SIGIR, pp 178–185. doi: 10.1145/1148170.1148204
  57. Zhao W, Zhang L, Liu Y, Sun J, Yang F (2006) SNIAFL: towards a static noninteractive approach to feature location. ACM Trans Softw Eng Methodol 15(2):195–226

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Lauren R. Biggers (1)
  • Cecylia Bocovich (2, 5)
  • Riley Capshaw (3)
  • Brian P. Eddy (1)
  • Letha H. Etzkorn (4)
  • Nicholas A. Kraft (1)

  1. Department of Computer Science, The University of Alabama, Tuscaloosa, USA
  2. Department of Mathematics, Statistics, and Computer Science, Macalester College, Saint Paul, USA
  3. Department of Mathematics & Computer Science, Hendrix College, Conway, USA
  4. Department of Computer Science, The University of Alabama in Huntsville, Huntsville, USA
  5. David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
