International Conference on Distributed Computing and Internet Technology

Distributed Computing and Internet Technology, pp 127-131

Improved Bug Localization Technique Using Hybrid Information Retrieval Model

  • Alpa Gore
  • Siddharth Dutt Choubey
  • Kopal Gangrade
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9581)

Abstract

Demand for bug localization tools is growing, as is the popularity of text-based IR models for locating the source code files that contain bugs. The time and cost of fixing bugs can be reduced considerably by narrowing the search space from a few thousand source code files to a few files. The main contribution of this paper is a hybrid model for bug localization based on two existing IR models (VSM and N-gram). The proposed hybrid model further improves performance by using word-based bigrams. We also introduce a weighting factor β to compute the weighted sum of the unigram and bigram scores, and analyze its accuracy for values of β ranging from 0 to 1. Experiments using the TopN, MRR and MAP measures show that the proposed hybrid model outperforms some existing state-of-the-art bug localization techniques.

1 Introduction

Bug fixing is an important activity, and reducing the time and effort it requires has become a major concern. For this reason, bug fixing techniques have received special attention from researchers. The steps involved in traditional bug fixing are as follows:

  1. Bug reports are received and verified.
  2. The development team locates the buggy source code files to be fixed.
  3. The source code files are fixed.

The second step is time-consuming if done manually. The effort and time spent on bug fixing can be reduced by using tools to locate the buggy source code files. This process of using tools to locate the buggy source code files is termed bug localization. Previous work on bug localization using information retrieval techniques includes:

  1. Lukins et al. [3], in 2010, applied the LDA (Latent Dirichlet Allocation) model to bug localization.
  2. Rao and Kak [4], in 2011, carried out a comparative analysis of various IR techniques such as Unigram, Latent Semantic Analysis (LSA), VSM, LDA and Cluster.
  3. Zhou et al. [5] proposed BugLocator, which uses a sophisticated TF.IDF formulation, a file length factor and similarity to previously fixed bugs.
  4. Saha et al. [7] proposed BLUiR, and suggested and tested an approach that uses code structural information for information retrieval.
  5. Lal and Sureka [6] proposed and tested the hypothesis that character-based n-grams can be applied to bug localization.

2 Architecture of Hybrid Model

In the Vector Space Model (VSM), the vector value is calculated from the token frequency (tf) and inverse document frequency (idf) of each token [2]. One disadvantage of VSM is that it does not include term dependencies in the model, for instance for modeling phrases or adjacent terms. Song and Croft [1] proposed the use of statistical language models, approximated by N-gram models, in information retrieval. The unigram model assumes that each word occurs independently, whereas the bigram model takes the local context into consideration. The proposed hybrid model captures the relevance of adjacent terms by using the bigram model and calculating bigram-based vectors.
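The difference between unigram and word-based bigram terms can be illustrated with a minimal Python sketch. The tokenizer and the names `tokenize`, `word_bigrams` and `term_counts` are assumptions made for illustration; the paper does not specify its preprocessing.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on non-alphanumeric characters.

    This is an assumed, deliberately simple tokenizer, not the paper's.
    """
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def word_bigrams(tokens):
    """Pair each token with its successor so adjacent-term context is kept."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def term_counts(text):
    """Return (unigram counts, bigram counts) for one document or bug report."""
    tokens = tokenize(text)
    return Counter(tokens), Counter(word_bigrams(tokens))


unigrams, bigrams = term_counts("Null pointer exception in null pointer check")
print(unigrams.most_common(2))
print(bigrams.most_common(1))
```

The unigram counts treat "null" and "pointer" as independent terms, while the bigram counts record that they occur next to each other, which is the local context the hybrid model is meant to exploit.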

2.1 Overview of Proposed Approach

We propose a hybrid model that uses the word-based N-gram model [1] in conjunction with the VSM model [4]. In addition to single terms (unigrams), the proposed hybrid model also indexes bigrams. A weighting factor β (0 ≤ β ≤ 1) is used to combine the scores from unigram and bigram terms, after which the final ranking of documents is produced. The data sets used for the experiments are SWT (v3.1), AspectJ and Eclipse (v3.1) (Table 3), which are a subset of the data sets used by BugLocator, and the results are compared against it. TopN, MRR (Mean Reciprocal Rank) and MAP (Mean Average Precision) are used as evaluation metrics. Figure 1 shows the architecture of the proposed hybrid model. The improvement in bug localization performance comes from exploiting semantic similarity through a statistical language model such as the N-gram model. The TF.IDF model is used for indexing and calculating scores. We have used rVSM, proposed by Zhou et al. [5], for length normalization, TF.IDF and final score calculations.
Fig. 1. Overall architecture of the new hybrid approach using unigram and bigram scores

Every bug report is treated as a search query and is matched against each source code file to calculate scores. In the proposed hybrid approach we calculate uScore (from the unigram vector) and bScore (from the bigram vector) using rVSM [5]. After calculating uScore and bScore for each file, we combine the two scores as a weighted sum:
$$ \text{fScore} = \text{uScore} \times (1 - \beta) + \text{bScore} \times \beta \tag{1} $$
where β is a weighting factor with 0 ≤ β ≤ 1. The fScore is the final score, the weighted sum of uScore and bScore. The source code files are ranked by fScore and the FinalRank is returned to the user.
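As an illustration of Eq. (1), the following Python sketch combines per-file scores and ranks the files. The function names, the example file names and the score values are hypothetical. Treating a file that is missing from one score map as having score 0 is an assumption of this sketch.

```python
def combine_scores(u_scores, b_scores, beta):
    """Apply Eq. (1) per file: fScore = uScore * (1 - beta) + bScore * beta."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    files = set(u_scores) | set(b_scores)
    # Assumption: a file absent from one map contributes a score of 0 there.
    return {
        f: u_scores.get(f, 0.0) * (1 - beta) + b_scores.get(f, 0.0) * beta
        for f in files
    }


def rank_files(f_scores):
    """Sort files by descending fScore; ties are broken by file name."""
    return sorted(f_scores, key=lambda f: (-f_scores[f], f))


# Hypothetical scores for three files and one bug report.
u = {"A.java": 0.8, "B.java": 0.4, "C.java": 0.1}
b = {"A.java": 0.2, "B.java": 0.9}

f = combine_scores(u, b, beta=0.5)
ranking = rank_files(f)
print(ranking)
```

With β = 0 the ranking reduces to the unigram ranking, and with β = 1 it reduces to the bigram ranking, so β controls how much weight adjacent-term evidence receives.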

3 Experimental Result

We evaluated the MRR (Mean Reciprocal Rank) measure for values of β in the range 0 to 1, as shown in Table 1, to establish the optimum value of β for each dataset. The comparative analysis of the proposed hybrid approach against Classical VSM and rVSM [5] on the Top1, Top5, Top10, MRR and MAP measures is shown in Table 2 and Fig. 2; the datasets are described in Table 3.
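The evaluation measures can be sketched in Python as follows. This uses the standard definitions of reciprocal rank, average precision and TopN; the exact conventions used in the paper are not stated. In particular, reporting TopN as a fraction of bug reports and the input format of `evaluate` are assumptions of this sketch.

```python
def reciprocal_rank(ranking, buggy):
    """Return 1 / position of the first buggy file, or 0 if none is ranked."""
    for position, name in enumerate(ranking, start=1):
        if name in buggy:
            return 1.0 / position
    return 0.0


def average_precision(ranking, buggy):
    """Average the precision values taken at each position holding a buggy file."""
    hits, total = 0, 0.0
    for position, name in enumerate(ranking, start=1):
        if name in buggy:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def top_n_hit(ranking, buggy, n):
    """True if at least one buggy file appears among the first n results."""
    return any(name in buggy for name in ranking[:n])


def evaluate(results, n_values=(1, 5, 10)):
    """results: list of (ranking, set of buggy files), one pair per bug report."""
    count = len(results)
    summary = {
        f"Top{n}": sum(top_n_hit(r, b, n) for r, b in results) / count
        for n in n_values
    }
    summary["MRR"] = sum(reciprocal_rank(r, b) for r, b in results) / count
    summary["MAP"] = sum(average_precision(r, b) for r, b in results) / count
    return summary


# Two hypothetical bug reports over the same three-file ranking.
results = [
    (["a", "b", "c"], {"a"}),
    (["a", "b", "c"], {"b", "c"}),
]
summary = evaluate(results)
print(summary)
```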
Table 1. Value of MRR for SWT, AspectJ and Eclipse datasets for β range (0 to 1)

β        0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
SWT      0.532  0.533  0.530  0.531  0.550  0.553  0.540  0.544  0.506  0.489  0.346
AspectJ  0.347  0.351  0.352  0.353  0.347  0.343  0.336  0.318  0.294  0.262  0.212
Eclipse  0.300  0.322  0.336  0.381  0.343  0.336  0.324  0.306  0.284  0.265  0.241
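Choosing the optimum β for each dataset amounts to taking the maximum of each row of Table 1. A short Python sketch follows, using the MRR values exactly as reported in Table 1; the variable and function names are illustrative.

```python
# Beta grid and MRR values copied from Table 1.
betas = [i / 10 for i in range(11)]
mrr = {
    "SWT": [0.532, 0.533, 0.530, 0.531, 0.550, 0.553,
            0.540, 0.544, 0.506, 0.489, 0.346],
    "AspectJ": [0.347, 0.351, 0.352, 0.353, 0.347, 0.343,
                0.336, 0.318, 0.294, 0.262, 0.212],
    "Eclipse": [0.300, 0.322, 0.336, 0.381, 0.343, 0.336,
                0.324, 0.306, 0.284, 0.265, 0.241],
}


def best_beta(values):
    """Return (beta, MRR) at the position of the largest MRR value."""
    index = max(range(len(values)), key=values.__getitem__)
    return betas[index], values[index]


optimum = {project: best_beta(values) for project, values in mrr.items()}
for project, (beta, score) in optimum.items():
    print(f"{project}: best beta = {beta}, MRR = {score}")
```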

Table 2. Comparative analysis of MRR for rVSM and Hybrid model for benchmark dataset

Model    SWT              AspectJ          Eclipse
rVSM     0.47             0.33             0.35
Hybrid   0.553 (β = 0.5)  0.353 (β = 0.3)  0.381 (β = 0.3)
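The size of the improvement in Table 2 can be expressed as a relative gain in MRR over rVSM. The sketch below only performs that arithmetic on the values reported in Table 2; the relative-gain figure is a derived quantity and does not appear in the paper.

```python
# MRR values copied from Table 2.
rvsm_mrr = {"SWT": 0.47, "AspectJ": 0.33, "Eclipse": 0.35}
hybrid_mrr = {"SWT": 0.553, "AspectJ": 0.353, "Eclipse": 0.381}


def relative_gain(baseline, improved):
    """Percentage change of `improved` relative to `baseline`."""
    return (improved - baseline) / baseline * 100.0


gains = {p: relative_gain(rvsm_mrr[p], hybrid_mrr[p]) for p in rvsm_mrr}
for project, gain in gains.items():
    print(f"{project}: {gain:.1f}% relative MRR gain over rVSM")
```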

Table 3. Details of Benchmark Datasets

Project         #Bugs  #Files
SWT (v3.1)      98     484
AspectJ         286    6485
Eclipse (v3.1)  3075   12863

Fig. 2. Comparative analysis of Top1, Top5 and Top10, MAP and MRR for Classical, rVSM and Hybrid Model for data set SWT, AspectJ and Eclipse

4 Conclusion and Future Work

The experimental results in Table 1 show that the best MRR is achieved at β = 0.5, β = 0.3 and β = 0.3 for the benchmark datasets SWT, AspectJ and Eclipse respectively. The proposed hybrid model also shows consistent performance improvements on all three measures (TopN, MRR and MAP) when compared with the Classical VSM and rVSM techniques. Future work will focus on testing the hybrid model on large preprocessed data sets.

References

  1. Song, F., Croft, B.: A general language model for information retrieval. In: Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval (1999)
  2. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
  3. Lukins, S., Kraft, N., Etzkorn, L.: Bug localization using latent Dirichlet allocation. Inf. Softw. Technol. 52(9), 972–990 (2010)
  4. Rao, S., Kak, A.: Retrieval from software libraries for bug localization: a comparative study of generic and composite text models. In: Proceeding of the 8th Working Conference on Mining Software Repositories (MSR 2011), pp. 43–52. ACM, Waikiki, Honolulu, Hawaii (May 2011)
  5. Zhou, J., Zhang, H., Lo, D.: Where should the bugs be fixed? - more accurate information retrieval-based bug localization based on bug reports. In: Proceedings of the 2012 International Conference on Software Engineering, ICSE 2012, pp. 14–24. IEEE Press, Piscataway, NJ, USA (2012)
  6. Lal, S., Sureka, A.: A static technique for fault localization using character n-gram based information retrieval model. In: Proceedings of ISEC 2012, Kanpur, UP, India (22–25 February 2012)
  7. Saha, R.K., Lease, M., Khurshid, S., Perry, D.E.: Improving bug localization using structured information retrieval. In: Proceedings of ASE, pp. 345–355, Heidelberg, New York (2013)

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Alpa Gore (1)
  • Siddharth Dutt Choubey (2)
  • Kopal Gangrade (2)

  1. Department of Computer Science and Engineering, Shri Ram Institute of Technology, Jabalpur, India
  2. Department of Information Technology, Shri Ram Institute of Technology, Jabalpur, India
