Improved Bug Localization Technique Using Hybrid Information Retrieval Model
The need for bug localization tools is growing continuously, as is the popularity of text-based IR models for locating the source code files that contain bugs. The time and cost of fixing bugs can be reduced considerably by techniques that narrow the search space from a few thousand source code files to a few files. The main contribution of this paper is a hybrid model for bug localization based on two existing IR models (VSM and N-gram). The proposed hybrid model further improves performance by using word-based bigrams. We also introduce a weighting factor β to calculate the weighted sum of the unigram and bigram components, and analyze accuracy for values of β ranging from 0 to 1. Experiments using the TopN, MRR and MAP measures show that the proposed hybrid model outperforms some existing state-of-the-art bug localization techniques.
Bug fixing is an important activity, and reducing the time and effort it requires has become a major area of concern. For this reason, bug fixing techniques have gained special attention from researchers. The steps involved in traditional bug fixing are as follows:
1. Bug reports are received and verified.
2. The developer team locates the buggy source code files to be fixed.
3. The source code files are fixed.
The second step is time-consuming if done manually. The effort and time spent on bug fixing can be reduced by using tools to locate the buggy source code files. This process of using tools to locate the buggy source code files is termed bug localization.
Previous work on bug localization using information retrieval techniques includes:
1. Lukins et al. (2010) applied the LDA (Latent Dirichlet Allocation) model to bug localization.
2. Rao and Kak (2011) carried out a comparative analysis of various IR techniques, including Unigram, Latent Semantic Analysis (LSA), VSM, LDA and Cluster.
3. Zhou et al. proposed BugLocator, which uses a sophisticated TF.IDF formulation, a file length factor and similarity to previously fixed bugs.
4. Saha et al. proposed BLUiR, and suggested and tested an approach that uses code structural information for information retrieval.
5. Lal and Sureka proposed and tested the hypothesis that character-based n-grams can be applied to achieve bug localization.
2 Architecture of Hybrid Model
In the Vector Space Model (VSM), the vector value of each token is calculated from its token frequency (tf) and inverse document frequency (idf). One disadvantage of VSM is that it does not model term dependencies, such as phrases or adjacent terms. Song and Croft proposed using a statistical language model, approximated by N-gram models, in information retrieval. The unigram model assumes that each word occurs independently, whereas the bigram model takes the local context into consideration. The proposed hybrid model captures the relevance of adjacent terms by using the bigram model and calculating bigram-based vectors.
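As an illustration of how unigram and bigram vectors might be combined, the following Python sketch ranks source files against a bug report. It is not the authors' implementation: the combination rule score = (1 − β) · sim_unigram + β · sim_bigram, the tokenizer, the tf-idf weighting variant, and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter


def unigrams(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bigrams(text):
    """Word-based bigrams: each pair of adjacent word tokens."""
    words = unigrams(text)
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def idf_table(docs_tokens):
    """Smoothed idf per term, computed over the source files only."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {term: math.log(1 + n_docs / count) for term, count in df.items()}


def weight(tokens, idf):
    """tf-idf vector as a dict; terms unseen in the corpus are dropped."""
    tf = Counter(tokens)
    return {t: (1 + math.log(c)) * idf[t] for t, c in tf.items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_files(bug_report, files, beta=0.5):
    """Rank files (dict: name -> text) by the assumed weighted sum
    (1 - beta) * unigram similarity + beta * bigram similarity."""
    names = list(files)
    scores = {name: 0.0 for name in names}
    for extract, share in ((unigrams, 1.0 - beta), (bigrams, beta)):
        docs = [extract(files[name]) for name in names]
        idf = idf_table(docs)
        query = weight(extract(bug_report), idf)
        for name, tokens in zip(names, docs):
            scores[name] += share * cosine(query, weight(tokens, idf))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus, for illustration only.
files = {
    "Button.java": "button click handler dispose widget",
    "Table.java": "table column resize handler",
    "Font.java": "font metrics dispose",
}
ranking = rank_files("crash in button click handler", files, beta=0.5)
```

With β = 0 the sketch reduces to a plain unigram VSM ranking, and with β = 1 only adjacent word pairs contribute, so a file that shares isolated words with the bug report but no word pairs receives a score of zero.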
2.1 Overview of Proposed Approach
3 Experimental Results
Value of MRR for SWT, AspectJ and Eclipse datasets for β range (0 to 1)
Comparative analysis of MRR for rVSM and Hybrid model for benchmark datasets. Hybrid model MRR values: 0.553 (β = 0.5), 0.353 (β = 0.3), 0.381 (β = 0.3)
Details of Benchmark Datasets
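The TopN, MRR and MAP measures used in the experiments can be computed from ranked file lists as in the Python sketch below. It follows the standard definitions of these measures; the function names and the toy data are assumptions for illustration, and TopN is returned here as a fraction of bug reports rather than a count.

```python
def top_n(ranked_lists, relevant_sets, n):
    """Fraction of bug reports with at least one buggy file in the top n."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(name in relevant for name in ranked[:n])
    )
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Mean of 1 / (rank of the first buggy file) over all bug reports."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for position, name in enumerate(ranked, start=1):
            if name in relevant:
                total += 1.0 / position
                break
    return total / len(ranked_lists)


def mean_average_precision(ranked_lists, relevant_sets):
    """Mean over bug reports of the average precision of each ranking."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        found = 0
        precision_sum = 0.0
        for position, name in enumerate(ranked, start=1):
            if name in relevant:
                found += 1
                precision_sum += found / position
        if relevant:
            total += precision_sum / len(relevant)
    return total / len(ranked_lists)


# Hypothetical rankings for two bug reports and their actual buggy files.
ranked_lists = [["A", "B", "C"], ["B", "C", "A"]]
relevant_sets = [{"A"}, {"A", "C"}]

top1 = top_n(ranked_lists, relevant_sets, 1)
mrr = mean_reciprocal_rank(ranked_lists, relevant_sets)
map_score = mean_average_precision(ranked_lists, relevant_sets)
```

For the toy data, the first bug report has its buggy file at rank 1 and the second at rank 2, so Top1 is 0.5 and MRR is (1 + 1/2) / 2 = 0.75.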
4 Conclusion and Future Work
The experimental results in Table 1 show that optimum performance is achieved at β = 0.5, β = 0.3 and β = 0.3 for the benchmark datasets SWT, AspectJ and Eclipse, respectively. The proposed hybrid model also shows consistent performance improvements on all three measures (TopN, MRR and MAP) when compared with the Classical and rVSM techniques. Future work will focus on testing the hybrid model on large preprocessed datasets.
- 1. Song, F., Croft, B.: A general language model for information retrieval. In: Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval (1999)
- 4. Rao, S., Kak, A.: Retrieval from software libraries for bug localization: a comparative study of generic and composite text models. In: Proceedings of the 8th Working Conference on Mining Software Repositories (MSR 2011), pp. 43–52. ACM, Waikiki, Honolulu, Hawaii (May 2011)
- 5. Zhou, J., Zhang, H., Lo, D.: Where should the bugs be fixed? - more accurate information retrieval-based bug localization based on bug reports. In: Proceedings of the 2012 International Conference on Software Engineering, ICSE 2012, pp. 14–24. IEEE Press, Piscataway, NJ, USA (2012)
- 6. Lal, S., Sureka, A.: A static technique for fault localization using character n-gram based information retrieval model. In: Proceedings of ISEC 2012, Kanpur, UP, India (22–25 February 2012)
- 7. Saha, R.K., Lease, M., Khurshid, S., Perry, D.E.: Improving bug localization using structured information retrieval. In: Proceedings of ASE, pp. 345–355, Heidelberg, New York (2013)