Abstract
Concern localization refers to the process of locating code units that match a particular textual description. It takes as input textual documents such as bug reports and feature requests and outputs a list of candidate code units that are relevant to them. Many information retrieval (IR) based concern localization techniques have been proposed in the literature. These techniques typically represent code units and textual descriptions as a bag of tokens at one level of abstraction, e.g., each token is a word, or each token is a topic. In this work, we propose a multi-abstraction concern localization technique named MULAB. MULAB represents a code unit and a textual description at multiple abstraction levels. The similarity between a textual description and a code unit is then computed by considering all of these abstraction levels. We combine a vector space model (VSM) and multiple topic models to compute the similarity and apply a genetic algorithm to infer semi-optimal topic model configurations. We also propose 12 variants of MULAB by using different data fusion methods. We have evaluated our solution on 175 concerns from 9 open source Java software systems. The experimental results show that the variant CombMNZ-Def performs better than the other variants and also outperforms the state-of-the-art baseline PR (a PageRank-based algorithm) proposed by Scanniello et al. (Empir Softw Eng 20(6):1666–1720 2015) in terms of effectiveness and rank.
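To make the data fusion step concrete, the following is a minimal sketch of the classic CombMNZ rule (Fox and Shaw): each code unit's normalized scores are summed across rankers, and the sum is multiplied by the number of rankers that gave the unit a non-zero score. The min-max normalization, the function names, and the toy scores are assumptions made for illustration. The sketch does not reproduce the configuration behind the CombMNZ-Def variant, which is not described in this section.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one ranker's scores to [0, 1] (assumed normalization)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # A constant score list carries no ranking information.
        return {unit: 0.0 for unit in scores}
    return {unit: (s - lo) / (hi - lo) for unit, s in scores.items()}


def comb_mnz(score_lists: List[Dict[str, float]]) -> Dict[str, float]:
    """Fuse per-ranker similarity scores with the CombMNZ rule."""
    normalized = [min_max_normalize(s) for s in score_lists]
    units = set().union(*normalized)  # every code unit seen by any ranker
    fused = {}
    for unit in units:
        values = [n.get(unit, 0.0) for n in normalized]
        non_zero = sum(1 for v in values if v > 0)
        fused[unit] = sum(values) * non_zero
    return fused


# Hypothetical similarity scores of three methods for one concern,
# produced by a VSM ranker and by one topic-model ranker.
vsm_scores = {"A.parse": 0.9, "B.render": 0.1, "C.save": 0.5}
topic_scores = {"A.parse": 0.4, "B.render": 0.8, "C.save": 0.2}

fused = comb_mnz([vsm_scores, topic_scores])
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

Because the multiplier rewards agreement between rankers, a code unit scored highly at several abstraction levels moves ahead of one that only a single ranker favors.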
Notes
A concern is a concept, requirement, feature, or property related to a software system (Robillard and Murphy 2007). In this work, we focus on bug reports and feature requests, which are subsets of concerns, but the proposed approach could be used for generic concerns.
When effectiveness is used as a yardstick
When rank is used as a yardstick
References
Antoniol G, Penta MD, Harman M (2004) A robust search-based approach to project management in the presence of abandonment, rework, error and uncertainty. In: Proceedings of the software metrics, 10th International Symposium. IEEE Computer Society, Washington, METRICS ’04, pp 172–183. https://doi.org/10.1109/METRICS.2004.4
Anvik J, Hiew L, Murphy GC (2005) Coping with an open bug repository. In: ETX, pp 35–39
Arcuri A, Fraser G (2011) On parameter tuning in search based software engineering. In: Search based software engineering, pp 33–47
Aslam JA, Montague M (2001) Models for metasearch. In: SIGIR 2001: Proceedings of the international ACM SIGIR conference on research and development in information retrieval, September 9-13, 2001, New Orleans, pp 275–284
Asuncion HU, Asuncion AU, Taylor RN (2010) Software traceability with topic modeling. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering. ACM, New York, pp 95–104
Binkley D, Lawrie D (2014) Learning to rank improves ir in se. In: IEEE international conference on software maintenance and evolution, pp 441–445
Blei DM, Lafferty JD (2007) Correction: a correlated topic model of science. Statistics 1(1):17–35
Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3(Jan):993–1022
Canfora G, De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2013) Multi-objective cross-project defect prediction. In: 2013 IEEE Sixth international conference on software testing, verification and validation (ICST). IEEE, Piscataway, pp 252–261
Cliff N (2014) Ordinal methods for behavioral data analysis. Psychology Press, Milton Park
Dit B, Revelle M, Poshyvanyk D (2013) Integrating information retrieval, execution and link analysis algorithms to improve feature location in software. Empir Softw Eng 18(2):277–309
Eaddy M, Aho AV, Antoniol G, Guéhéneuc YG (2008) Cerberus: Tracing requirements to source code using information retrieval, dynamic analysis, and program analysis. In: ICPC 2008. The 16th IEEE International Conference on Program Comprehension, 2008. IEEE, Piscataway, pp 53–62
Fox EA, Koushik MP, Shaw JA, Modlin R, Rao D (1992) Combining evidence from multiple searches. In: Text retrieval conference, pp 319–328
Gay G, Haiduc S, Marcus A, Menzies T (2009) On the use of relevance feedback in ir-based concept location. In: IEEE international conference on software maintenance, 2009. ICSM 2009. IEEE, Piscataway, pp 351–360
Gold N, Harman M, Li Z, Mahdavi K (2006) Allowing overlapping boundaries in source code using a search based approach to concept binding. In: 22nd IEEE international conference on software maintenance, 2006. ICSM’06. IEEE, Piscataway, pp 310–319
Goldberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading
Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci 101(suppl 1):5228–5235
Haiduc S, Bavota G, Marcus A, Oliveto R, De Lucia A, Menzies T (2013) Automatic query reformulations for text retrieval in software engineering. In: Proceedings of the international conference on software engineering, IEEE Press, ICSE, pp 842–851
Harman M, Jones BF (2001) Search-based software engineering. Inf Softw Technol 43(14):833–839
Harman M, Mansouri SA, Zhang Y (2012) Search-based software engineering: trends, techniques and applications. ACM Comput Surv (CSUR) 45(1):11
Hindle A, Barr ET, Su Z, Gabel M, Devanbu P (2012) On the naturalness of software. In: 2012 34th international conference on software engineering (ICSE). IEEE, Piscataway, pp 837–847
Holland JH (1975) Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. U Michigan Press, Ann Arbor
Hotho A, Maedche A, Staab S (2002) Ontology-based text document clustering. KI 16(4):48–54
Joseph EA (1997) Combination of multiple searches. Int J Uncertain Fuzziness Knowl-Based Syst
Kleinberg J, Tomkins A (1999) Applications of linear algebra in information retrieval and hypertext analysis
Le TDB, Oentaryo RJ, Lo D (2015) Information retrieval and spectrum based bug localization: Better together. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, New York, pp 579–590
Le XBD, Le QL, Lo D, Goues CL (2016a) Enhancing automated program repair with deductive verification. In: ICSME
Le XD, Lo D, Le Goues C (2016b) History driven program repair. In: IEEE 23rd international conference on software analysis, evolution, and reengineering, SANER 2016, Suita, March 14-18, 2016, pp 213–224
Le Goues C, Nguyen T, Forrest S, Weimer W (2012) Genprog: a generic method for automatic software repair. IEEE Trans Softw Eng 38(1):54–72
Li Z, Harman M, Hierons RM (2007) Search algorithms for regression test case prioritization. IEEE Trans Softw Eng 33(4):225–237. https://doi.org/10.1109/TSE.2007.38
Liu D, Xu S (2007) A combined concept location method for java programs. In: Computer software and applications conference, 2007. COMPSAC 2007. 31st annual international, vol 2. IEEE, Piscataway, pp 29–42
Lohar S, Amornborvornwong S, Zisman A, Cleland-Huang J (2013) Improving trace accuracy through data-driven configuration and composition of tracing features. In: Proceedings of the 2013 9th joint meeting on foundations of software engineering. ACM, New York, pp 378–388
Lucia L, Lo D, Xia X (2014) Fusion fault localizers. In: Proceedings of the 29th ACM/IEEE international conference on automated software engineering. ACM, New York, pp 127–138
Mancoridis S, Mitchell BS, Chen Y, Gansner ER (1999) Bunch: a clustering tool for the recovery and maintenance of software system structures. In: Proceedings of the IEEE international conference on software maintenance, IEEE Computer Society, Washington, ICSM ’99, p 50. http://dl.acm.org/citation.cfm?id=519621.853406
Manning C, Raghavan P, Schutze H (2008) Introduction to information retrieval. Cambridge
Marcus A, Maletic JI (2003) Recovering documentation-to-source-code traceability links using latent semantic indexing. In: ICSE 2003
Moreno L, Bandara W, Haiduc S, Marcus A (2013) On the relationship between the vocabulary of bug reports and source code. In: Proceedings of international conference on software maintenance, pp 452–455
Oliveto R, Gethers M, Poshyvanyk D, De Lucia A (2010) On the equivalence of information retrieval methods for automated traceability link recovery. In: 2010 IEEE 18th international conference on program comprehension (ICPC). IEEE, Piscataway, pp 68–71
Panichella A, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2013) How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms. In: Proceedings of the 2013 international conference on software engineering. IEEE Press, Piscataway, pp 522–531
Panichella A, Dit B, Oliveto R, Penta MD, Poshyvanyk D, Lucia AD (2016) Parameterizing and assembling ir-based solutions for software engineering tasks using genetic algorithms. In: SANER
Poshyvanyk D, Gueheneuc YG, Marcus A, Antoniol G, Rajlich VC (2007) Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval. IEEE Trans Softw Eng 33(6):420–432
Rao S, Kak AC (2011) Retrieval from software libraries for bug localization: a comparative study of generic and composite text models. In: MSR
Robillard MP, Murphy GC (2007) Representing concerns in source code. ACM Trans Softw Eng Methodol 16(1):38. https://doi.org/10.1145/1189748.1189751
Rousseeuw PJ, Kaufman L (1990) Finding groups in data. Wiley Online Library, Hoboken
Salton G, Harman D (2003) Information retrieval p 777
Sander J, Ester M, Kriegel HP, Xu X (1998) Density-based clustering in spatial databases: the algorithm gdbscan and its applications. Data Min Knowl Disc 2(2):169–194
Scanniello G, Marcus A (2011) Clustering support for static concept location in source code. In: 2011 IEEE 19th international conference on program comprehension (ICPC). IEEE, Piscataway, pp 1–10
Scanniello G, Marcus A, Pascale D (2015) Link analysis algorithms for static concept location: an empirical assessment. Empir Softw Eng 20(6):1666–1720
Shaw JA, Fox EA (2014) Combination of multiple searches. IEEE Trans Multimedia 16(1):277–282
Sun C, Lo D, Khoo SC, Jiang J (2011) Towards more accurate retrieval of duplicate bug reports. In: ASE, pp 253–262
Thomas SW (2011) Mining software repositories using topic models. In: Proceedings of the 33rd international conference on software engineering. ACM, New York, pp 1138–1139
Tonella P (2004) Evolutionary testing of classes. In: Proceedings of the 2004 ACM SIGSOFT international symposium on software testing and analysis, ACM, New York, ISSTA ’04, pp 119–128. https://doi.org/10.1145/1007512.1007528
Wallach HM, Mimno DM, McCallum A (2009) Rethinking lda: Why priors matter. In: Advances in neural information processing systems, pp 1973–1981
Wang S, Lo D (2014) Version history, similar report, and structure: Putting them together for improved bug localization. In: Proceedings of the 22nd international conference on program comprehension. ACM, New York, pp 53–63
Wang S, Lo D, Lucia JL, Lau HC (2011a) Search-based fault localization. In: Alexander P, Pasareanu CS, Hosking JG (eds) ASE. IEEE, Piscataway, pp 556–559
Wang S, Lo D, Xing Z, Jiang L (2011) Concern localization using information retrieval: an empirical study on linux kernel. In: WCRE
Wang S, Lo D, Lawall J (2014) Compositional vector space models for improved bug localization
Wang T, Harman M, Jia Y, Krinke J (2013) Searching for better configurations: a rigorous approach to clone evaluation. In: Proceedings of the 2013 9th joint meeting on foundations of software engineering. ACM, New York, pp 455–465
Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83
Wu S (2012) Data fusion in information retrieval. Adapt Learn Optim 36(2):2997–3006
Xia X, Lo D (2017) An effective change recommendation approach for supplementary bug fixes. Autom Softw Eng 24(2):455–498
Xia X, Lo D, Wang X, Zhang C, Wang X (2014) Cross-language bug localization. In: Proceedings of the 22nd international conference on program comprehension. ACM, New York, pp 275–278
Xia X, Lo D, Wang X, Zhou B (2015) Dual analysis for recommending developers to resolve bugs. Journal of Software Evolution & Process 27(3):195–220
Xia X, Lo D, Pan SJ, Nagappan N, Wang X (2016a) Hydra: Massively compositional model for cross-project defect prediction. IEEE Transactions on software Engineering
Xia X, Lo D, Wang X, Yang X (2016b) Collective personalized change classification with multiobjective search. IEEE Transactions on Reliability
Xia X, Lo D, Ding Y, Al-Kofahi JM, Nguyen TN, Wang X (2017) Improving automated bug triaging with specialized topic model. IEEE Trans Softw Eng 43(3):272–297
Xuan J, Monperrus M (2014) Learning to combine multiple ranking metrics for fault localization. In: IEEE international conference on software maintenance and evolution, pp 191–200
Ye X, Bunescu R, Liu C (2014) Learning to rank relevant files for bug reports using domain knowledge. In: ACM sigsoft international symposium on foundations of software engineering, pp 689–699
Zhang Y, Lo D, Xia X, Duy TDB, Scanniello G, Sun J (2016) Inferring links between concerns and methods with multi-abstraction vector space model. In: IEEE international conference on software maintenance and evolution ICSME, pp 110–121
Zhao W, Zhang L, Liu Y, Sun J, Yang F (2006) Sniafl: towards a static noninteractive approach to feature location. ACM Trans Softw Eng Methodol (TOSEM) 15(2):195–226
Zhou J, Zhang H, Lo D (2012) Where should the bugs be fixed? more accurate information retrieval-based bug localization based on bug reports. In: 2012 34th international conference on software engineering (ICSE). IEEE, Piscataway, pp 14–24
Acknowledgments
This work was supported by NSFC Program (No. 61602403 and 61572426).
Additional information
Communicated by: Bram Adams and Denys Poshyvanyk
Appendices
Appendix: A
The following tables show the detailed experimental results for Research Question 1. Each table presents the effectiveness and rank scores of one of the 12 variants of MULAB. For each variant, we report the number of wins, losses, and draws for each Java system. Wins, losses, and draws represent the number of concerns (when effectiveness is used as a yardstick) or methods (when rank is used as a yardstick) for which a variant outperforms the other variants, loses to another variant, and performs as well as all the other variants, respectively. We also report the overall results in the last row.
Appendix: B
The following tables show the detailed experimental results for Research Question 2. Each table presents the effectiveness and rank scores of CombMNZ-Def, the VSM model, and the 4 abstraction levels. For each method, we report the number of wins, losses, and draws for each Java system. Wins, losses, and draws represent the number of concerns (when effectiveness is used as a yardstick) or methods (when rank is used as a yardstick) for which a method performs the best, loses to another method, and performs as well as all the other methods, respectively. We also report the overall results in the last row.
About this article
Cite this article
Zhang, Y., Lo, D., Xia, X. et al. Fusing multi-abstraction vector space models for concern localization. Empir Software Eng 23, 2279–2322 (2018). https://doi.org/10.1007/s10664-017-9585-2
DOI: https://doi.org/10.1007/s10664-017-9585-2