Fusing multi-abstraction vector space models for concern localization

Zhang, Yun; Lo, David; Xia, Xin; Scanniello, Giuseppe; Le, Tien-Duy B.; Sun, Jianling

doi:10.1007/s10664-017-9585-2

Published: 27 December 2017

Empirical Software Engineering, volume 23, pages 2279–2322 (2018)

Yun Zhang¹, David Lo², Xin Xia¹,³ (ORCID: orcid.org/0000-0002-6302-3256), Giuseppe Scanniello⁴, Tien-Duy B. Le² & Jianling Sun¹

Abstract

Concern localization refers to the process of locating code units that match a particular textual description. It takes as input textual documents such as bug reports and feature requests and outputs a list of candidate code units that are relevant to the bug reports or feature requests. Many information retrieval (IR) based concern localization techniques have been proposed in the literature. These techniques typically represent code units and textual descriptions as a bag of tokens at one level of abstraction, e.g., each token is a word, or each token is a topic. In this work, we propose a multi-abstraction concern localization technique named MULAB. MULAB represents a code unit and a textual description at multiple abstraction levels. The similarity between a textual description and a code unit is then computed by considering all these abstraction levels. We combine a vector space model (VSM) and multiple topic models to compute the similarity and apply a genetic algorithm to infer semi-optimal topic model configurations. We also propose 12 variants of MULAB that use different data fusion methods. We have evaluated our solution on 175 concerns from 9 open source Java software systems. The experimental results show that the variant CombMNZ-Def performs better than the other variants and also outperforms the state-of-the-art baseline PR (a PageRank-based algorithm), proposed by Scanniello et al. (Empir Softw Eng 20(6):1666–1720, 2015), in terms of effectiveness and rank.
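
As an illustration of the data fusion step, the sketch below fuses the scores that several retrieval models assign to the same code units using the CombMNZ rule from the data fusion literature: the sum of a unit's normalized scores multiplied by the number of models that gave it a non-zero score. This is a minimal sketch under stated assumptions, not MULAB's implementation. The min-max normalization, the function names, the method names, and the toy scores are all hypothetical, and the abstract does not say what the "Def" suffix of CombMNZ-Def configures.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one model's scores to [0, 1] (assumed normalization scheme)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # A constant score list carries no ranking information.
        return {unit: 0.0 for unit in scores}
    return {unit: (s - lo) / (hi - lo) for unit, s in scores.items()}


def comb_mnz(score_lists: List[Dict[str, float]]) -> Dict[str, float]:
    """CombMNZ: (sum of normalized scores) * (number of models with a non-zero score)."""
    normalized = [min_max_normalize(s) for s in score_lists]
    units = set().union(*normalized)
    fused = {}
    for unit in units:
        # A unit that a model did not score counts as 0 for that model.
        values = [n.get(unit, 0.0) for n in normalized]
        hits = sum(1 for v in values if v > 0.0)
        fused[unit] = sum(values) * hits
    return fused


def rank(fused: Dict[str, float]) -> List[str]:
    """Order code units by fused score, highest first; ties broken by name."""
    return sorted(fused, key=lambda unit: (-fused[unit], unit))


# Hypothetical similarity scores for one concern from three abstraction levels:
# a word-level VSM and two topic models with different topic counts.
vsm = {"Parser.parse": 0.9, "Lexer.next": 0.3, "Cache.get": 0.1}
topics_50 = {"Parser.parse": 0.6, "Lexer.next": 0.7, "Cache.get": 0.2}
topics_100 = {"Parser.parse": 0.8, "Lexer.next": 0.4}

fused = comb_mnz([vsm, topics_50, topics_100])
ranking = rank(fused)
print(ranking)  # ['Parser.parse', 'Lexer.next', 'Cache.get']
```

The multiplication by the hit count is what distinguishes CombMNZ from a plain sum: a code unit that every abstraction level considers relevant is promoted over one that scores highly at a single level only.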

Notes

1. A concern is a concept, requirement, feature, or property related to a software system (Robillard and Murphy 2007). In this work, we focus on bug reports and feature requests, which are subsets of concerns, but the proposed approach could be used for generic concerns.
2. http://tartarus.org/martin/PorterStemmer/
3. http://pyevolve.sourceforge.net/
4. When effectiveness is used as a yardstick
5. When rank is used as a yardstick
6. When effectiveness is used as a yardstick
7. When rank is used as a yardstick
8. When effectiveness is used as a yardstick
9. When rank is used as a yardstick
10. When effectiveness is used as a yardstick
11. When rank is used as a yardstick
12. When effectiveness is used as a yardstick
13. When rank is used as a yardstick
14. When effectiveness is used as a yardstick
15. When rank is used as a yardstick

References

Antoniol G, Penta MD, Harman M (2004) A robust search-based approach to project management in the presence of abandonment, rework, error and uncertainty. In: Proceedings of the software metrics, 10th International Symposium. IEEE Computer Society, Washington, METRICS ’04, pp 172–183. https://doi.org/10.1109/METRICS.2004.4
Anvik J, Hiew L, Murphy GC (2005) Coping with an open bug repository. In: ETX, pp 35–39
Arcuri A, Fraser G (2011) On parameter tuning in search based software engineering. In: Search based software engineering, pp 33–47
Aslam JA, Montague M (2001) Models for metasearch. In: SIGIR 2001: Proceedings of the international ACM SIGIR conference on research and development in information retrieval, September 9–13, 2001, New Orleans, pp 275–284
Asuncion HU, Asuncion AU, Taylor RN (2010) Software traceability with topic modeling. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering. ACM, New York, pp 95–104
Binkley D, Lawrie D (2014) Learning to rank improves IR in SE. In: IEEE international conference on software maintenance and evolution, pp 441–445
Blei DM, Lafferty JD (2007) Correction: a correlated topic model of science. Statistics 1(1):17–35
Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3(Jan):993–1022
Canfora G, De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2013) Multi-objective cross-project defect prediction. In: 2013 IEEE Sixth international conference on software testing, verification and validation (ICST). IEEE, Piscataway, pp 252–261
Cliff N (2014) Ordinal methods for behavioral data analysis. Psychology Press, Milton Park
Dit B, Revelle M, Poshyvanyk D (2013) Integrating information retrieval, execution and link analysis algorithms to improve feature location in software. Empir Softw Eng 18(2):277–309
Eaddy M, Aho AV, Antoniol G, Guéhéneuc YG (2008) Cerberus: Tracing requirements to source code using information retrieval, dynamic analysis, and program analysis. In: ICPC 2008. The 16th IEEE International Conference on Program Comprehension, 2008. IEEE, Piscataway, pp 53–62
Fox EA, Koushik MP, Shaw JA, Modlin R, Rao D (1992) Combining evidence from multiple searches. In: Text retrieval conference, pp 319–328
Gay G, Haiduc S, Marcus A, Menzies T (2009) On the use of relevance feedback in IR-based concept location. In: IEEE international conference on software maintenance, 2009. ICSM 2009. IEEE, Piscataway, pp 351–360
Gold N, Harman M, Li Z, Mahdavi K (2006) Allowing overlapping boundaries in source code using a search based approach to concept binding. In: 22nd IEEE international conference on software maintenance, 2006. ICSM’06. IEEE, Piscataway, pp 310–319
Goldberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading
Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci 101(suppl 1):5228–5235
Haiduc S, Bavota G, Marcus A, Oliveto R, De Lucia A, Menzies T (2013) Automatic query reformulations for text retrieval in software engineering. In: Proceedings of the international conference on software engineering, IEEE Press, ICSE, pp 842–851
Harman M, Jones BF (2001) Search-based software engineering. Inf Softw Technol 43(14):833–839
Harman M, Mansouri SA, Zhang Y (2012) Search-based software engineering: trends, techniques and applications. ACM Comput Surv (CSUR) 45(1):11
Hindle A, Barr ET, Su Z, Gabel M, Devanbu P (2012) On the naturalness of software. In: 2012 34th international conference on software engineering (ICSE). IEEE, Piscataway, pp 837–847
Holland JH (1975) Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. U Michigan Press, Ann Arbor
Hotho A, Maedche A, Staab S (2002) Ontology-based text document clustering. KI 16(4):48–54
Joseph EA (1997) Combination of multiple searches. Int J Uncertain Fuzziness Knowl-Based Syst
Kleinberg J, Tomkins A (1999) Applications of linear algebra in information retrieval and hypertext analysis
Le TDB, Oentaryo RJ, Lo D (2015) Information retrieval and spectrum based bug localization: Better together. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering. ACM, New York, pp 579–590
Le XBD, Le QL, Lo D, Goues CL (2016a) Enhancing automated program repair with deductive verification. In: ICSME
Le XD, Lo D, Le Goues C (2016b) History driven program repair. In: IEEE 23rd international conference on software analysis, evolution, and reengineering, SANER 2016, Suita, March 14–18, 2016, pp 213–224
Le Goues C, Nguyen T, Forrest S, Weimer W (2012) GenProg: a generic method for automatic software repair. IEEE Trans Softw Eng 38(1):54–72
Li Z, Harman M, Hierons RM (2007) Search algorithms for regression test case prioritization. IEEE Trans Softw Eng 33(4):225–237. https://doi.org/10.1109/TSE.2007.38
Liu D, Xu S (2007) A combined concept location method for Java programs. In: Computer software and applications conference, 2007. COMPSAC 2007. 31st annual international, vol 2. IEEE, Piscataway, pp 29–42
Lohar S, Amornborvornwong S, Zisman A, Cleland-Huang J (2013) Improving trace accuracy through data-driven configuration and composition of tracing features. In: Proceedings of the 2013 9th joint meeting on foundations of software engineering. ACM, New York, pp 378–388
Lucia L, Lo D, Xia X (2014) Fusion fault localizers. In: Proceedings of the 29th ACM/IEEE international conference on automated software engineering. ACM, New York, pp 127–138
Mancoridis S, Mitchell BS, Chen Y, Gansner ER (1999) Bunch: a clustering tool for the recovery and maintenance of software system structures. In: Proceedings of the IEEE international conference on software maintenance, IEEE Computer Society, Washington, ICSM ’99, p 50. http://dl.acm.org/citation.cfm?id=519621.853406
Manning C, Raghavan P, Schutze H (2008) Introduction to information retrieval. Cambridge
Marcus A, Maletic JI (2003) Recovering documentation-to-source-code traceability links using latent semantic indexing. In: ICSE 2003
Moreno L, Bandara W, Haiduc S, Marcus A (2013) On the relationship between the vocabulary of bug reports and source code. In: Proceedings of international conference on software maintenance, pp 452–455
Oliveto R, Gethers M, Poshyvanyk D, De Lucia A (2010) On the equivalence of information retrieval methods for automated traceability link recovery. In: 2010 IEEE 18th international conference on program comprehension (ICPC). IEEE, Piscataway, pp 68–71
Panichella A, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2013) How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms. In: Proceedings of the 2013 international conference on software engineering. IEEE Press, Piscataway, pp 522–531
Panichella A, Dit B, Oliveto R, Penta MD, Poshyvanyk D, Lucia AD (2016) Parameterizing and assembling IR-based solutions for software engineering tasks using genetic algorithms. In: SANER
Poshyvanyk D, Gueheneuc YG, Marcus A, Antoniol G, Rajlich VC (2007) Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval. IEEE Trans Softw Eng 33(6):420–432
Rao S, Kak AC (2011) Retrieval from software libraries for bug localization: a comparative study of generic and composite text models. In: MSR
Robillard MP, Murphy GC (2007) Representing concerns in source code. ACM Trans Softw Eng Methodol 16(1):38. https://doi.org/10.1145/1189748.1189751
Rousseeuw PJ, Kaufman L (1990) Finding groups in data. Wiley Online Library, Hoboken
Salton G, Harman D (2003) Information retrieval p 777
Sander J, Ester M, Kriegel HP, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Disc 2(2):169–194
Scanniello G, Marcus A (2011) Clustering support for static concept location in source code. In: 2011 IEEE 19th international conference on program comprehension (ICPC). IEEE, Piscataway, pp 1–10
Scanniello G, Marcus A, Pascale D (2015) Link analysis algorithms for static concept location: an empirical assessment. Empir Softw Eng 20(6):1666–1720
Shaw JA, Fox EA (2014) Combination of multiple searches. IEEE Trans Multimedia 16(1):277–282
Sun C, Lo D, Khoo SC, Jiang J (2011) Towards more accurate retrieval of duplicate bug reports. In: ASE, pp 253–262
Thomas SW (2011) Mining software repositories using topic models. In: Proceedings of the 33rd international conference on software engineering. ACM, New York, pp 1138–1139
Tonella P (2004) Evolutionary testing of classes. In: Proceedings of the 2004 ACM SIGSOFT international symposium on software testing and analysis, ACM, New York, ISSTA ’04, pp 119–128. https://doi.org/10.1145/1007512.1007528
Wallach HM, Mimno DM, McCallum A (2009) Rethinking LDA: Why priors matter. In: Advances in neural information processing systems, pp 1973–1981
Wang S, Lo D (2014) Version history, similar report, and structure: Putting them together for improved bug localization. In: Proceedings of the 22nd international conference on program comprehension. ACM, New York, pp 53–63
Wang S, Lo D, Lucia JL, Lau HC (2011a) Search-based fault localization. In: Alexander P, Pasareanu CS, Hosking JG (eds) ASE. IEEE, Piscataway, pp 556–559
Wang S, Lo D, Xing Z, Jiang L (2011) Concern localization using information retrieval: an empirical study on Linux kernel. In: WCRE
Wang S, Lo D, Lawall J (2014) Compositional vector space models for improved bug localization
Wang T, Harman M, Jia Y, Krinke J (2013) Searching for better configurations: a rigorous approach to clone evaluation. In: Proceedings of the 2013 9th joint meeting on foundations of software engineering. ACM, New York, pp 455–465
Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83
Wu S (2012) Data fusion in information retrieval. Adapt Learn Optim 36(2):2997–3006
Xia X, Lo D (2017) An effective change recommendation approach for supplementary bug fixes. Autom Softw Eng 24(2):455–498
Xia X, Lo D, Wang X, Zhang C, Wang X (2014) Cross-language bug localization. In: Proceedings of the 22nd international conference on program comprehension. ACM, New York, pp 275–278
Xia X, Lo D, Wang X, Zhou B (2015) Dual analysis for recommending developers to resolve bugs. Journal of Software Evolution & Process 27(3):195–220
Xia X, Lo D, Pan SJ, Nagappan N, Wang X (2016a) Hydra: Massively compositional model for cross-project defect prediction. IEEE Trans Softw Eng
Xia X, Lo D, Wang X, Yang X (2016b) Collective personalized change classification with multiobjective search. IEEE Trans Reliab
Xia X, Lo D, Ding Y, Al-Kofahi JM, Nguyen TN, Wang X (2017) Improving automated bug triaging with specialized topic model. IEEE Trans Softw Eng 43(3):272–297
Xuan J, Monperrus M (2014) Learning to combine multiple ranking metrics for fault localization. In: IEEE international conference on software maintenance and evolution, pp 191–200
Ye X, Bunescu R, Liu C (2014) Learning to rank relevant files for bug reports using domain knowledge. In: ACM sigsoft international symposium on foundations of software engineering, pp 689–699
Zhang Y, Lo D, Xia X, Duy TDB, Scanniello G, Sun J (2016) Inferring links between concerns and methods with multi-abstraction vector space model. In: IEEE international conference on software maintenance and evolution ICSME, pp 110–121
Zhao W, Zhang L, Liu Y, Sun J, Yang F (2006) SNIAFL: towards a static noninteractive approach to feature location. ACM Trans Softw Eng Methodol (TOSEM) 15(2):195–226
Zhou J, Zhang H, Lo D (2012) Where should the bugs be fixed? more accurate information retrieval-based bug localization based on bug reports. In: 2012 34th international conference on software engineering (ICSE). IEEE, Piscataway, pp 14–24

Acknowledgments

This work was supported by the NSFC Program (Nos. 61602403 and 61572426).

Author information

Authors and Affiliations

College of Computer Science and Technology, Zhejiang University, Hangzhou Shi, Zhejiang Sheng, China
Yun Zhang, Xin Xia & Jianling Sun
School of Information Systems, Singapore Management University, Singapore, Singapore
David Lo & Tien-Duy B. Le
Faculty of Information Technology, Monash University, Scenic Blvd, Clayton, VIC, 3800, Australia
Xin Xia
University of Basilicata, Potenza, Italy
Giuseppe Scanniello

Corresponding author

Correspondence to Xin Xia.

Additional information

Communicated by: Bram Adams and Denys Poshyvanyk

Appendices

Appendix: A

The following tables show the detailed experimental results for Research Question 1. Each table presents the effectiveness and rank scores of one of the 12 variants of MULAB. For each variant, we report the number of wins, losses, and draws for each Java system. Wins, losses, and draws represent the number of concerns (note 12) or methods (note 13) for which a variant outperforms the other variants, loses to another variant, and performs as well as all the other variants, respectively. We also report the overall results in the last row.

Table 17 Data analysis results on effectiveness and rank scores of MULAB_{basic}
