A survey on the use of topic models when mining software repositories

Empirical Software Engineering

Abstract

Researchers in software engineering have attempted to improve software development by mining and analyzing software repositories. Since the majority of software engineering data is unstructured, researchers have applied Information Retrieval (IR) techniques to help software development. Recent advances in IR, especially statistical topic models, have further helped make sense of the unstructured data in software repositories. However, even though there are hundreds of studies on applying topic models to software repositories, no study shows how the models are used in the software engineering research community or which software engineering tasks are being supported through topic models. Moreover, since the performance of these topic models is directly related to the model parameters and usage, knowing how researchers use the topic models may also help future studies make optimal use of such models. Thus, we surveyed 167 articles from the software engineering literature that make use of topic models. We find that i) most studies centre around a limited number of software engineering tasks; ii) most studies use only basic topic models; and iii) researchers usually treat topic models as black boxes without fully exploring their underlying assumptions and parameter values. Our paper provides a starting point for new researchers who are interested in using topic models, and may help new researchers and practitioners determine how best to apply topic models to a particular software engineering task.

Notes

  1. The creators of LSI call these reduced dimensions “concepts”, not “topics”. However, to be consistent with other topic modeling approaches, we will use the term “topics”.

References

  • Abebe SL, Alicante A, Corazza A, Tonella P (2013) Supporting concept location through identifier parsing and ontology extraction. J Syst Softw 86 (11):2919–2938

    Article  Google Scholar 

  • Ahsan SN, Ferzund J, Wotawa F (2009) Automatic software bug triage system (BTS) based on Latent Semantic Indexing and Support Vector Machine. In: Proceedings of the 4th international conference on software engineering advances, pp 216–221

  • Alhindawi N, Dragan N, Collard M, Maletic J (2013a) Improving feature location by enhancing source code with stereotypes. In: Proceedings of the 2013 29th IEEE international conference on software maintenance, pp 300–309

  • Alhindawi N, Meqdadi O, Bartman B, Maletic J (2013b) A tracelab-based solution for identifying traceability links using lsi. In: Proceedings of 2013 international workshop on traceability in emerging forms of software engineering (TEFSE), pp 79–82

  • Ali N, Sabane A, Gueheneuc Y, Antoniol G (2012) Improving bug location using binary class relationships. In: Proceedings of the 2012 IEEE 12th international working conference on source code analysis and manipulation, pp 174–183

  • Ali N, Sharafi Z, Gueheneuc Y-G, Antoniol G (2014) An empirical study on the importance of source code entities for requirements traceability. Empirical Softw Eng 20:442–478

  • Alipour A, Hindle A, Stroulia E (2013) A contextual approach towards more accurate duplicate bug report detection. In: Proceedings of the 10th working conference on mining software repositories. MSR ’13, pp 183–192

  • Allamanis M., Sutton C (2013) Why, when, and what: analyzing stack overflow questions by topic, type, and code. In: Proceedings of the 10th working conference on mining software repositories. MSR ’13, pp 53–56

  • Andrzejewski D, Mulhern A, Liblit B, Zhu X (2007) Statistical debugging using latent topic models. In: Proceedings of the 18th European conference on machine learning, pp 6–17

  • Anthes G (2010) Topic models vs. unstructured data. Commun ACM 53:16–18

    Google Scholar 

  • Antoniol G, Hayes JH, Gueheneuc YG, Di Penta M (2008) Reuse or rewrite: combining textual, static, and dynamic analyses to assess the cost of keeping a system up-to-date. In: Proceedings of the 24th international conference on software maintenance, pp 147–156

  • Asadi F, Antoniol G, Gueheneuc Y-G (2010a) Concept location with genetic algorithms: a comparison of four distributed architectures. In: Proceedings of the 2nd international symposium on search based software engineering. SSBSE ’10, pp 153–162

  • Asadi F, Di Penta M, Antoniol G, Guéhéneuc Y-G (2010b) A heuristic-based approach to identify concepts in execution traces. In: Proceedings of the 2010 14th European conference on software maintenance and reengineering (CSMR), pp 31–40

  • Asuncion HU, Asuncion AU, Taylor RN (2010) Software traceability with topic modeling. In: Proceedings of the 32nd international conference on software engineering, pp 95–104

  • Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval, vol 463. ACM Press, New York

    Google Scholar 

  • Bajaj K, Pattabiraman K, Mesbah A (2014) Mining questions asked by web developers. In: Proceedings of the 11th working conference on mining software repositories. MSR 2014, pp 112–121

  • Bajracharya S, Lopes C (2009) Mining search topics from a code search engine usage log. In: Proceedings of the 6th international working conference on mining software repositories, pp 111–120

  • Bajracharya SK, Lopes CV (2010) Analyzing and mining a code search engine usage log. Empir Softw Eng 17:1–43

  • Baldi PF, Lopes CV, Linstead EJ, Bajracharya SK (2008) A theory of aspects as latent topics. ACM SIGPLAN Not 43(10):543–562

    Article  Google Scholar 

  • Barnard K, Duygulu P, Forsyth D, De Freitas N, Blei DM, Jordan MI (2003) Matching words and pictures. J Mach Learn Res 3:1107–1135

    MATH  Google Scholar 

  • Bartholomew DJ (1987) Latent variable models and factors analysis. Oxford University Press Inc., New York

    MATH  Google Scholar 

  • Barua A, Thomas SW, Hassan AE (2012) What are developers talking about? An analysis of topics and trends in Stack Overflow. Empir Softw Eng 19:619–654

  • Bassett B, Kraft N (2013) Structural information based term weighting in text retrieval for feature location. In: Proceedings of the 2013 21st IEEE international conference on program comprehension , pp 133–141

  • Bavota G, De Lucia A, Marcus A, Oliveto R (2010) A two-step technique for extract class refactoring. In: Proceedings of the 25th international conference on automated software engineering, pp 151–154

  • Bavota G, Lucia AD, Marcus A, Oliveto R (2012) Using structural and semantic measures to improve software modularization. Empir Softw Eng 18:901–932

  • Bavota G, De Lucia A, Oliveto R, Panichella A, Ricci F, Tortora G (2013) The role of artefact corpus in lsi-based traceability recovery. In: Proceedings of the 2013 international workshop on traceability in emerging forms of software engineering, pp 83–89

  • Bavota G, Oliveto R, Gethers M, Poshyvanyk D, De Lucia A (2014) Methodbook: recommending move method refactorings via relational topic models. IEEE Trans Softw Eng 40(7):671–694

    Article  Google Scholar 

  • Beard M, Kraft N, Etzkorn L, Lukins S (2011) Measuring the accuracy of information retrieval based bug localization techniques. In: Proceedings of the 2011 18th working conference on reverse engineering. WCRE ’11, pp 124–128

  • Bellon S, Koschke R, Antoniol G, Krinke J, Merlo E (2007) Comparison and evaluation of clone detection tools. IEEE Trans Softw Eng 33:577–591

  • Bettenburg N, Adams B (2010) Workshop on Mining Unstructured Data (MUD) because “Mining Unstructured Data is like fishing in muddy waters! In: Proceedings of the 17th working conference on reverse engineering, pp 277–278

  • Biggers LR, Bocovich C, Capshaw R, Eddy BP, Etzkorn LH, Kraft NA (2014) Configuring latent dirichlet allocation based feature location. Empir Softw Eng 19(3):465–500

    Article  Google Scholar 

  • Binkley D, Lawrie D, Uehlinger C (2012) Vocabulary normalization improves ir-based concept location. In: Proceedings of the 2012 28th IEEE international conference on software maintenance, pp 588–591

  • Binkley D, Heinz D, Lawrie D, Overfelt J (2014) Understanding lda in source code analysis. In: Proceedings of the 22nd international conference on program comprehension. ICPC 2014, pp 26–36

  • Bishop CM (1998) Latent variable models. Learning in graphical models

  • Blei DM, Lafferty JD (2006) Dynamic topic models. In: Proceedings of the 23rd international conference on Machine learning. ACM, pp 113–120

  • Blei DM, Lafferty JD (2007) A correlated topic model of science. Ann Appl Stat 1(1):17–35

    Article  MathSciNet  MATH  Google Scholar 

  • Blei DM, Lafferty JD (2009) Topic models. In: Text mining: classification, clustering, and applications. Chapman & Hall, London, pp 71–94

  • Blei DM, McAuliffe J (2008) Supervised topic models. Adv Neural Inf Proc Syst 20:121–128

    Google Scholar 

  • Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

    MATH  Google Scholar 

  • Blei D, Griffiths TL, Jordan MI, Tenenbaum JB (2004) Hierarchical topic models and the nested Chinese restaurant process. Adv Neural Inf Proc Syst 16:106

    Google Scholar 

  • Blei DM, Griffiths TL, Jordan MI (2010) The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies. J ACM 57(2):1–30

    Article  MathSciNet  MATH  Google Scholar 

  • Blumberg R, Atre S (2003) The problem with unstructured data. DM Rev 13:42–49

    Google Scholar 

  • Borg M, Runeson P, Ardö A (2014) Recovering from a decade: a systematic mapping of information retrieval approaches to software traceability. Empir Softw Eng 19(6):1565–1616

    Article  Google Scholar 

  • Bose JC, Suresh U (2008) Root cause analysis using sequence alignment and Latent Semantic Indexing. In: Proceedings of the 19th Australian conference on software engineering, pp 67–376

  • Bradford RB (2008) An empirical study of required dimensionality for large-scale latent semantic indexing applications. In: Proceedings of the 17th ACM conference on information and knowledge management. CIKM ’08, pp 153–162

  • Brickey J, Walczak S, Burgess T (2012) Comparing semi-automated clustering methods for persona development. IEEE Trans Softw Eng 38(3):537–546

    Article  Google Scholar 

  • Campbell J-C, Zhang C, Xu Z, Hindle A, Miller J (2013) Deficient documentation detection a methodology to locate deficient project documentation using topic analysis. In: Proceedings of the 2013 10th IEEE working conference on mining software repositories, pp 57–60

  • Canfora G, Cerulo L, Cimitile M, Di Penta M (2014) How changes affect software entropy: an empirical study. Empir Softw Eng 19(1):1–38

    Article  Google Scholar 

  • Capobianco G, De Lucia A, Oliveto R, Panichella A, Panichella S (2009) Traceability recovery using numerical analysis. In: Proceedings of the 16th working conference on reverse engineering, pp 195–204

  • Carpineto C, Romano G (2012) A survey of automatic query expansion in information retrieval. ACM Comput Surv 44(1):1–50

    Article  MATH  Google Scholar 

  • Chang J, Blei DM (2009) Relational topic models for document networks. In: Proceedings of the 12th international conference on artificial intelligence and statistics, pp 81–88

  • Chang J, Boyd-Graber J, Blei DM (2009) Connections between the lines: augmenting social networks with text. In: Proceedings of the 15th international conference on knowledge discovery and data mining, pp 169–178

  • Chen T-H, Thomas SW, Nagappan M, Hassan AE (2012) Explaining software defects using topic models. In: Proceedings of the 2012 9th IEEE working conference on mining software repositories. MSR ’12, pp 189–198

  • Chen T-H, Thomas SW, Hemmati H, Nagappan M, Hassan AE (2015) An empirical study on topic defect-proneness and testedness. In: under submission

  • Cleary B, Exton C, Buckley J, English M (2008) An empirical analysis of information retrieval based concept location techniques in software comprehension. Empir Softw Eng 14(1):93–130

    Article  Google Scholar 

  • Comon P (1994) Independent component analysis, a new concept? Sig Process 36(3):287–314

    Article  MATH  Google Scholar 

  • Corley C, Kammer E, Kraft N (2012) Modeling the ownership of source code topics. In: Proceedings of the 2012 IEEE 20th international conference on program comprehension, pp 173–182

  • Dallmeier V, Zimmermann T (2007) Extraction of bug localization benchmarks from history. In: Proceedings of the 22nd international conference on automated software engineering, pp 433–436

  • Dasgupta T, Grechanik M, Moritz E, Dit B, Poshyvanyk D (2013) Enhancing software traceability by automatically expanding corpora with relevant documentation. In: Proceedings of the 2013 IEEE international conference on software maintenance. ICSM ’13, pp 320–329

  • de Boer RC, van Vliet H (2008) Architectural knowledge discovery with Latent Semantic Analysis: constructing a reading guide for software product audits. J Syst Softw 81(9):1456–1469

    Article  Google Scholar 

  • De Lucia A, Fasano F, Oliveto R, Tortora G (2004) Enhancing an artefact management system with traceability recovery features. In: Proceedings of the 20th international conference on software maintenance, pp 306–315

  • De Lucia A, Fasano F, Oliveto R, Tortora G (2006) Can information retrieval techniques effectively support traceability link recovery? In: Proceedings of the 14th international conference on program comprehension, pp 307–316

  • De Lucia A, Fasano F, Oliveto R, Tortora G (2007) Recovering traceability links in software artifact management systems using information retrieval methods. ACM Trans Softw Eng Methodol 16(4):13–50

  • De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2011) Improving ir-based traceability recovery using smoothing filters. In: Proceedings of the 2011 IEEE 19th international conference on program comprehension, pp 21–30

  • De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2012) Using ir methods for labeling source code artifacts: is it worthwhile? In: Proceedings of the 2012 20th international conference on program comprehension, pp 193–202

  • De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2014) Labeling source code with information retrieval methods: an empirical study. Empir Softw Eng 19:1–38

  • Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407

    Article  Google Scholar 

  • Dietterich TG (2000) Ensemble methods in machine learning. In: Proceedings of the first international workshop on multiple classifier systems. MCS ’00, pp 1–15

  • Dit B, Poshyvanyk D, Marcus A (2008) Measuring the semantic similarity of comments in bug reports. In: Proceedings 1st international workshop on semantic technologies in system maintenance

  • Dit B, Panichella A, Moritz E, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2013a) Configuring topic models for software engineering tasks in tracelab. In: Proceedings of the 2013 international workshop on traceability in emerging forms of software engineering, pp 105–109

  • Dit B, Revelle M, Poshyvanyk D (2013b) Integrating information retrieval, execution and link analysis algorithms to improve feature location in software. Empir Softw Eng 18(2):277–309

    Article  Google Scholar 

  • Dit B, Moritz E, Linares-Vásquez M, Poshyvanyk D (2013c) Supporting and accelerating reproducible research in software maintenance using tracelab component library. In: Proceedings of the 2013 IEEE international conference on software maintenance. ICSM ’13, pp 330–339

  • Dit B, Moritz E, Linares-Vásquez M, Poshyvanyk D, Cleland-Huang J (2014) Supporting and accelerating reproducible empirical research in software evolution and maintenance using tracelab component library. Empir Softw Eng 1–39

  • Eddy B, Robinson J, Kraft N, Carver J (2013) Evaluating source code summarization techniques: replication and expansion. In: Proceedings of 2013 IEEE 21st international conference on program comprehension, pp 13–22

  • Eyal-Salman H, Seriai A-D, Dony C (2013) Feature-to-code traceability in legacy software variants. In: Proceedings of 2013 39th EUROMICRO conference on the software engineering and advanced applications, pp 57–61

  • Flaherty P, Giaever G, Kumm J, Jordan MI, Arkin AsP (2005) A latent variable model for chemogenomic profiling. Bioinformatics 21(15):3286

    Article  Google Scholar 

  • Gall CS, Lukins S, Etzkorn L, Gholston S, Farrington P, Utley D, Fortune J, Virani S (2008) Semantic software metrics computed from natural language design specifications. IET Softw 2(1):17–26

    Article  Google Scholar 

  • Galvis Carreño LV, Winbladh K (2013) Analysis of user comments: an approach for software requirements evolution. In: Proceedings of the 2013 international conference on software engineering. ICSE ’13, pp 582–591

  • Gamma E (2007) JHotDraw. http://www.jhotdraw.org/

  • Gethers M, Poshyvanyk D (2010) Using relational topic models to capture coupling among classes in object-oriented software systems. In: Proceedings of the 26th international conference on software maintenance, pp 1–10

  • Gethers M, Kagdi H, Dit B, Poshyvanyk D (2011a) An adaptive approach to impact analysis from change requests to source code. In: Proceedings of the 2011 26th IEEE/ACM international conference on automated software engineering, pp 540–543

  • Gethers M, Savage T, Di Penta M, Oliveto R, Poshyvanyk D, De Lucia A (2011b) Codetopics: which topic am I coding now? In: Proceedings of the 33rd international conference on software engineering. ICSE ’11, pp 1034–1036

  • Gethers M, Oliveto R, Poshyvanyk D, Lucia A (2011c) On integrating orthogonal information retrieval methods to improve traceability recovery. In: Proceedings of the 27th international conference on software maintenance, pp 133–142

  • Gethers M, Oliveto R, Poshyvanyk D, Lucia AD (2011d) On integrating orthogonal information retrieval methods to improve traceability recovery. In: Proceedings of the 2011 27th international conference on software maintenance. ICSM ’11, pp 133–142

  • Gethers M, Dit B, Kagdi H, Poshyvanyk D (2012) Integrated impact analysis for managing software changes. In: Proceedings of the 2012 34th international conference on software engineering, pp 430–440

  • Godfrey MW, Hassan AE, Herbsleb J, Murphy GC, Robillard M, Devanbu P, Mockus A, Perry DE, Notkin D (2008) Future of mining software archives: a roundtable. IEEE Softw 26(1):67–70

    Article  Google Scholar 

  • Gorla A, Tavecchia I, Gross F, Zeller A (2014) Checking app behavior against app descriptions. In: Proceedings of the 36th international conference on software engineering. ICSE 2014, pp 1025–1035

  • Grant S, Cordy JR (2009) Vector space analysis of software clones. In: Proceedings of the 17th international conference on program comprehension, pp 233–237

  • Grant S, Cordy JR (2010) Estimating the optimal number of latent concepts in source code analysis. In: Proceedings of the 10th international working conference on source code analysis and manipulation, pp 65–74

  • Grant S, Cordy J (2014) Examining the relationship between topic model similarity and software maintenance. In: Proceedings of the 2014 software evolution week—IEEE conference on software maintenance, reengineering and reverse engineering (CSMR-WCRE), pp 303–307

  • Grant S, Cordy JR, Skillicorn D (2008) Automated concept location using independent component analysis. In: Proceedings of the 15th working conference on reverse engineering, pp 138–142

  • Grant S, Martin D, Cordy J, Skillicorn D (2011a) Contextualized semantic analysis of web services. In: Proceedings of the 2011 13th IEEE international symposium on web systems evolution, pp 33–42

  • Grant S, Cordy JR, Skillicorn DB (2011b) Reverse engineering co-maintenance relationships using conceptual analysis of source code. In: Proceedings of the 2011 18th working conference on reverse engineering. WCRE ’11, pp 87–91

  • Grant S, Cordy J, Skillicorn D (2012) Using topic models to support software maintenance. In: Proceedings of 2012 16th European conference on the software maintenance and reengineering (CSMR) , pp 403–408

  • Grechanik M, Fu C, Xie Q, McMillan C, Poshyvanyk D, Cumby C (2010) A search engine for finding highly relevant applications. In: Proceedings of the 32nd international conference on software engineering, pp 475–484

  • Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci 101:5228–5235

    Article  Google Scholar 

  • Griffiths TL, Steyvers M, Tenenbaum JB (2007) Topics in semantic representation. Psychol Rev 114(2):211–244

    Article  Google Scholar 

  • Grimes S (2008) Unstructured data and the 80 percent rule. Clarabridge Bridgepoints

  • Guo W, Diab M (2012) Modeling sentences in the latent space. In: Proceedings of the 50th annual meeting of the association for computational linguistics: long papers—volume 1. ACL ’12, pp 864–872

  • Haiduc S, Bavota G, Marcus A, Oliveto R, De Lucia A, Menzies T (2013) Automatic query reformulations for text retrieval in software engineering. In: Proceedings of the 2013 international conference on software engineering. ICSE ’13, pp 842–851

  • Hall D, Jurafsky D, Manning CD (2008) Studying the history of ideas using topic models. In: Proceedings of the conference on empirical methods in natural language processing. ACL, pp 363–371

  • Han D, Zhang C, Fan X, Hindle A, Wong K, Stroulia E (2012) Understanding android fragmentation with topic analysis of vendor-specific bugs. In: Proceedings of the 2012 19th working conference on reverse engineering. WCRE ’12, pp 83–92

  • Hassan AE (2004) Mining software repositories to assist developers and support managers. Ph.D. thesis, University of Waterloo, Waterloo

  • Hassan AE (2008) The road ahead for mining software repositories. In: Frontiers of software maintenance, pp 48–57

  • Hassan AE, Holt RC (2005) The top ten list: dynamic fault prediction. In: Proceedings of the 21st international conference on software maintenance, pp 263–272

  • Hassan AE, Xie T (2010) Software intelligence: the future of mining software engineering data. In: Proceedings of the FSE/SDP workshop on future of software engineering research, pp 161–166

  • Hayes JH, Dekhtyar A, Sundaram SK (2006) Advancing candidate link generation for requirements tracing: the study of methods. IEEE Trans Softw Eng IEEE Trans 4–19

  • Hindle A, Godfrey MW, Holt RC (2009) What’s hot and what’s not: windowed developer topic analysis. In: Proceedings of the 25th international conference on software maintenance, pp 339–348

  • Hindle A, Godfrey MW, Holt RC (2010) Software process recovery using recovered unified process views. In: Proceedings of the 26th international conference on software maintenance, pp 1–10

  • Hindle A, Ernst NA, Godfrey MW, Mylopoulos J (2011) Automated topic naming to support cross-project analysis of software maintenance activities. In: Proceedings of the 8th working conference on mining software repositories. MSR ’11, pp 163–172

  • Hindle A, Ernst NA, Godfrey MW, Mylopoulos J (2012a) Automated topic naming—supportin cross-project analysis of software maintenance activities. Empir Softw Eng

  • Hindle A, Barr ET, Su Z, Gabel M, Devanbu PT (2012b) On the naturalness of software. In: Proceedings of the 34th international conference on software engineering, pp 837–847

  • Hindle A, Bird C, Zimmermann T, Nagappan N (2012c) Relating requirements to implementation via topic analysis: do topics extracted from requirements make sense to managers and developers? In: Proceedings of the 2012 28th IEEE international conference on software maintenance, pp 243–252

  • Hindle A, Bird C, Zimmermann T, Nagappan N (2014) Do topics make sense to managers and developers? Empir Softw Eng 20:479–515

  • Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd international conference on research and development in information retrieval, pp 50–57

  • Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42(1):177–196

    Article  MATH  Google Scholar 

  • Hu W, Wong K (2013) Using citation influence to predict software defects. In: Proceedings of the 2013 10th IEEE working conference on mining software repositories, pp 419–428

  • Hu X, Cai Z, Franceschetti D, Penumatsa P, Graesser AC (2003) LSA: the first dimension and dimensional weighting. In: Proceedings of the 25th meeting of the Cognitive Science Society, Boston. Cognitive Science Society, pp 1–6

  • Iacob C, Harrison R (2013) Retrieving and analyzing mobile apps feature requests from online reviews. In: Proceedings of the 2013 10th IEEE working conference on mining software repositories, pp 41–44

  • Islam M, Marchetto A, Susi A, Kessler F, Scanniello G (2012a) Motcp: a tool for the prioritization of test cases based on a sorting genetic algorithm and latent semantic indexing. In: Proceedings of the 2012 28th IEEE international conference on software maintenance (ICSM), pp 654–657

  • Islam M, Marchetto A, Susi A, Scanniello G (2012b) A multi-objective technique to prioritize test cases based on latent semantic indexing. In: Proceedings of the 2012 16th European conference on software maintenance and reengineering, pp 21–30

  • Jiang H, Nguyen TN, Chen I, Jaygarl H, Chang C (2008) Incremental Latent Semantic Indexing for automatic traceability link evolution management. In: Proceedings of the 23rd international conference on automated software engineering, pp 59–68

  • Jin O, Liu NN, Zhao K, Yu Y, Yang Q (2011) Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of the 20th ACM international conference on information and knowledge management. CIKM ’11, pp 775–784

  • Jolliffe I (2002) Principal component analysis. Springer, New York

    MATH  Google Scholar 

  • Kagdi H, Gethers M, Poshyvanyk D, Collard M (2010) Blending conceptual and evolutionary couplings to support change impact analysis in source code. In: Proceedings of the 17th working conference on reverse engineering, pp 119–128

  • Kagdi H, Gethers M, Poshyvanyk D, Hammad M (2012a) Assigning change requests to software developers. J Softw Evol Process 24(1):3–33

    Article  Google Scholar 

  • Kagdi H, Gethers M, Poshyvanyk D (2012b) Integrating conceptual and logical couplings for change impact analysis in software. Empir Softw Eng 18:933–969

  • Kamei Y, Matsumoto S, Monden A, Matsumoto K, Adams B, Hassan AE (2010) Revisiting common bug prediction findings using effort-aware models. In: Proceedings of the 26th international conference on software maintenance, pp 1–10

  • Kaushik N., Tahvildari L (2012) A comparative study of the performance of ir models on duplicate bug detection. In: Proceedings of the 2012 16th European conference on software maintenance and reengineering. CSMR ’12, pp 159–168

  • Kaushik N, Tahvildari L, Moore M (2011) Reconstructing traceability between bugs and test cases: an experimental study. In: Proceedings of the 2011 18th working conference on reverse engineering. WCRE ’11, pp 411–414

  • Kawaguchi S, Garg PK, Matsushita M, Inoue K (2006) Mudablue: an automatic categorization system for open source repositories. J Syst Softw 79(7):939–953

    Article  Google Scholar 

  • Kelly M, Alexander J, Adams B, Hassan A (2011) Recovering a balanced overview of topics in a software domain. In: Proceedings of the 2011 11th IEEE international working conference on source code analysis and manipulation (SCAM), pp 135–144

  • Kitchenham BA, Budgen D, Pearl Brereton O (2011) Using mapping studies as the basis for further research—a participant-observer case study. Inf Softw Technol 53(6):638–651

    Article  Google Scholar 

  • Kouters E, Vasilescu B, Serebrenik A, van den Brand M (2012) Who’s who in gnome: using lsa to merge software repository identities. In: Proceedings of the 2012 28th IEEE international conference on software maintenance, pp 592–595

  • Kuhn A, Ducasse S, Girba T (2005) Enriching reverse engineering with semantic clustering. In: Proceedings of the 12th working conference on reverse engineering, pp 133–142

  • Kuhn A, Ducasse S, Girba T (2007) Semantic clustering: identifying topics in source code. Inf Softw Technol 49(3):230–243

    Article  Google Scholar 

  • Kuhn A, Loretan P, Nierstrasz O (2008) Consistent layout for thematic software maps. In: Proceedings of the 15th working conference on reverse engineering, pp 209–218

  • Kuhn A, Erni D, Loretan P, Nierstrasz O (2010) Software cartography: thematic software visualization with consistent layout. J Softw Maint Evol Res Pract 22:191–210

  • Le T-D, Wang S, Lo D (2013) Multi-abstraction concern localization. In: Proceedings of the 2013 29th IEEE international conference on software maintenance (ICSM), pp 364–367

  • Lehman MM (1980) Programs, life cycles, and laws of software evolution. Proc IEEE 68(9):1060–1076

    Article  Google Scholar 

  • Li W, McCallum A (2006) Pachinko allocation: DAG-structured mixture models of topic correlations. In: Proceedings of the 23rd international conference on machine learning, pp 577–584

  • Limsettho N, Hata H, Matsumoto K-I (2014) Comparing hierarchical dirichlet process with latent dirichlet allocation in bug report multiclass classification. In: Proceedings of 2014 15th IEEE/ACIS international conference on software engineering, artificial intelligence, networking and parallel/distributed computing (SNPD), pp 1–6

  • Lin MY, Amor R, Tempero E (2006) A Java reuse repository for Eclipse using LSI. In: Proceedings of the 2006 Australian software engineering conference, pp 351–362

  • Linares-Vásquez M, Dit B, Poshyvanyk D (2013) An exploratory analysis of mobile development issues using Stack Overflow. In: Proceedings of the 10th working conference on mining software repositories. MSR ’13, pp 93–96

  • Linstead E, Baldi P (2009) Mining the coherence of GNOME bug reports with statistical topic models. In: Proceedings of the 6th working conference on mining software repositories, pp 99–102

  • Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2007a) Mining concepts from code with probabilistic topic models. In: Proceedings of the 22nd international conference on automated software engineering, pp 461–464

  • Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2007b) Mining Eclipse developer contributions via author-topic models. In: Proceedings of the 4th international workshop on mining software repositories, pp 30–33

  • Linstead E, Lopes C, Baldi P (2008a) An application of latent Dirichlet allocation to analyzing software evolution. In: Proceedings of the 7th international conference on machine learning and applications, pp 813–818

  • Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2008b) Mining internet-scale software repositories. In: Advances in neural information processing systems, vol 2007, pp 929–936

  • Linstead E, Bajracharya S, Ngo T, Rigor P, Lopes C, Baldi P (2008c) Sourcerer: mining and searching internet-scale software repositories. Data Min Knowl Disc 18(2):300–336

    Article  MathSciNet  Google Scholar 

  • Linstead E, Hughes L, Lopes C, Baldi P (2009) Software analysis with unsupervised topic models. In: NIPs workshop on application of topic models: text and beyond

  • Liu Y, Poshyvanyk D, Ferenc R, Gyimothy T, Chrisochoides N (2009) Modeling class cohesion as mixtures of latent topics. In: Proceedings of the 25th international conference on software maintenance, pp 233–242

  • Loehlin JC (1987) Latent variable models. Erlbaum Hillsdale

  • Lohar S, Amornborvornwong S, Zisman A, Cleland-Huang J (2013) Improving trace accuracy through data-driven configuration and composition of tracing features. In: Proceedings of the 2013 9th joint meeting on foundations of software engineering. ESEC/FSE 2013, pp 378–388

  • Lormans M (2007) Monitoring requirements evolution using views. In: Proceedings of the 11th European conference on software maintenance and reengineering, pp 349–352

  • Lormans M, Van Deursen A (2006) Can LSI help reconstructing requirements traceability in design and test? In: Proceedings of 10th European conference on software maintenance and reengineering, pp 47–56

  • Lormans M, Gross HG, van Deursen A, van Solingen R (2006) Monitoring requirements coverage using reconstructed views: an industrial case study. In: Proceedings of the 13th working conference on reverse engineering, pp 275–284

  • Lormans M, Deursen A, Gross H-G (2008) An industrial case study in reconstructing requirements views. Empir Softw Eng 13(6):727–760

    Article  Google Scholar 

  • Lukins SK, Kraft NA, Etzkorn LH (2008) Source code retrieval for bug localization using latent Dirichlet allocation. In: Proceedings of the 15th working conference on reverse engineering, pp 155–164

  • Lukins SK, Kraft NA, Etzkorn LH (2010) Bug localization using latent Dirichlet allocation. Inf Softw Technol 52(9):972–990

    Article  Google Scholar 

  • Madsen R, Sigurdsson S, Hansen L, Larsen J (2004) Pruning the vocabulary for better context recognition. In: Proceedings of the 17th international conference on pattern recognition, pp 483–488

  • Maletic JI, Marcus A (2001) Supporting program comprehension using semantic and structural information. In: Proceedings of the 23rd international conference on software engineering, pp 103–112

  • Maletic JI, Valluri N (1999) Automatic software clustering via Latent Semantic Analysis. In: Proceeding of the 14th international conference on automated software engineering, pp 251–254

  • Manning CD, Raghavan P, Schutze H (2008) Introduction to information retrieval, vol 1. University Press Cambridge, Cambridge

    Book  MATH  Google Scholar 

  • Marcus A (2004) Semantic driven program analysis. In: Proceedings of the 20th international conference on software maintenance, pp 469–473

  • Marcus A, Maletic JI (2001) Identification of high-level concept clones in source code. In: Proceedings of the 16th international conference on automated software engineering, pp 107–114

  • Marcus A, Maletic JI (2003) Recovering documentation-to-source-code traceability links using Latent Semantic Indexing. In: Proceedings of the 25th international conference on software engineering, pp 125–135

  • Marcus A, Sergeyev A, Rajlich V, Maletic JI (2004) An information retrieval approach to concept location in source code. In: Proceedings of the 11th working conference on reverse engineering, pp 214–223

  • Marcus A, Rajlich V, Buchta J, Petrenko M, Sergeyev A (2005) Static techniques for concept location in object-oriented code. In: Proceedings of the 13th international workshop on program comprehension, pp 33–42

  • Marcus A, Poshyvanyk D, Ferenc R (2008) Using the conceptual cohesion of classes for fault prediction in object-oriented systems. IEEE Trans Softw Eng 34 (2):287–300

    Article  Google Scholar 

  • Maskeri G, Sarkar S, Heafield K (2008) Mining business topics in source code using latent Dirichlet allocation. In: Proceedings of the 1st conference on India software engineering conference, pp 113–120

  • McCallum AK (2002) Mallet: a machine learning for language toolkit. http://mallet.cs.umass.edu

  • McIlroy S, Ali N, Khalid H, E Hassan A (2015) Analyzing and automatically labelling the types of user issues that are raised in mobile app reviews. Empir Softw Eng 1–40

  • McMillan C, Poshyvanyk D, Revelle M (2009) Combining textual and structural analysis of software artifacts for traceability link recovery. In: Proceedings of the ICSE workshop on traceability in emerging forms of software engineering, pp 41–48

  • Medini S (2011) Scalable automatic concept mining from execution traces. In: Proceedings of the 2011 19th international conference on program comprehension (ICPC), pp 238–241

  • Medini S, Antoniol G, Gueheneuc Y, Di Penta M, Tonella P (2012) Scan: an approach to label and relate execution trace segments. In: Proceedings of 2012 19th working conference on the reverse engineering (WCRE), pp 135–144

  • Mei Q, Zhai CX (2005) Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In: Proceedings of the 11th international conference on knowledge discovery in data mining, pp 198–207

  • Mei Q, Shen X, Zhai CX (2007) Automatic labeling of multinomial topic models. In: Proceedings of the 13th international conference on Knowledge discovery and data mining, pp 490–499

  • Mei Q, Cai D, Zhang D, Zhai CX (2008) Topic modeling with network regularization. In: Proceeding of the 17th international conference on World Wide Web, pp 101–110

  • Miller GA (1995) WordNet: a lexical database for english. Commun ACM 38 (11):39–41

    Article  Google Scholar 

  • Mimno D, Wallach HM, Naradowsky J, Smith DA, McCallum A (2009) Polylingual topic models. In: Proceedings of the 2009 conference on empirical methods in natural language processing, pp 880–889

  • Misirli AT, Bener AB, Turhan B (2011) An industrial case study of classifier ensembles for locating software defects. Softw Qual J 19(3):515–536

    Article  Google Scholar 

  • Misra J, Das S (2013) Entity disambiguation in natural language text requirements. In: Proceedings of the 2013 20th Asia-Pacific software engineering conference (APSEC), vol 1, pp 239–246

  • Misra J, Annervaz KM, Kaulgud V, Sengupta S, Titus G (2012) Software clustering: unifying syntactic and semantic features. In: Proceedings of the 2012 19th working conference on reverse engineering, pp 113–122

  • Moritz E, Linares-Vasquez M, Poshyvanyk D, Grechanik M, McMillan C, Gethers M (2013) Export: detecting and visualizing api usages in large source code repositories. In: Proceedings of the 2013 IEEE/ACM 28th international conference on automated software engineering, pp 646–651

  • Naguib H, Narayan N, Brugge B, Helal D (2013) Bug report assignee recommendation using activity profiles. In: Proceedings of the 2013 10th IEEE working conference on mining software repositories, pp 22–30

  • Neuhaus S, Zimmermann T (2010) Security trend analysis with CVE topic models. In: Proceedings of the 21st international symposium on software reliability engineering, pp 111–120

  • Newman D, Lau JH, Grieser K, Baldwin T (2010) Automatic evaluation of topic coherence. In: Human language technologies: the 2010 annual conference of the North American chapter of the association for computational linguistics. HLT ’10, pp 100–108

  • Nguyen TH, Adams B, Hassan AE (2010) A case study of bias in bug-fix datasets. In: Proceedings of the 17th working conference on reverse engineering, pp 259–268

  • Nguyen AT, Nguyen TT, Al-Kofahi J, Nguyen HV, Nguyen TN (2011a) A topic-based approach for narrowing the search space of buggy files from a bug report. In: Proceedings of the 26th international conference on automated software engineering, pp 263–272

  • Nguyen TT, Nguyen TN, Phuong TM (2011b) Topic-based defect prediction (nier track). In: Proceedings of the 33rd international conference on software engineering. ICSE ’11, pp 932–935

  • Nguyen A, Nguyen TT, Nguyen T, Lo D, Sun C (2012) Duplicate bug report detection with a combination of information retrieval and topic modeling. In: Proceedings of the 2012 27th IEEE/ACM international conference on automated software engineering, pp 70–79

  • Nie K, Zhang L (2012) Software feature location based on topic models. In: Proceedings of the 2012 19th Asia-Pacific software engineering conference. APSEC ’12, pp 547–552

  • Niu N, Savolainen J, Bhowmik T, Mahmoud A, Reddivari S (2012) A framework for examining topical locality in object-oriented software. In: Proceedings of the 2012 IEEE 36th annual computer software and applications conference, pp 219–224

  • Oliveto R, Gethers M, Poshyvanyk D, De Lucia A (2010) On the equivalence of information retrieval methods for automated traceability link recovery. In: Proceedings of the 18th international conference on program comprehension, pp 68–71

  • Oliveto R, Gethers M, Bavota G, Poshyvanyk D, De Lucia A (2011) Identifying method friendships to remove the feature envy bad smell. In: Proceeding of the 33rd international conference on software engineering (NIER Track), pp 820–823

  • Ossher J, Bajracharya S, Linstead E, Baldi P, Lopes C (2009) Sourcererdb: an aggregated repository of statically analyzed and cross-linked open source java projects. In: Proceedings of the 6th working conference on mining software repositories, pp 183–186

  • Pagano D, Maalej W (2013) How do open source communities blog? Empir Softw Eng 18(6):1090–1124

    Article  Google Scholar 

  • Panichella A, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2013) How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms. In: Proceedings of the 2013 international conference on software engineering. ICSE ’13, pp 522–531

  • Parizy M, Takayama K, Kanazawa Y (2014) Software defect prediction for lsi designs. In: Proceedings of the 2014 IEEE international conference on software maintenance and evolution (ICSME), pp 565–568

  • Paul M (2009) Cross-collection topic models: automatically comparing and contrasting text. Master’s thesis, University of Illinois at Urbana-Champaign, Urbana

  • Petersen K, Feldt R, Mujtaba S, Mattsson M (2008) Systematic mapping studies in software engineering. In: Proceedings of the 12th international conference on evaluation and assessment in software engineering. EASn08, pp 68–77

  • Phan X-H, Nguyen L-M, Horiguchi S (2008a) Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th international conference on World Wide Web. WWW ’08, pp 91–100

  • Phan XH, Nguyen LM, Horiguchi S (2008b) Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceeding of the 17th international conference on World Wide Web, pp 91–100

  • Pingclasai N, Hata H, Matsumoto K-I (2013) Classifying bug reports to bugs and other requests using topic modeling. In: Proceedings of the 2013 20th Asia-Pacific software engineering conference (APSEC). APSEC ’13, pp 13–18

  • Porteous I, Newman D, Ihler A, Asuncion A, Smyth P, Welling M (2008) Fast collapsed Gibbs sampling for latent Dirichlet allocation. In: Proceeding of the 14th international conference on knowledge discovery and data mining, pp 569–577

  • Porter M (1980) An algorithm for suffix stripping. Program 14:130

    Article  Google Scholar 

  • Poshyvanyk D, Grechanik M (2009) Creating and evolving software by searching, selecting and synthesizing relevant source code. In: Proceedings of the 31st international conference on software engineering, pp 283–286

  • Poshyvanyk D, Marcus A (2007) Combining formal concept analysis with information retrieval for concept location in source code. In: Proceedings of the 15th international conference on program comprehension, pp 37–48

  • Poshyvanyk D, Marcus A, Rajlich V et al (2006) Combining probabilistic ranking and Latent Semantic Indexing for feature identification. In: Proceedings of the 14th international conference on program comprehension, pp 137–148

  • Poshyvanyk D, Gueheneuc Y, Marcus A, Antoniol G, Rajlich V (2007) Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval. IEEE Trans Softw Eng 33(6):420–432

    Article  Google Scholar 

  • Poshyvanyk D, Gethers M, Marcus A (2013) Concept location using formal concept analysis and information retrieval. ACM Trans Softw Eng Methodol 21 (4):23:1–23:34

    Google Scholar 

  • Qusef A, Bavota G, Oliveto R, Lucia AD, Binkley D (2013) Evaluating test-to-code traceability recovery methods through controlled experiments. J Softw Evol Process 25(11):1167–1191

    Article  Google Scholar 

  • Rahman F, Bird C, Devanbu PT (2012) Clones: what is that smell? Empir Softw Eng 17(4–5):503–530

    Article  Google Scholar 

  • Raja U (2012) All complaints are not created equal: text analysis of open source software defect reports. Empir Softw Eng 18:117–138

  • Rajlich V, Wilde N (2002) The role of concepts in program comprehension. In: Proceedings of the 10th international workshop on program comprehension, pp 271–278

  • Ramage D, Hall D, Nallapati R, Manning CD (2009a) Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 conference on empirical methods in natural language processing: volume 1, pp 248–256

  • Ramage D, Rosen E, Chuang J, Manning CD, McFarland DA (2009b) Topic modeling for the social sciences. In: NIPS 2009 workshop on applications for topic models: text and beyond

  • Rao S, Kak A (2011) Retrieval from software libraries for bug localization: a comparative study of generic and composite text models. In: Proceeding of the 8th working conference on mining software repositories, pp 43–52

  • Revelle M, Poshyvanyk D (2009) An exploratory study on assessing feature location techniques. In: Proceedings of the 17th international conference on program comprehension, pp 218–222

  • Revelle M, Dit B, Poshyvanyk D (2010) Using data fusion and web mining to support feature location in software. In: Proceedings of the 18th international conference on program comprehension, pp 14–23

  • Risi M, Scanniello G, Tortora G (2010) Architecture recovery using latent semantic indexing and k-means: an empirical evaluation. In: Proceedings of the 2010 8th IEEE international conference on software engineering and formal methods (SEFM), pp 103–112

  • Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P (2004) The author-topic model for authors and documents. In: Proceedings of the 20th conference on uncertainty in artificial intelligence, pp 487–494

  • Roy CK, Cordy JR, Koschke R (2009) Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci Comput Program 74 (7):470–495

    Article  MathSciNet  MATH  Google Scholar 

  • Saha R, Lease M, Khurshid S, Perry D (2013) Improving bug localization using structured information retrieval. In: Proceedings of the 2013 IEEE/ACM 28th international conference on automated software engineering, pp 345–355

  • Salton G, McGill MJ (1983) Introduction to modern information retrieval. New York

  • Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):620

    Article  MATH  Google Scholar 

  • Savage T, Dit B, Gethers M, Poshyvanyk D (2010) TopicXP: exploring topics in source code using latent Dirichlet allocation. In: Proceedings of the 26th international conference on software maintenance, pp 1–6

  • Shang W, Jiang ZM, Adams B, Hassan AE, Godfrey MW, Nasser M, Flora P (2013) An exploratory study of the evolution of communicated information about the execution of large software systems. J Softw Evol Process 26(1):3–26

    Article  Google Scholar 

  • Sharafl Z, Gueheneuc Y-G, Ali N, Antoniol G (2012) An empirical study on requirements traceability using eye-tracking. In: Proceedings of the 2012 IEEE international conference on software maintenance. ICSM ’12, pp 191–200

  • Shihab E, Jiang ZM, Ibrahim WM, Adams B, Hassan AE (2010) Understanding the impact of code and process metrics on post-release defects: a case study on the eclipse project. In: Proceedings of the international symposium on empirical software engineering and measurement

  • Somasundaram K, Murphy GC (2012) Automatic categorization of bug reports using latent dirichlet allocation. In: Proceedings of the 5th India software engineering conference. ISEC ’12, pp 125–130

  • Steyvers M, Griffiths T (2007) Probabilistic topic models. In: Latent semantic analysis: a road to meaning. Laurence Erlbaum

  • Tairas R, Gray J (2009) An information retrieval process to aid in the analysis of code clones. Empir Softw Eng 14(1):33–56

    Article  Google Scholar 

  • Tang J, Meng Z, Nguyen X, Mei Q, Zhang M (2014) Understanding the limiting factors of topic modeling via posterior contraction analysis. In: Jebara T, Xing EP (eds) Proceedings of the 31st international conference on machine learning (ICML-14), pp 190–198

  • Thomas S, Nagappan M, Blostein D, Hassan A (2013) The impact of classifier configuration and classifier combination on bug localization. IEEE Trans Softw Eng 39(10):1427–1443

    Article  Google Scholar 

  • Thomas SW, Adams B, Hassan AE, Blostein D (2010) Validating the use of topic models for software evolution. In: Proceedings of the 10th international working conference on source code analysis and manipulation, pp 55–64

  • Thomas SW, Hemmati H, Hassan AE, Blostein D (2014) Static test case prioritization using topic models. Empir Softw Eng 19:182–212

  • Tian K, Revelle M, Poshyvanyk D (2009) Using latent Dirichlet allocation for automatic categorization of software. In: Proceedings of the 6th international working conference on mining software repositories, pp 163–166

  • Tichy W (2010) An interview with Prof. Andreas Zeller: mining your way to software reliability. Ubiquity 2010

  • Ujhazi B, Ferenc R, Poshyvanyk D, Gyimothy T (2010) New conceptual coupling and cohesion metrics for object-oriented systems. In: Proceedings of the 10th international working conference on source code analysis and manipulation, pp 33–42

  • Van der Spek P, Klusener S, Van de Laar P (2008) Towards recovering architectural concepts using Latent Semantic Indexing. In: Proceedings of the 12th European conference on software maintenance and reengineering, pp 253–257

  • Wall M, Rechtsteiner A, Rocha L (2003) Singular value decomposition and principal component analysis, pp 91–109

  • Wallach H, Mimno D, McCallum A (2009a) Rethinking LDA: why priors matter. In: Proceedings of NIPS-09, Vancouver, BC

  • Wallach HM, Murray I, Salakhutdinov R, Mimno D (2009b) Evaluation methods for topic models. In: Proceedings of the 26th international conference on machine learning, pp 1105–1112

  • Wang X, McCallum A (2006) Topics over time: a non-markov continuous-time model of topical trends. In: Proceedings of the 12th international conference on Knowledge discovery and data mining. ACM, pp 424–433

  • Wang C, Thiesson B, Meek C, Blei D (2009) Markov topic models. In: The twelfth international conference on artificial intelligence and statistics (AISTATS), pp 583–590

  • Wang S, Lo D, Xing Z, Jiang L (2011) Concern localization using information retrieval: an empirical study on linux kernel. In: Proceedings of the 2011 18th working conference on reverse engineering. WCRE ’11, pp 92–96

  • Wu C, Chang E, Aitken A (2008) An empirical approach for semantic web services discovery. In: Proceedings of the 19th Australian conference on software engineering, pp 412–421

  • Xia X, Lo D, Wang X, Zhou B (2013) Accurate developer recommendation for bug resolution. In: Proceedings of the 2013 20th working conference on reverse engineering, pp 72–81

  • Xie B, Li M, Jin J, Zhao J, Zou Y (2013) Mining cohesive domain topics from source code. In: Proceedings of the international conference on software reuse, pp 239–254

  • Xue Y, Xing Z, Jarzabek S (2012) Feature location in a collection of product variants. In: Proceedings of the 2012 19th working conference on reverse engineering, pp 145–154

  • Yan X, Guo J, Lan Y, Cheng X (2013) A biterm topic model for short texts. In: Proceedings of the 22nd international conference on World Wide Web. WWW ’13, pp 1445–1456

  • Yang G, Zhang T, Lee B (2014) Towards semi-automatic bug triage and severity prediction based on topic model and multi-feature of bug reports. In: Proceedings of the 2014 IEEE 38th annual computer software and applications conference. COMPSAC ’14, pp 97–106

  • Yu S (2012) Retrieving software maintenance history with topic models. In: Proceedings of the 2012 28th IEEE international conference on software maintenance (ICSM), pp 621–624

  • Zawawy H, Kontogiannis K, Mylopoulos J (2010) Log filtering and interpretation for root cause analysis. In: Proceedings of the 26th international conference on software maintenance, pp 1–5

  • Zhai C. X (2008) Statistical language models for information retrieval. Synthesis Lect Human Lang Technol 1(1):1–141

    Article  Google Scholar 

  • Zhou J, Zhang H, Lo D (2012) Where should the bugs be fixed? - more accurate information retrieval-based bug localization based on bug reports. In: Proceedings of the 34th international conference on software engineering. ICSE ’12, pp 14–24

  • Zimmermann T, Weisgerber P, Diehl S, Zeller A (2005) Mining version histories to guide software changes. IEEE Trans Softw Eng 31:429–445

  • Zou C, Hou D (2014) Lda analyzer: a tool for exploring topic models. In: Proceedings of the 2014 international conference on software maintenance and evolution (ICSME), pp 593–596

Download references

Author information

Corresponding author

Correspondence to Tse-Hsun Chen.

Additional information

Communicated by: Mark Harman

Appendices

Appendix A: Article Selection Process

In this paper, we are interested in locating articles that use topic modeling techniques to solve SE tasks. We focus our attention on articles written between December 1999 and December 2014, more than a full decade of research results. We consider in our search a wide range of journals and conferences in an effort to cover as many relevant articles as possible.

To select our list of articles, we first compile a list of highly relevant venues to search. Then, we perform a series of keyword searches at each venue, producing a list of candidate articles. For each candidate article, we read the abstract (and, in some cases, the introduction) to determine whether the article is indeed relevant to our interests. This yields an initial set of related articles. For each article in the initial set, we then check the articles that it cites for additional relevant articles, which gives us our final set of articles.

A.1 Considered Venues

Table 4 lists the journals and conference venues that we included in our initial search for articles.

Table 4 The fifteen venues that we considered in our initial article selection process

A.2 Keyword Searches and Filtering

We collected the initial set of articles by performing keyword searches at the publisher websites for each of our considered venues. We also searched using aggregate search engines, such as the ACM Digital Library and IEEE Xplore. The keywords and search queries that we used are listed below.

IEEE Xplore

("topic models" OR "topic model" OR "lsi" OR "lda" OR "plsi" OR "latent dirichlet allocation" OR "latent semantic") AND ("Publication Title":"Source Code Analysis and Manipulation" OR "Publication Title":"Software Engineering, IEEE Transactions on" OR "Publication Title":"Reverse Engineering" OR "Publication Title":"Software Maintenance" OR "Publication Title":"Software Engineering" OR "Publication Title":"Program Comprehension" OR "Publication Title":"Mining Software Repositories")
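A query of this shape can also be assembled from two lists, which makes it easier to see which keywords and venue filters were combined. The following is only a minimal sketch: the keyword and venue strings are copied from the query above, while the function name build_query and the variable names are hypothetical and not part of the original study.

```python
# Keywords and venue titles copied from the IEEE Xplore query above.
KEYWORDS = [
    "topic models", "topic model", "lsi", "lda", "plsi",
    "latent dirichlet allocation", "latent semantic",
]
VENUES = [
    "Source Code Analysis and Manipulation",
    "Software Engineering, IEEE Transactions on",
    "Reverse Engineering",
    "Software Maintenance",
    "Software Engineering",
    "Program Comprehension",
    "Mining Software Repositories",
]


def build_query(keywords, venues):
    """Join quoted keywords and venue filters into one boolean query string.

    Hypothetical helper for illustration; it only formats text and does not
    contact any search engine.
    """
    keyword_clause = " OR ".join(f'"{k}"' for k in keywords)
    venue_clause = " OR ".join(f'"Publication Title":"{v}"' for v in venues)
    return f"({keyword_clause}) AND ({venue_clause})"


query = build_query(KEYWORDS, VENUES)
print(query)
```

Keeping the keywords and venues in separate lists means a venue can be added or a keyword variant dropped without re-editing a long quoted string by hand.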

Software Practice and Experience

lsi or lda or "topic model" or "topic models" or "latent dirichlet allocation" or "latent semantic" AND publication title="Software Practice and Experience"

Journal of Software Maintenance and Evolution

lsi or lda or "topic model" or "topic models" or "latent dirichlet allocation" or "latent semantic" AND publication title="Software Maintenance and Evolution"

Empirical Software Engineering

lsi or lda or "topic model" or "topic models" or "latent dirichlet allocation" or "latent semantic"

In general, we found that the keyword searches returned many irrelevant results. We manually filtered the search results by reading each article’s abstract (and sometimes introduction) to determine whether the article solved an SE task by employing one or more topic modeling techniques. The articles that were determined to be relevant were added to our initial set.

A.3 Reference Checking

For each article in the initial set, we followed its citations to obtain another list of potentially relevant articles. Again, we filtered this list by reading the abstract and introduction. The articles that were determined to be relevant were added to our final set of articles.
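The two filtering stages described in A.2 and A.3 can be sketched as a small set computation. This is an illustrative sketch only: the relevance check was a manual reading step in the study, so the is_relevant callable, the function name select_articles, and the toy article identifiers below are all hypothetical stand-ins.

```python
def select_articles(candidates, citations, is_relevant):
    """Return (initial_set, final_set) for one round of reference checking.

    candidates  : article ids returned by the keyword searches
    citations   : dict mapping an article id to the ids it cites
    is_relevant : stand-in for the manual abstract/introduction check
    """
    # Stage 1: keep only the keyword-search hits judged relevant.
    initial = {a for a in candidates if is_relevant(a)}

    # Stage 2: collect everything cited by the initial set.
    cited = set()
    for article in initial:
        cited.update(citations.get(article, ()))

    # Filter the newly found articles with the same relevance check.
    final = initial | {a for a in cited - initial if is_relevant(a)}
    return initial, final


# Toy data, invented purely for illustration.
candidates = ["A", "B", "C"]
citations = {"A": ["D", "E"], "B": ["A"], "C": ["F"]}
relevant = {"A", "B", "D", "F"}

initial, final = select_articles(candidates, citations, relevant.__contains__)
print(sorted(initial), sorted(final))
```

In the toy data, article F is never examined because it is cited only by C, which was filtered out in the first stage; only citations of articles in the initial set are followed.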

A.4 Article Selection Results

We finally arrive at 167 articles published between 1999 and 2014. Figure 7 shows the distribution of venues and years for the articles.

Fig. 7 Article selection results

Appendix B: Article Characterization Results for Facets 1–6

Table 5 Article characterization results of facets 1–4
Table 6 Article characterization results of facets 5 and 6

About this article

Cite this article

Chen, TH., Thomas, S.W. & Hassan, A.E. A survey on the use of topic models when mining software repositories. Empir Software Eng 21, 1843–1919 (2016). https://doi.org/10.1007/s10664-015-9402-8


  • DOI: https://doi.org/10.1007/s10664-015-9402-8
