Abstract
Large organizations like Microsoft tend to rely on formal requirements documentation to specify and design the software products that they develop. These documents are meant to be tightly coupled with the actual implementation of the features they describe. In this paper we evaluate the value of high-level, topic-based requirements traceability and issue report traceability in the version control system, using Latent Dirichlet Allocation (LDA). We evaluate LDA topics with practitioners and check whether the extracted topics and trends match the perception that industrial Program Managers and Developers have of the effort put into addressing certain topics. We then replicate this study with Open Source Developers, using issue reports from issue trackers instead of requirements, and confirm our earlier industrial conclusions. We found that the effort relevant to a topic, extracted as commits from version control systems, often matched the managers' and developers' perception of what actually occurred at that time. Furthermore, we found evidence that many of the identified topics made sense to practitioners and matched their perception of what occurred. For some topics, however, practitioners had difficulty interpreting and labelling them. In summary, we investigate the high-level traceability of requirements topics and issue/bug report topics to version control commits via topic analysis, and we validate with the actual stakeholders the relevance of the topics extracted from requirements and issues.
Notes
The set of stop words used: http://softwareprocess.es/b/stop_words
git-svn man page: https://www.kernel.org/pub/software/scm/git/docs/git-svn.html
git-grep.pl is located here: https://github.com/abramhindle/gh-lda-extractor
JSON Definition: http://JSON.org
Github Issue Extractor: https://github.com/abramhindle/github-issues-to-json
Google Code Issue Extractor: https://github.com/abramhindle/google-code-bug-tracker-downloader
Vowpal Wabbit: https://github.com/JohnLangford/vowpal_wabbit/wiki
Our Github LDA Extractor: https://github.com/abramhindle/gh-lda-extractor
FreeNode: http://freenode.org
IRSSI IRC Client: http://irssi.org/
Asterisk issue tracker guidelines: https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines
Acknowledgments
Thanks to the many managers and developers at Microsoft who volunteered their time to participate in our research and provide their valuable insights and feedback. Abram Hindle performed some of this work as a visiting researcher at Microsoft Research. Thanks to the Natural Sciences and Engineering Research Council of Canada for partially funding this work. Thanks to Abram Hindle’s first student, Zhang Chenlei, for his feedback. Thanks to the FLOSS developers who chose to participate: Julian Harty, Lisa Milne, Tobias Leich, Ian Cordasco, Ricky Elrod, Anthony Grimes, Geoffrey Greer, Nicolas J. Bouliane, Drew DeVault, Daniel Huckstep, Chad Whitacre, Devin Joel Austin, and Gerson Goulart.
Additional information
Communicated by: Massimiliano Di Penta and Jonathan Maletic
Cite this article
Hindle, A., Bird, C., Zimmermann, T. et al. Do topics make sense to managers and developers?. Empir Software Eng 20, 479–515 (2015). https://doi.org/10.1007/s10664-014-9312-1
DOI: https://doi.org/10.1007/s10664-014-9312-1