Do topics make sense to managers and developers?

Empirical Software Engineering

Abstract

Large organizations like Microsoft tend to rely on formal requirements documentation to specify and design the software products that they develop. These documents are meant to be tightly coupled with the actual implementation of the features they describe. In this paper we evaluate the value of high-level topic-based requirements traceability and issue report traceability in the version control system, using Latent Dirichlet Allocation (LDA). We evaluate LDA topics with practitioners and check whether the extracted topics and trends match the perception that industrial Program Managers and Developers have of the effort put into addressing certain topics. We then replicate this study with Open Source Developers, using issue reports from issue trackers instead of requirements, and confirm our previous industrial conclusions. We found that efforts extracted as commits from version control systems and relevant to a topic often matched the perception of the managers and developers of what actually occurred at that time. Furthermore, we found evidence that many of the identified topics made sense to practitioners and matched their perception of what occurred. For some topics, however, practitioners had difficulty interpreting and labelling them. In summary, we investigate the high-level traceability of requirements topics and issue/bug report topics to version control commits via topic analysis, and we validate with the actual stakeholders the relevance of these topics extracted from requirements and issues.
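As a rough illustration of relating topics to commits over time, the sketch below counts, per month, the commits whose messages share words with a topic. It is not the procedure used in the paper: the topics are assumed to be already available as word sets (for example, the top words of an LDA run), and the topic names, commit messages, `min_overlap` threshold, and word-overlap matching are all hypothetical simplifications.

```python
import re
from collections import Counter

# Hypothetical topics, written as the top words an LDA run might return.
TOPICS = {
    "topic_0": {"login", "password", "user", "session"},
    "topic_1": {"render", "layout", "font", "window"},
}

# Hypothetical commit log entries: (month, commit message).
COMMITS = [
    ("2012-01", "Fix password reset for user login"),
    ("2012-01", "Update window layout on resize"),
    ("2012-02", "Session timeout during login"),
    ("2012-02", "Bump version number"),
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def topic_trends(topics, commits, min_overlap=2):
    """Count, per topic and month, the commits that share at least
    min_overlap words with the topic (an assumed relevance rule)."""
    trends = {name: Counter() for name in topics}
    for month, message in commits:
        words = tokenize(message)
        for name, topic_words in topics.items():
            if len(words & topic_words) >= min_overlap:
                trends[name][month] += 1
    return trends


trends = topic_trends(TOPICS, COMMITS)
for name, counts in trends.items():
    print(name, dict(counts))
```

A per-topic, per-month count of this kind is the sort of trend that could then be shown to managers and developers to ask whether it matches their recollection of the effort spent.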

Notes

  1. The set of stop words used: http://softwareprocess.es/b/stop_words

  2. git-svn man page: https://www.kernel.org/pub/software/scm/git/docs/git-svn.html

  3. git-grep.pl is located here: https://github.com/abramhindle/gh-lda-extractor

  4. JSON Definition: http://JSON.org

  5. Github Issue Extractor: https://github.com/abramhindle/github-issues-to-json

  6. Google Code Issue Extractor: https://github.com/abramhindle/google-code-bug-tracker-downloader

  7. Vowpal Wabbit: https://github.com/JohnLangford/vowpal_wabbit/wiki

  8. Our Github LDA Extractor: https://github.com/abramhindle/gh-lda-extractor

  9. FreeNode: http://freenode.org

  10. IRSSI IRC Client: http://irssi.org/

  11. Asterisk issue tracker guidelines: https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines

Acknowledgments

Thanks to the many managers and developers at Microsoft who volunteered their time to participate in our research and provide their valuable insights and feedback. Abram Hindle performed some of this work as a visiting researcher at Microsoft Research. Thanks to the Natural Sciences and Engineering Research Council of Canada for partially funding this work. Thanks to Abram Hindle’s first student, Zhang Chenlei, for his feedback. Thanks to the FLOSS developers who chose to participate: Julian Harty, Lisa Milne, Tobias Leich, Ian Cordasco, Ricky Elrod, Anthony Grimes, Geoffrey Greer, Nicolas J. Bouliane, Drew DeVault, Daniel Huckstep, Chad Whitacre, Devin Joel Austin, and Gerson Goulart.

Author information

Corresponding author

Correspondence to Abram Hindle.

Additional information

Communicated by: Massimiliano Di Penta and Jonathan Maletic

About this article

Cite this article

Hindle, A., Bird, C., Zimmermann, T. et al. Do topics make sense to managers and developers?. Empir Software Eng 20, 479–515 (2015). https://doi.org/10.1007/s10664-014-9312-1

  • DOI: https://doi.org/10.1007/s10664-014-9312-1
