Empirical Software Engineering, Volume 18, Issue 6, pp 1125–1155

Automated topic naming

Supporting cross-project analysis of software maintenance activities
  • Abram Hindle
  • Neil A. Ernst
  • Michael W. Godfrey
  • John Mylopoulos


Software repositories provide a deluge of software artifacts to analyze. Researchers have attempted to summarize, categorize, and relate these artifacts by using semi-unsupervised machine-learning algorithms, such as Latent Dirichlet Allocation (LDA). LDA is used for concept and topic analysis to suggest candidate word-lists or topics that describe and relate software artifacts. However, these word-lists and topics are difficult to interpret in the absence of meaningful summary labels. Current attempts to interpret topics assume manual labelling and do not use domain-specific knowledge to improve, contextualize, or describe results for the developers. We propose a solution: automated labelled topic extraction. Topics are extracted using LDA from commit-log comments recovered from source control systems. These topics are given labels from a generalizable cross-project taxonomy, consisting of non-functional requirements. Our approach was evaluated with experiments and case studies on three large-scale Relational Database Management System (RDBMS) projects: MySQL, PostgreSQL and MaxDB. The case studies show that labelled topic extraction can produce appropriate, context-sensitive labels that are relevant to these projects, and provide fresh insight into their evolving software development activities.
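As a rough illustration of the labelling step described above, the sketch below assigns non-functional-requirement labels to topic word-lists by simple word overlap. The label names, their indicative words, the example topics, and the `label_topic` helper are all hypothetical and chosen only for illustration; in practice the topic word-lists would come from running LDA over commit-log comments, and the matching rule may well differ from the one shown here.

```python
# Hypothetical NFR taxonomy: label -> indicative words (illustrative only,
# not the word-lists used in the paper).
NFR_WORDS = {
    "efficiency": {"performance", "slow", "fast", "memory", "optimize"},
    "portability": {"windows", "linux", "platform", "compile", "port"},
    "reliability": {"crash", "error", "fail", "recover", "bug"},
}


def label_topic(topic_words, taxonomy=NFR_WORDS, min_overlap=1):
    """Return the sorted labels whose word-list shares at least
    `min_overlap` words with the topic's word-list."""
    words = {w.lower() for w in topic_words}
    return sorted(
        label
        for label, vocab in taxonomy.items()
        if len(words & vocab) >= min_overlap
    )


# Stand-ins for LDA output: each topic is a list of its top words.
topics = [
    ["fix", "crash", "windows", "build"],
    ["optimize", "memory", "index"],
    ["update", "copyright", "year"],
]

for topic in topics:
    print(topic, "->", label_topic(topic) or ["(unlabelled)"])
```

A topic may receive several labels or none at all; raising `min_overlap` trades coverage for stricter matches.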


Software maintenance, Repository mining, Latent Dirichlet allocation, Topic models



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Abram Hindle (1)
  • Neil A. Ernst (2)
  • Michael W. Godfrey (3)
  • John Mylopoulos (4)

  1. Dept. of Computing Science, University of Alberta, Edmonton, Canada
  2. Dept. of Computer Science, University of British Columbia, Vancouver, Canada
  3. David Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
  4. Dept. Information Eng. and Computer Science, University of Trento, Trento, Italy
