Semantic topic models for source code analysis

Empirical Software Engineering

Abstract

Topic modeling techniques have recently been applied to analyze and model source code. Such techniques exploit the textual content of source code to provide automated support for several basic software engineering activities. Despite these advances, applications of topic modeling in software engineering are frequently suboptimal. This can be attributed to the fact that current state-of-the-art topic modeling techniques tend to be data intensive, whereas the textual content of source code, embedded in its identifiers, comments, and string literals, tends to be sparse. This sparsity prevents classical topic modeling techniques, typically used to model natural language texts, from generating proper models when applied to source code. Furthermore, the operational complexity and multi-parameter calibration often associated with conventional topic modeling techniques raise important concerns about their feasibility as data analysis models in software engineering. Motivated by these observations, in this paper we propose a novel approach for topic modeling designed for source code. The proposed approach exploits the basic assumptions of the cluster hypothesis and information theory to discover semantically coherent topics in software systems. Ten software systems from different application domains are used to empirically calibrate and configure the proposed approach. The usefulness of generated topics is empirically validated using human judgment. Furthermore, a case study that demonstrates the operation of the proposed approach in analyzing code evolution is reported. The results show that our approach produces stable, more interpretable, and more expressive topics than classical topic modeling techniques without the necessity for extensive parameter calibration.


Notes

  1. http://nlp.stanford.edu/

  2. https://github.com/seelprojects/STAC

  3. https://dumps.wikimedia.org/

  4. https://books.google.com/ngrams

  5. http://agile.csc.ncsu.edu/iTrust/wiki/doku.php

  6. http://ant.apache.org/ivy/

  7. https://sourceforge.net/projects/jhotdraw/

  8. http://javahmo.sourceforge.net/

  9. http://developers.itextpdf.com/itext-java

  10. https://github.com/MegaMek

  11. http://prefuse.org/download/

  12. http://www.jedit.org/

  13. http://ant.apache.org/

  14. https://www.jbidwatcher.com/

  15. http://jgibblda.sourceforge.net/

  16. https://github.com/FredJul/Flym


Acknowledgments

The authors would like to thank our study participants for their time and feedback and the Institutional Review Board (IRB) at LSU for approving our research. This work was supported by the Louisiana Board of Regents Research Competitiveness Subprogram (LA BoR-RCS), contract number: LEQSF(2015-18)-RD-A-07.

Author information

Corresponding author

Correspondence to Anas Mahmoud.

Additional information

Communicated by: Denys Poshyvanyk

About this article

Cite this article

Mahmoud, A., Bradshaw, G. Semantic topic models for source code analysis. Empir Software Eng 22, 1965–2000 (2017). https://doi.org/10.1007/s10664-016-9473-1

  • DOI: https://doi.org/10.1007/s10664-016-9473-1
