Semantic topic models for source code analysis

Mahmoud, Anas; Bradshaw, Gary

doi:10.1007/s10664-016-9473-1

Published: 22 November 2016

Volume 22, pages 1965–2000, (2017)

Empirical Software Engineering

Anas Mahmoud¹ & Gary Bradshaw²

Abstract

Topic modeling techniques have recently been applied to analyze and model source code. Such techniques exploit the textual content of source code to provide automated support for several basic software engineering activities. Despite these advances, applications of topic modeling in software engineering are frequently suboptimal. This can be attributed to the fact that current state-of-the-art topic modeling techniques tend to be data intensive, whereas the textual content of source code, embedded in its identifiers, comments, and string literals, tends to be sparse. This sparsity prevents classical topic modeling techniques, typically used to model natural language texts, from generating proper models when applied to source code. Furthermore, the operational complexity and multi-parameter calibration often associated with conventional topic modeling techniques raise important concerns about their feasibility as data analysis models in software engineering. Motivated by these observations, in this paper we propose a novel approach for topic modeling designed for source code. The proposed approach exploits the basic assumptions of the cluster hypothesis and information theory to discover semantically coherent topics in software systems. Ten software systems from different application domains are used to empirically calibrate and configure the proposed approach. The usefulness of the generated topics is empirically validated using human judgment. Furthermore, a case study that demonstrates the operation of the proposed approach in analyzing code evolution is reported. The results show that our approach produces stable, more interpretable, and more expressive topics than classical topic modeling techniques without the need for extensive parameter calibration.

Acknowledgments

The authors would like to thank our study participants for their time and feedback and the Institutional Review Board (IRB) at LSU for approving our research. This work was supported by the Louisiana Board of Regents Research Competitiveness Subprogram (LA BoR-RCS), contract number: LEQSF(2015-18)-RD-A-07.

Author information

Authors and Affiliations

Division of Computer Science and Engineering, Louisiana State University, Baton Rouge, LA, 70803, USA
Anas Mahmoud
Department of Psychology, Mississippi State University, Mississippi State, MS, 39762, USA
Gary Bradshaw

Corresponding author

Correspondence to Anas Mahmoud.

Additional information

Communicated by: Denys Poshyvanyk

About this article

Cite this article

Mahmoud, A., Bradshaw, G. Semantic topic models for source code analysis. Empir Software Eng 22, 1965–2000 (2017). https://doi.org/10.1007/s10664-016-9473-1

Issue Date: August 2017
DOI: https://doi.org/10.1007/s10664-016-9473-1
