Mining Source Code Topics Through Topic Model and Words Embedding

  • Wei Emma Zhang
  • Quan Z. Sheng
  • Ermyas Abebe
  • M. Ali Babar
  • Andi Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10086)


Developers can now build their own applications on top of existing systems. However, a lack of documentation hinders the reuse of software systems. We examine the problem of mining topics (i.e., topic extraction) from source code, which can facilitate comprehension of software systems. We propose a topic extraction method, Embedded Topic Extraction (EmbTE), that leverages word embedding techniques to take word semantics into account, something that has not previously been considered in mining topics from source code. We also adopt Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) to extract topics from source code. Moreover, we propose an automated term selection algorithm that identifies the types of source code terms that contribute most to the topic extraction task. Empirical studies on GitHub Java projects show that EmbTE outperforms the other methods by providing more coherent topics. The results also indicate that method names, method comments, class names and class comments are the types of terms that contribute most to source code topic extraction.
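To make the notion of term types concrete, the following is a minimal sketch of how class names, method names and comments might be pulled out of a Java file and turned into a bag of words that a topic model such as LDA or NMF could consume. It is not the paper's term selection algorithm: the regular expressions, the `STOPWORDS` and `CONTROL_KEYWORDS` sets, the type labels and the sample class are all hypothetical, and the sketch does not separate class comments from method comments as the paper does.

```python
import re
from collections import Counter

# Hypothetical sample input; not taken from the paper's data set.
JAVA_SNIPPET = '''
/** Parses HTTP request headers. */
public class RequestParser {
    // Read the header map from a raw request
    public Map<String, String> parseHeaders(String rawRequest) {
        return null;
    }
}
'''

# Simplified patterns (assumptions): real Java needs a proper parser.
COMMENT_RE = re.compile(r'/\*(.*?)\*/|//([^\n]*)', re.DOTALL)
CLASS_RE = re.compile(r'\bclass\s+([A-Za-z_]\w*)')
METHOD_RE = re.compile(r'\b([a-z_]\w*)\s*\([^)]*\)\s*\{')

# Illustrative filter lists, not from the paper.
STOPWORDS = {"the", "a", "an", "from", "of", "to"}
CONTROL_KEYWORDS = {"if", "for", "while", "switch", "catch", "synchronized"}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase terms."""
    parts = re.findall(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+', name)
    return [p.lower() for p in parts]


def extract_terms(source):
    """Return the terms of a Java source string, grouped by term type."""
    comment_words = []
    for block, line in COMMENT_RE.findall(source):
        comment_words.extend(
            w.lower() for w in re.findall(r'[A-Za-z]+', block or line)
        )
    # Remove comments so they are not mistaken for code.
    code = COMMENT_RE.sub(' ', source)
    method_names = [
        n for n in METHOD_RE.findall(code) if n not in CONTROL_KEYWORDS
    ]
    return {
        "class_name": [
            t for n in CLASS_RE.findall(code) for t in split_identifier(n)
        ],
        "method_name": [
            t for n in method_names for t in split_identifier(n)
        ],
        "comment": [w for w in comment_words if w not in STOPWORDS],
    }


def bag_of_words(terms, selected_types):
    """Count terms from the selected term types only."""
    return Counter(t for kind in selected_types for t in terms[kind])


if __name__ == "__main__":
    terms = extract_terms(JAVA_SNIPPET)
    print(terms)
    print(bag_of_words(terms, ["class_name", "method_name", "comment"]))
```

Restricting `selected_types` to different subsets gives one simple way to compare how much each term type contributes to the resulting topics, which is the kind of question the term selection step addresses.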


Keywords: Source code mining, Topic model, Word embedding


Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Wei Emma Zhang (1, corresponding author)
  • Quan Z. Sheng (1)
  • Ermyas Abebe (2)
  • M. Ali Babar (1)
  • Andi Zhou (2)

  1. School of Computer Science, The University of Adelaide, Adelaide, Australia
  2. IBM Research, Melbourne, Australia
