Abstract
Using topic models to mine domain topics from source code is a promising way to help developers comprehend the functional concerns implemented in a software system. However, not all topics mined from source code are domain topics that represent functional concerns of the software. Other topics may represent cross-cutting or other concerns, and these are noise when the goal is to help developers comprehend functional concerns. In this paper, we propose an approach that filters out this noise and mines Cohesive Domain Topics (CDTs) from source code. A topic is a CDT if its associated words represent a certain functional concern and its associated source code elements collaboratively implement that concern. First, we propose a series of Filtering Heuristics to filter out programming-related information in source code that may introduce noise. Then, we mine raw topics from source code using Latent Dirichlet Allocation. Finally, based on the structural relationships among the source code elements associated with a topic, we propose a novel metric called Topic Cohesion to identify CDTs among the raw topics. Experimental results on a set of open source software systems show that our approach can effectively filter out noise and obtain CDTs from source code.
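The first and last steps of this pipeline can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not define the Filtering Heuristics or the Topic Cohesion formula, so the stop list `PROGRAMMING_TERMS`, the pairwise-density score in `topic_cohesion`, the `0.5` threshold, and the toy data are all hypothetical. The Latent Dirichlet Allocation step is omitted; the sketch assumes the raw topics and their associated source code elements are already available.

```python
import re
from itertools import combinations

# Hypothetical stop list of programming-related terms (an assumption;
# the paper's actual Filtering Heuristics are not given in the abstract).
PROGRAMMING_TERMS = {"get", "set", "int", "string", "list", "impl", "util"}


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def filter_words(identifier):
    """Keep only the words that are not programming-related terms."""
    return [w for w in split_identifier(identifier) if w not in PROGRAMMING_TERMS]


def topic_cohesion(elements, relations):
    """Illustrative cohesion score: the fraction of element pairs in a topic
    that are linked by a structural relationship (e.g. a call or a reference).
    This is a stand-in, not the metric defined in the paper."""
    elements = sorted(set(elements))
    if len(elements) < 2:
        return 0.0
    linked = {frozenset(pair) for pair in relations}
    pairs = list(combinations(elements, 2))
    connected = sum(1 for a, b in pairs if frozenset((a, b)) in linked)
    return connected / len(pairs)


def select_cdts(topics, relations, threshold=0.5):
    """Return the names of topics whose cohesion reaches the threshold."""
    return [
        name
        for name, elements in topics.items()
        if topic_cohesion(elements, relations) >= threshold
    ]


# Toy data: structural relationships between classes, and two raw topics.
RELATIONS = [("Account", "Ledger"), ("Ledger", "Transfer"), ("Account", "Transfer")]
TOPICS = {
    "banking": ["Account", "Ledger", "Transfer"],  # elements collaborate
    "logging": ["Account", "Logger", "Parser"],    # elements are scattered
}

cdts = select_cdts(TOPICS, RELATIONS)
```

In this toy setting the "banking" topic is kept because all of its elements are structurally related, while the "logging" topic, whose elements share no relationships, is treated as noise.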
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xie, B., Li, M., Jin, J., Zhao, J., Zou, Y. (2013). Mining Cohesive Domain Topics from Source Code. In: Favaro, J., Morisio, M. (eds) Safe and Secure Software Reuse. ICSR 2013. Lecture Notes in Computer Science, vol 7925. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38977-1_16
DOI: https://doi.org/10.1007/978-3-642-38977-1_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38976-4
Online ISBN: 978-3-642-38977-1
eBook Packages: Computer Science (R0)