Abstract
Context
Comprehending a new code base is one of the most time-consuming tasks for developers. Labelling source code files with meaningful annotations can help developers understand the content and functionality of a code base more quickly. However, most existing solutions for code annotation focus on project-level classification, because manually labelling individual files is time-consuming, error-prone and hard to scale.
Objective
This paper aims to automate the annotation of files by leveraging project-level labels, and to use the resulting file-level annotations to annotate items at coarser levels of granularity, such as packages and the whole project.
Method
We propose a novel approach that annotates source code files through weak labelling followed by hierarchical aggregation. We investigate whether this approach is effective in producing multi-granular annotations of software projects, which can help developers understand the content and functionality of a code base more quickly.
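To illustrate what hierarchical aggregation could look like, the following minimal sketch rolls file-level label distributions up to packages and then to the project. It is an assumption-laden illustration, not the method evaluated in the paper: the averaging rule, the use of the parent directory as the package, and the names `mean_distribution`, `aggregate` and `file_labels` are all hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def mean_distribution(distributions):
    """Sum a list of label -> score dicts and renormalise so scores sum to 1."""
    totals = defaultdict(float)
    for dist in distributions:
        for label, score in dist.items():
            totals[label] += score
    norm = sum(totals.values())
    if not norm:
        return {}
    return {label: score / norm for label, score in totals.items()}


def aggregate(file_labels):
    """Roll file-level distributions up to packages, then to the project.

    Assumption: a file's package is its parent directory, and each level
    is the renormalised mean of the level below it.
    """
    by_package = defaultdict(list)
    for path, dist in file_labels.items():
        by_package[str(PurePosixPath(path).parent)].append(dist)
    packages = {pkg: mean_distribution(dists) for pkg, dists in by_package.items()}
    project = mean_distribution(list(packages.values()))
    return packages, project


# Hypothetical (noisy) file-level annotations for a toy project.
file_labels = {
    "src/net/http.py": {"networking": 0.8, "parsing": 0.2},
    "src/net/socket.py": {"networking": 1.0},
    "src/ui/window.py": {"gui": 1.0},
}
packages, project = aggregate(file_labels)
```

In this toy input, the `parsing` label appears only in one file, yet it survives in both the `src/net` package distribution and the project distribution. This is the kind of file-level information that a single project-level label would not expose.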
Results
We evaluate the quality of the annotations using a combination of human assessment and automated metrics. Our approach correctly annotated 50% of files and more than 50% of packages. Moreover, the information captured at the file level allowed us to identify, on average, three new relevant labels for any given project.
Conclusions
We conclude that the proposed approach is a convenient and promising way to generate noisy (not precise) annotations for files. Furthermore, hierarchical aggregation effectively preserves the information captured at the file level, which can then be propagated to packages and to the project as a whole.
Data Availability Statement
The dataset used and generated artefacts are available in a Zenodo repository: https://zenodo.org/record/7943882. The code is available at the following repository: https://github.com/SasCezar/CodeGraphClassification
Notes
In natural language, a hypernym is a broader term, whereas a hyponym is a more specialised one. For example, ‘Deep Learning’ is the hypernym, while ‘Convolutional Neural Network’ (or ‘CNN’) is the hyponym.
Ethics declarations
Conflicts of interest
The authors declared that they have no conflict of interest.
Additional information
Communicated by: Xin Peng.
About this article
Cite this article
Sas, C., Capiluppi, A. Multi-granular software annotation using file-level weak labelling. Empir Software Eng 29, 12 (2024). https://doi.org/10.1007/s10664-023-10423-7
DOI: https://doi.org/10.1007/s10664-023-10423-7