Abstract
Professional developers, and especially students learning to program, often write poor documentation. While automated assessment of programming is becoming more common in educational settings, typically using unit tests for code functionality and static analysis for code quality, documentation assessment is usually limited to detecting the presence and correct formatting of a docstring according to a specified style guide. We investigate how machine learning can be used to help automate the assessment of documentation quality. We classify a large, publicly available set of human-annotated relevance scores between a natural language string and a code string, using traditional approaches such as Logistic Regression and Random Forest, fine-tuned large language models such as BERT and GPT, and Low-Rank Adaptation (LoRA) of large language models. Our most accurate model was a fine-tuned CodeBERT model, which achieved a test accuracy of 89%.
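As a concrete illustration of the strongest approach, the sketch below fine-tunes CodeBERT [10] as a sequence classifier over (documentation, code) pairs with the HuggingFace Transformers library [27]. It is a minimal sketch, not our full pipeline (see the repository linked below for that): the binary labels, the JSONL field names (`docstring`, `code`, `label`), and the hyperparameters are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the full pipeline): fine-tune
# CodeBERT [10] to classify documentation/code relevance with the
# HuggingFace Transformers library [27].
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2  # assumed: relevant vs. not
)

def tokenize(batch):
    # CodeBERT is pre-trained on bimodal (natural language, code) input,
    # so the docstring and the code are passed as the two text segments.
    return tokenizer(
        batch["docstring"], batch["code"],
        truncation=True, padding="max_length", max_length=512,
    )

# Hypothetical JSONL files with "docstring", "code", and "label" fields;
# the underlying annotations come from CodeSearchNet [15].
data = load_dataset(
    "json", data_files={"train": "train.jsonl", "test": "test.jsonl"}
).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="codebert-doc-grader",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()
print(trainer.evaluate())  # reports eval loss; add a metrics fn for accuracy
```

For the LoRA variant [14], the same base model could be wrapped in low-rank adapters (for example via the PEFT library) before training, so that only a small fraction of the weights is updated during fine-tuning.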
Data Availability Statement
All of our raw data, data-processing and model-training code, and results can be found on GitHub (data processing repository: https://github.com/m-messer/Grading-Documentation-with-Machine-Learning).
Notes
1. CheckStyle – a style guide enforcement utility: https://checkstyle.sourceforge.io/.
2. In CodeSearchNet, "documentation" and "natural language query" are used interchangeably.
3. HackerRank: https://www.hackerrank.com/.
4. GitHub Copilot: https://github.com/features/copilot.
5. Weights and Biases: https://wandb.ai/site.
References
Aggarwal, K., Singh, Y., Chhabra, J.: An integrated measure of software maintainability. In: Annual Reliability and Maintainability Symposium. 2002 Proceedings (Cat. No. 02CH37318), pp. 235–241 (2002). https://doi.org/10.1109/RAMS.2002.981648
Aghajani, E., Nagy, C., Linares-Vásquez, M., et al.: Software documentation: the practitioners’ perspective. In: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE 2020, pp. 590–601. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3377811.3380405
Aghajani, E., Nagy, C., Vega-Márquez, O.L., et al.: Software documentation issues unveiled. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 1199–1210 (2019). https://doi.org/10.1109/ICSE.2019.00122
Akiba, T., Sano, S., Yanase, T., et al.: Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, pp. 2623–2631. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3292500.3330701
Brown, N.C.C., Kölling, M., McCall, D., et al.: Blackbox: a large scale repository of novice programmers’ activity. In: Proceedings of the 45th ACM Technical Symposium on Computer Science Education, SIGCSE 2014, pp. 223–228. Association for Computing Machinery, New York, NY, USA (2014). https://doi.org/10.1145/2538862.2538924
Brown, T., Mann, B., Ryder, N., et al.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
Chen, M., Tworek, J., Jun, H., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)
Clement, C.B., Drain, D., Timcheck, J., et al.: PyMT5: multi-mode translation of natural language and Python code with transformers. arXiv preprint arXiv:2010.03150 (2020)
Devlin, J., Chang, M.W., Lee, K., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2019)
Feng, Z., Guo, D., Tang, D., et al.: CodeBERT: a pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020)
de Freitas, A., Coffman, J., de Freitas, M., et al.: FalconCode: a multiyear dataset of Python code samples from an introductory computer science course. In: Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, SIGCSE 2023, pp. 938–944. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3545945.3569822
Gerdes, J.: Developing applications to automatically grade introductory visual basic courses. In: AMCIS 2017 Proceedings, August 2017. https://aisel.aisnet.org/amcis2017/ISEducation/Presentations/28
Hossin, M., Sulaiman, M.N.: A review on evaluation metrics for data classification evaluations. Int. J. Data Min. Knowl. Manage. Process (IJDKP) 5, 1–11 (2015). https://doi.org/10.5121/ijdkp.2015.5201
Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
Husain, H., Wu, H.H., Gazit, T., et al.: CodeSearchNet challenge: evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2020)
King’s College London: King’s computational research, engineering and technology environment (CREATE) (2024). https://doi.org/10.18742/rnvf-m076
Koivisto, T., Hellas, A.: Evaluating CodeClusters for effectively providing feedback on code submissions. In: 2022 IEEE Frontiers in Education Conference (FIE), pp. 1–9 (2022). https://doi.org/10.1109/FIE56618.2022.9962751
LeClair, A., Haque, S., Wu, L., et al.: Improved code summarization via a graph neural network. In: Proceedings of the 28th International Conference on Program Comprehension, ICPC 2020, pp. 184–195. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3387904.3389268
Messer, M., Brown, N.C.C., Kölling, M., Shi, M.: Automated grading and feedback tools for programming education: a systematic review. ACM Trans. Comput. Educ. 24(1), 1–43 (2024). https://doi.org/10.1145/3636515
Messer, M., Brown, N.C.C., Kölling, M., et al.: Machine learning-based automated grading and feedback tools for programming: a meta-analysis. In: Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education, ITiCSE 2023, vol. 1, pp. 491–497. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3587102.3588822
Muuli, E., et al.: Automatic assessment of programming assignments using image recognition. In: Lavoué, É., Drachsler, H., Verbert, K., Broisin, J., Pérez-Sanagustín, M. (eds.) EC-TEL 2017. LNCS, vol. 10474, pp. 153–163. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66610-5_12
Pedregosa, F., Varoquaux, G., Gramfort, A., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002). https://doi.org/10.1145/505282.505283
Shi, E., Wang, Y., Du, L., et al.: On the evaluation of neural code summarization. In: Proceedings of the 44th International Conference on Software Engineering, ICSE 2022, pp. 1597–1608. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3510003.3510060
Treude, C., Middleton, J., Atapattu, T.: Beyond accuracy: assessing software documentation quality. In: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020, pp. 1509–1512. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3368089.3417045
Walker, O., Russell, N.: Automatic assessment of the design quality of Python programs with personalized feedback. In: Proceedings of the 14th International Conference on Educational Data Mining, pp. 495–501 (2021)
Wolf, T., Debut, L., Sanh, V., et al.: HuggingFace’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2020)
Zhang, J., Wang, X., Zhang, H., et al.: Retrieval-based neural source code summarization. In: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE 2020, pp. 1385–1397. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3377811.3380383
Acknowledgements
We thank the King’s College Teaching Fund for funding our study and CREATE [16] for providing the high-performance cluster we used to train and evaluate our models.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Messer, M., Shi, M., Brown, N.C.C., Kölling, M. (2024). Grading Documentation with Machine Learning. In: Olney, A.M., Chounta, I.A., Liu, Z., Santos, O.C., Bittencourt, I.I. (eds.) Artificial Intelligence in Education. AIED 2024. Lecture Notes in Computer Science, vol. 14829. Springer, Cham. https://doi.org/10.1007/978-3-031-64302-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-64301-9
Online ISBN: 978-3-031-64302-6