Grading Documentation with Machine Learning

  • Conference paper
  • Artificial Intelligence in Education (AIED 2024)

Abstract

Professional developers, and especially students learning to program, often write poor documentation. While automated assessment of programming is becoming more common in educational settings, often using unit tests for code functionality and static analysis for code quality, documentation assessment is typically limited to detecting the presence and correct formatting of a docstring according to a specified style guide. We investigate how machine learning can be utilised to help automate the assessment of documentation quality. We classify a large set of publicly available, human-annotated relevance scores between a natural language string and a code string, using traditional approaches such as Logistic Regression and Random Forest; fine-tuned large language models such as BERT and GPT; and Low-Rank Adaptation (LoRA) of large language models. Our most accurate model was a fine-tuned CodeBERT model, achieving a test accuracy of 89%.
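
The sketch below illustrates the headline approach in the abstract: fine-tuning CodeBERT as a sequence-pair classifier over (documentation, code) pairs with the HuggingFace transformers library [27]. It is a minimal sketch, not the authors' released pipeline (see the Data Availability Statement): the dataset field names, the binary label set, and every hyperparameter are illustrative assumptions rather than values from the paper.

    # Minimal sketch: fine-tuning CodeBERT for documentation-code relevance
    # classification. Field names and hyperparameters are assumptions.
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    MODEL_NAME = "microsoft/codebert-base"  # pre-trained CodeBERT checkpoint [10]

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=2)  # binary relevant/irrelevant, for illustration

    def encode(batch):
        # CodeBERT is pre-trained on (natural language, code) pairs, so the
        # documentation string and the code string form the two input segments.
        return tokenizer(batch["docstring"], batch["code"],
                         truncation=True, padding="max_length", max_length=512)

    # train_ds and eval_ds are assumed to be datasets.Dataset objects with
    # "docstring", "code", and integer "label" columns, e.g. built from the
    # CodeSearchNet human annotations [15]:
    # train_ds = train_ds.map(encode, batched=True)

    args = TrainingArguments(
        output_dir="codebert-doc-grading",
        num_train_epochs=3,                # illustrative values only
        per_device_train_batch_size=16,
        learning_rate=2e-5,
        evaluation_strategy="epoch",
    )
    # trainer = Trainer(model=model, args=args,
    #                   train_dataset=train_ds, eval_dataset=eval_ds)
    # trainer.train()

The Low-Rank Adaptation variant mentioned in the abstract would wrap the same classifier so that only small low-rank update matrices are trained (e.g. via a LoRA configuration [14]) rather than all model weights.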


Data Availability Statement

All of our raw data, data processing code, model training code, and results can be found on GitHub (data processing repository: https://github.com/m-messer/Grading-Documentation-with-Machine-Learning).

Notes

  1. CheckStyle, a style guide enforcement utility: https://checkstyle.sourceforge.io/.

  2. In CodeSearchNet, "documentation" and "natural language query" are used interchangeably.

  3. HackerRank: https://www.hackerrank.com/.

  4. GitHub Copilot: https://github.com/features/copilot.

  5. Weights and Biases: https://wandb.ai/site.

References

  1. Aggarwal, K., Singh, Y., Chhabra, J.: An integrated measure of software maintainability. In: Annual Reliability and Maintainability Symposium. 2002 Proceedings (Cat. No. 02CH37318), pp. 235–241 (2002). https://doi.org/10.1109/RAMS.2002.981648

  2. Aghajani, E., Nagy, C., Linares-Vásquez, M., et al.: Software documentation: the practitioners’ perspective. In: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE 2020, pp. 590–601. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3377811.3380405

  3. Aghajani, E., Nagy, C., Vega-Márquez, O.L., et al.: Software documentation issues unveiled. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 1199–1210 (2019). https://doi.org/10.1109/ICSE.2019.00122

  4. Akiba, T., Sano, S., Yanase, T., et al.: Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, pp. 2623–2631. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3292500.3330701

  5. Brown, N.C.C., Kölling, M., McCall, D., et al.: Blackbox: a large scale repository of novice programmers’ activity. In: Proceedings of the 45th ACM Technical Symposium on Computer Science Education, SIGCSE 2014, pp. 223–228. Association for Computing Machinery, New York, NY, USA (2014). https://doi.org/10.1145/2538862.2538924

  6. Brown, T., Mann, B., Ryder, N., et al.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper%5Ffiles/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

  7. Chen, M., Tworek, J., Jun, H., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  8. Clement, C.B., Drain, D., Timcheck, J., et al.: PyMT5: multi-mode translation of natural language and Python code with transformers. arXiv preprint arXiv:2010.03150 (2020)

  9. Devlin, J., Chang, M.W., Lee, K., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2019)

  10. Feng, Z., Guo, D., Tang, D., et al.: CodeBERT: a pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020)

  11. de Freitas, A., Coffman, J., de Freitas, M., et al.: FalconCode: a multiyear dataset of Python code samples from an introductory computer science course. In: Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, SIGCSE 2023, pp. 938–944. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3545945.3569822

  12. Gerdes, J.: Developing applications to automatically grade introductory visual basic courses. In: AMCIS 2017 Proceedings, August 2017. https://aisel.aisnet.org/amcis2017/ISEducation/Presentations/28

  13. Hossin, M., Sulaiman, M.N.: A review on evaluation metrics for data classification evaluations. Int. J. Data Min. Knowl. Manage. Process (IJDKP) 5, 1–11 (2015). https://doi.org/10.5121/ijdkp.2015.5201

  14. Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)

  15. Husain, H., Wu, H.H., Gazit, T., et al.: CodeSearchNet challenge: evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2020)

  16. King’s College London: King’s computational research, engineering and technology environment (CREATE) (2024). https://doi.org/10.18742/rnvf-m076

  17. Koivisto, T., Hellas, A.: Evaluating CodeClusters for effectively providing feedback on code submissions. In: 2022 IEEE Frontiers in Education Conference (FIE), pp. 1–9 (2022). https://doi.org/10.1109/FIE56618.2022.9962751

  18. LeClair, A., Haque, S., Wu, L., et al.: Improved code summarization via a graph neural network. In: Proceedings of the 28th International Conference on Program Comprehension, ICPC 2020, pp. 184–195. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3387904.3389268

  19. Messer, M., Brown, N.C.C., Kölling, M., Shi, M.: Automated grading and feedback tools for programming education: a systematic review. ACM Trans. Comput. Educ. 24(1), 1–43 (2024). https://doi.org/10.1145/3636515

  20. Messer, M., Brown, N.C.C., Kölling, M., et al.: Machine learning-based automated grading and feedback tools for programming: a meta-analysis. In: Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education, ITiCSE 2023, vol. 1, pp. 491–497. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3587102.3588822

  21. Muuli, E., et al.: Automatic assessment of programming assignments using image recognition. In: Lavoué, É., Drachsler, H., Verbert, K., Broisin, J., Pérez-Sanagustín, M. (eds.) EC-TEL 2017. LNCS, vol. 10474, pp. 153–163. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66610-5_12

  22. Pedregosa, F., Varoquaux, G., Gramfort, A., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  23. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002). https://doi.org/10.1145/505282.505283

  24. Shi, E., Wang, Y., Du, L., et al.: On the evaluation of neural code summarization. In: Proceedings of the 44th International Conference on Software Engineering, ICSE 2022, pp. 1597–1608. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3510003.3510060

  25. Treude, C., Middleton, J., Atapattu, T.: Beyond accuracy: assessing software documentation quality. In: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020, pp. 1509–1512. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3368089.3417045

  26. Walker, O., Russell, N.: Automatic assessment of the design quality of Python programs with personalized feedback. In: Proceedings of the 14th International Conference on Educational Data Mining, pp. 495–501 (2021)

  27. Wolf, T., Debut, L., Sanh, V., et al.: HuggingFace’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2020)

  28. Zhang, J., Wang, X., Zhang, H., et al.: Retrieval-based neural source code summarization. In: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE 2020, pp. 1385–1397. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3377811.3380383

Acknowledgements

We thank the King’s College Teaching Fund for funding our study and CREATE [16] for providing the high-performance cluster we used to train and evaluate our models.

Author information

Correspondence to Marcus Messer.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Messer, M., Shi, M., Brown, N.C.C., Kölling, M. (2024). Grading Documentation with Machine Learning. In: Olney, A.M., Chounta, I.-A., Liu, Z., Santos, O.C., Bittencourt, I.I. (eds.) Artificial Intelligence in Education. AIED 2024. Lecture Notes in Computer Science, vol. 14829. Springer, Cham. https://doi.org/10.1007/978-3-031-64302-6_8

  • DOI: https://doi.org/10.1007/978-3-031-64302-6_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-64301-9

  • Online ISBN: 978-3-031-64302-6

  • eBook Packages: Computer Science, Computer Science (R0)
