On Data Analysis of Software Repositories

  • Dmitry Namiot
  • Vladimir Romanov
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1140)

Abstract

This article discusses the analysis of software repositories using data analysis methods. It reviews methods for analyzing programs based on information retrieved from the program code stored in code repositories, and surveys work that applies classification, clustering, and deep learning to software development. Applications include classifying and predicting errors, analyzing how code properties change as the code evolves, detecting design flaws and technical debt, and assisting with code refactoring. The ultimate goal of all such models is, of course, the automation of programming. In practice, the tasks addressed are simpler: information retrieval over program code, error prediction, clone detection, link analysis, evolution analysis, and so on. We first discuss recurrent neural networks and their use in the analysis of software repositories. In the simplest case, a recurrent network models a programming language as a sequence of characters. The paper also covers clustering and topic modeling.
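
To make the character-level view concrete, the following is a minimal sketch of a vanilla recurrent network reading a code fragment one character at a time and producing a probability distribution over the next character. It is not taken from the paper: the function names, the hidden size, the weight initialization, and the sample snippet are all assumptions chosen for illustration, and the network is untrained, so the resulting distribution carries no predictive value.

```python
import math
import random


def build_vocab(source):
    """Map each distinct character in the source text to an integer index."""
    return {ch: i for i, ch in enumerate(sorted(set(source)))}


def rnn_step(x_idx, h, Wxh, Whh, Why):
    """One vanilla RNN step: update the hidden state, return next-char probabilities."""
    hidden = len(h)
    # The input is one-hot, so multiplying by Wxh reduces to selecting column x_idx.
    h_new = [
        math.tanh(Wxh[j][x_idx] + sum(Whh[j][k] * h[k] for k in range(hidden)))
        for j in range(hidden)
    ]
    logits = [sum(Why[v][j] * h_new[j] for j in range(hidden)) for v in range(len(Why))]
    # Numerically stable softmax over the character vocabulary.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return h_new, [e / total for e in exps]


def next_char_distribution(source, hidden=8, seed=0):
    """Feed the source through an UNTRAINED RNN, one character per time step."""
    rng = random.Random(seed)
    vocab = build_vocab(source)
    V = len(vocab)

    def init(rows, cols):
        # Small random weights; a real model would learn these from a code corpus.
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    Wxh, Whh, Why = init(hidden, V), init(hidden, hidden), init(V, hidden)
    h = [0.0] * hidden
    probs = [1.0 / V] * V
    for ch in source:
        h, probs = rnn_step(vocab[ch], h, Wxh, Whh, Why)
    return vocab, probs


# Hypothetical input: any program text can be treated as a character sequence.
snippet = "for (int i = 0; i < n; i++) { sum += a[i]; }"
vocab, probs = next_char_distribution(snippet)
```

Here `probs[vocab[c]]` is the model's estimate of the probability that character `c` follows the snippet. Training such a model on a repository would mean adjusting the three weight matrices so that these estimates match the characters actually observed.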

Keywords

Data science; Recurrent neural networks; Classification; Clustering; Software metrics; Architectural technical debt; Software repositories; Software engineering

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Lomonosov Moscow State University, Moscow, Russia
