Progress in Artificial Intelligence, Volume 7, Issue 3, pp 237–247

Emerging topics in mining software repositories

Machine learning in software repositories and datasets
  • Diego Güemes-Peña
  • Carlos López-Nozal
  • Raúl Marticorena-Sánchez
  • Jesús Maudes-Raedo
Regular Paper


A software process is a set of related activities that culminates in the production of a software package: specification, design, implementation, testing, evolution into new versions, and maintenance. Supporting activities include configuration and change management, quality assurance, project management, and evaluation of user experience. Software repositories are the infrastructure that supports all these activities. They can be composed of several systems, including code change management, bug tracking, code review, build systems, release binaries, wikis, and forums. This position paper on mining software repositories presents a review and a discussion of research in this field over the past decade. We also identify applied machine learning strategies, current working topics, and future challenges for the improvement of company decision-making systems. Machine learning is defined as the process of discovering patterns in data. It can be applied to software repositories, since every change is recorded as data. Companies can then use these patterns as the basis for their decision-making systems and for knowledge discovery.


Data mining, Machine learning, Software repository, Software process, Software engineering



We would like to thank the Ministerio de Economía y Competitividad of the Spanish Government for financing the project TIN2015-67534-P (MINECO/FEDER, UE) and the Junta de Castilla y León for financing the project BU085P17 (JCyL/FEDER, UE), both co-financed by European Union FEDER funds.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. University of Burgos, Burgos, Spain
