Creating Evolving Project Data Sets in Software Engineering

  • Tomasz Lewowski
  • Lech Madeyski
Chapter
Part of the Studies in Computational Intelligence book series (SCI, volume 851)

Abstract

While the amount of research in software engineering is ever increasing, selecting a research data set remains a challenge. Quite a number of data sets have been proposed, but we still lack a systematic approach to creating ones that would evolve together with the industry. We aim to present a systematic method of selecting data sets of industry-relevant software projects for the purposes of software engineering research. We present a set of guidelines for filtering GitHub projects and implement those guidelines in the form of an R script. In particular, we select mostly projects from the biggest industrial open source contributors and remove from the data set any project that falls in the first quartile of any of several categories. We use the latest GitHub GraphQL API to select the desired set of repositories, and we evaluate the technique on Java projects. The presented technique systematizes the creation and evolution of software development data sets. The proposed algorithm has reasonable precision (between 0.65 and 0.80) and can be used as a baseline for further refinements.
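The quartile-based filtering step described above can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation (which, per the abstract, is an R script operating on GitHub GraphQL API results); the metric names and the data layout are assumptions made for the example.

```python
import statistics

def filter_first_quartile(projects, metrics):
    """Keep only projects above the first quartile of every given metric,
    with each quartile boundary computed over the full data set."""
    kept = projects
    for metric in metrics:
        values = [p[metric] for p in projects]
        q1 = statistics.quantiles(values, n=4)[0]  # first-quartile boundary
        kept = [p for p in kept if p[metric] > q1]
    return kept

# Hypothetical example: eight projects described by star and commit counts.
projects = [{"stars": i, "commits": 10 * i} for i in range(1, 9)]
survivors = filter_first_quartile(projects, ["stars", "commits"])
print(len(survivors))  # prints 6: the bottom quartile is dropped
```

Computing each quartile over the full data set (rather than over the already-filtered remainder) keeps the rule stable regardless of the order in which metrics are applied.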

Keywords

Dataset selection · Software project dataset · Dataset evolution · Software engineering · Reproducible research · Mining software repositories

Notes

Acknowledgements

This work has been conducted as a part of research and development project POIR.01.01.01-00-0792/16 supported by the National Centre for Research and Development (NCBiR). We would like to thank Tomasz Korzeniowski and Marek Skrajnowski from code quest sp. z o.o. for all of the comments and feedback from the real-world software engineering environment.


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Faculty of Computer Science and Management, Wroclaw University of Science and Technology, Wroclaw, Poland
