Standing on shoulders or feet? An extended study on the usage of the MSR data papers

Abstract

The establishment of the Mining Software Repositories (MSR) data showcase conference track has encouraged researchers to provide data sets as a basis for further empirical studies. The objective of this study is to examine the usage of data papers published in the MSR proceedings in terms of use frequency, users, and use purpose. Data track papers were collected from the MSR data showcase track and through the manual inspection of older MSR proceedings. The use of data papers was established through manual citation searching, followed by reading the citing studies and dividing them into strong and weak citations. Unlike weak citations, strong citations actually use the data set of a data paper. Data papers were then manually clustered based on their content, whereas their strong citations were classified by hand according to the knowledge areas of the Guide to the Software Engineering Body of Knowledge. A survey study of 108 authors and users of data papers provided further insights regarding motivation and effort in data paper production, encouraging and discouraging factors in data set use, and desired future directions for data papers. We found that 65% of the data papers have been used in other studies, with a long-tail distribution in the number of strong citations. Weak citations to data papers usually refer to them as an example. In total, MSR data papers are cited less than other MSR papers. A considerable number of the strong citations stem from the teams that authored the data papers. Publications providing Version Control System (VCS) primary and derived data are the most frequent data papers and the most often strongly cited ones. Enhanced developer data papers are the least common ones, and the second least frequently strongly cited. Data paper authors tend to gather data in the context of other research. Users of data sets appreciate high data quality and are discouraged by a lack of replicability in data set construction. Respondents suggested two kinds of data for future data papers: data related to machine learning and data derived from the manufacturing sector. Overall, data papers have provided the foundation for a significant number of studies, but there is room for improvement in their utilization. This can be done by setting a higher bar for their publication, by encouraging their use, by promoting open science initiatives, and by providing incentives for the enrichment of existing data collections.

Notes

  1. 1968 ACM Turing Award Lecture (Hamming 1969)

  2. https://github.com/acmsigsoft/artifact-evaluation

  3. https://github.com/dspinellis/awesome-msr

  4. https://doi.org/10.5281/zenodo.3709219

  5. https://www.webofknowledge.com

  6. https://scholar.google.com/

  7. https://www.scopus.com/

  8. https://dl.acm.org/

  9. https://dblp.org/

  10. https://arxiv.org/corr

  11. https://ghtorrent.org

  12. https://github.com/ghtorrent/ghtorrent.org

References

  1. Aivaloglou E, Hermans F, Moreno-León J, Robles G (2017) A dataset of scratch programs: scraped, shaped and scored. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.45. IEEE Press, Piscataway, MSR ’17, pp 511–514

  2. Allix K, Bissyandé TF, Klein J, Le Traon Y (2016) Androzoo: collecting millions of android apps for the research community. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903508. ACM, New York, MSR ’16, pp 468–471

  3. Almakadmeh M, Abran A (2017) The ISBSG software project repository: an analysis from six sigma measurement perspective for software defect estimation. Journal of Software Engineering and Applications 10(8):693–720. https://doi.org/10.4236/jsea.2017.108038

  4. Altinger H, Siegl S, Dajsuren Y, Wotawa F (2015) A novel industry grade dataset for fault prediction based on model-driven developed automotive embedded software. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.72. IEEE Press, Piscataway, MSR ’15, pp 494–497

  5. Amann S, Nadi S, Nguyen HA, Nguyen TN, Mezini M (2016) MUBench: a benchmark for API-misuse detectors. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903506. ACM, New York, MSR ’16, pp 464–467

  6. Baldassari B, Preux P (2014) Understanding software evolution: the Maisqual Ant data set. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597136. ACM, New York, MSR ’14, pp 424–427

  7. Barik T, Lubick K, Smith J, Slankas J, Murphy-Hill E (2015) FUSE: a reproducible, extendable, internet-scale corpus of spreadsheets. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.70. IEEE Press, Piscataway, MSR ’15, pp 486–489

  8. Binkley D, Lawrie D, Pollock L, Hill E, Vijay-Shanker K (2013) A dataset for evaluating identifier splitters. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624055. IEEE Press, Piscataway, MSR ’13, pp 401–404

  9. Bloemen R, Amrit C, Kuhlmann S, Ordóñez Matamoros G (2014) Gentoo package dependencies over time. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597131. ACM, New York, MSR ’14, pp 404–407

  10. Boisvert RF (2016) Incentivizing reproducibility. Commun ACM 59(10):5. https://doi.org/10.1145/2994031

  11. Bourque P, Fair RE (eds) (2014) Guide to the Software Engineering Body of Knowledge, version 3.0 edn. IEEE Computer Society, New York, http://www.swebok.org

  12. Bradford SC (1985) Sources of information on specific subjects 1934. Journal of Information Science 10(4):176–180. https://doi.org/10.1177/016555158501000407

  13. Brereton P, Kitchenham BA, Budgen D, Turner M, Khalil M (2007) Lessons from applying the systematic literature review process within the software engineering domain. J Syst Softw 80(4):571–583. https://doi.org/10.1016/j.jss.2006.07.009

  14. Butler S, Wermelinger M, Yu Y, Sharp H (2013) INVocD: identifier name vocabulary dataset. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624056. IEEE Press, Piscataway, MSR ’13, pp 405–408

  15. Chametzky B (2016) Coding in classic grounded theory: I’ve done an interview; now what? Sociology Mind 06:163–172. https://doi.org/10.4236/sm.2016.64014

  16. Chatzidimitriou KC, Papamichail MD, Diamantopoulos T, Tsapanos M, Symeonidis AL (2018) npm-miner: an infrastructure for measuring the quality of the npm registry. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196465. ACM, New York, MSR ’18, pp 42–45

  17. Cheikhi L, Abran A (2013) PROMISE and ISBSG software engineering data repositories: a survey. In: Proceedings of the joint conference of the 23rd international workshop on software measurement and the 8th international conference on software process and product measurement. https://doi.org/10.1109/IWSM-Mensura.2013.13. IEEE Press, Piscataway, IWSM-Mensura ’13, pp 17–24

  18. Conklin M, Howison J, Crowston K (2005) Collaboration using OSSmole: a repository of FLOSS data and analyses. In: Proceedings of the 2nd international workshop on mining software repositories. https://doi.org/10.1145/1082983.1083164. ACM, New York, MSR ’05, pp 1–5

  19. Cukic B (2005) Guest editor’s introduction: the promise of public software engineering data repositories. IEEE Software 22(6):20–22. https://doi.org/10.1109/MS.2005.153

  20. Dit B, Holtzhauer A, Poshyvanyk D, Kagdi H (2013) A dataset from change history to support evaluation of software maintenance tasks. In: Proceedings of the 10th working conference on mining software repositories. IEEE Press, Piscataway, MSR ’13, pp 131–134. https://doi.org/10.1109/MSR.2013.6624019

  21. Efstathiou V, Chatzilenas C, Spinellis D (2018) Word embeddings for the software engineering domain. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196448. ACM, New York, MSR ’18, pp 38–41

  22. Farah G, Tejada JS, Correal D (2014) OpenHub: a scalable architecture for the analysis of software quality attributes. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597135. ACM, New York, MSR ’14, pp 420–423

  23. Ferenc R, Tóth Z, Ladányi G, Siket I, Gyimóthy T (2018) A public unified bug dataset for Java. In: Proceedings of the 14th international conference on predictive models and data analytics in software engineering. https://doi.org/10.1145/3273934.3273936. ACM, New York, PROMISE ’18, pp 12–21

  24. de Freitas FG, de Souza JT (2011) Ten years of search based software engineering: a bibliometric analysis. In: Cohen MB, Ó Cinnéide M (eds) Proceedings of the 3rd international symposium on search based software engineering. https://link.springer.com/chapter/10.1007/978-3-642-23716-4_5. Springer, Berlin, SSBSE ’11, pp 18–32

  25. Fujiwara K, Hata H, Makihara E, Fujihara Y, Nakayama N, Iida H, Matsumoto K (2014) Kataribe: a hosting service of historage repositories. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597125. ACM, New York, MSR ’14, pp 380–383

  26. Furnham A (1986) Response bias, social desirability and dissimulation. Personality and Individual Differences 7(3):385–400. https://doi.org/10.1016/0191-8869(86)90014-0

  27. Gao J, Yang X, Jiang Y, Liu H, Ying W, Zhang X (2018) JBench: a dataset of data races for concurrency testing. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196451. ACM, New York, MSR ’18, pp 6–9

  28. Geiger FX, Malavolta I, Pascarella L, Palomba F, Di Nucci D, Bacchelli A (2018) A graph-based dataset of commit history of real-world android apps. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196460. ACM, New York, MSR ’18, pp 30–33

  29. German DM, Adams B, Hassan AE (2015) A dataset of the activity of the Git super-repository of Linux in 2012. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.66. IEEE Press, Piscataway, MSR ’15, pp 470–473

  30. Gkortzis A, Mitropoulos D, Spinellis D (2018) VulinOSS: a dataset of security vulnerabilities in open-source systems. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196454. ACM, New York, MSR ’18, pp 18–21

  31. Glaser B, Strauss A (1967) The Discovery of Grounded Theory: Strategies for Qualitative Research Observations (Chicago Ill.), Aldine Publishing

  32. Glass RL (1994) An assessment of systems and software engineering scholars and institutions. Journal of Systems and Software 27(1):63–67. https://doi.org/10.1016/0164-1212(94)90115-5

  33. Goeminne M, Claes M, Mens T (2013) A historical dataset for the Gnome ecosystem. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624032. IEEE Press, Piscataway, MSR ’13, pp 225–228

  34. Gonzalez-Barahona JM, Robles G, Izquierdo-Cortazar D (2015) The MetricsGrimoire database collection. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.68. IEEE Press, Piscataway, MSR ’15, pp 478–481

  35. Gousios G (2013) The GHTorrent dataset and tool suite. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624034. IEEE Press, Piscataway, MSR ’13, pp 233–236

  36. Gousios G, Zaidman A (2014) A dataset for pull-based development research. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597122. ACM, New York, MSR ’14, pp 368–371

  37. Gousios G, Vasilescu B, Serebrenik A, Zaidman A (2014) Lean GHTorrent: GitHub data on demand. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597126. ACM, New York, MSR ’14, pp 384–387

  38. Gousios G, Storey MA, Bacchelli A (2016) Work practices and challenges in pull-based development: the contributor’s perspective. In: Proceedings of the 38th international conference on software engineering. https://doi.org/10.1145/2884781.2884826. Association for Computing Machinery, New York, ICSE ’16, pp 285–296

  39. Gu Y (2004) Global knowledge management research: a bibliometric analysis. Scientometrics 61(2):171–190. https://doi.org/10.1023/B:SCIE.0000041647.01086.f4

  40. Habayeb M, Miranskyy A, Murtaza SS, Buchanan L, Bener AB (2015) The Firefox temporal defect dataset. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.73. IEEE Press, Piscataway, MSR ’15, pp 498–501

  41. Hamasaki K, Kula RG, Yoshida N, Cruz AEC, Fujiwara K, Iida H (2013) Who does what during a code review? Datasets of OSS peer review repositories. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624003. IEEE Press, Piscataway, MSR ’13, pp 49–52

  42. Hamming R (1969) One man’s view of computer science. Journal of the ACM 16(1):3–12. https://doi.org/10.1145/321495.321497

  43. Hardwicke TE, Ioannidis JP (2018) Mapping the universe of registered reports. Nature Human Behaviour 2(11):793–796

  44. Harman M, Mansouri SA, Zhang Y (2009) Search based software engineering: a comprehensive analysis and review of trends techniques and applications. Tech. Rep. TR-09-03, Department of Computer Science, King’s College London, and Brunel Business School, Brunel University, London, UK, https://www.researchgate.net/profile/Yuanyuan_Zhang12/publication/228671024_Search_Based_Software_Engineering_A_Comprehensive_Analysis_and_Review_of_Trends_Techniques_and_Applications/links/00b4951811ba6a40eb000000.pdf

  45. Janjic W, Hummel O, Schumacher M, Atkinson C (2013) An unabridged source code dataset for research in software reuse. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624047. IEEE Press, Piscataway, MSR ’13, pp 339–342

  46. Karakoidas V, Mitropoulos D, Louridas P, Gousios G, Spinellis D (2015) Generating the blueprints of the Java ecosystem. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.76. IEEE Press, Piscataway, MSR ’15, pp 510–513

  47. Keivanloo I, Forbes C, Hmood A, Erfani M, Neal C, Peristerakis G, Rilling J (2012) A linked data platform for mining software repositories. In: Proceedings of the 9th working conference on mining software repositories. https://doi.org/10.1109/MSR.2012.6224296. IEEE Press, Piscataway, MSR ’12, pp 32–35

  48. Kim S, Zimmermann T, Kim M, Hassan A, Mockus A, Girba T, Pinzger M, Whitehead EJ Jr, Zeller A (2006) TA-RE: an exchange language for mining software repositories. In: Proceedings of the 3rd international workshop on mining software repositories. https://doi.org/10.1145/1137983.1137990. ACM, New York, MSR ’06, pp 22–25

  49. Kitchenham B (2004) Procedures for performing systematic reviews. Tech. Rep. TR/SE-0401, Department of Computer Science, Keele University, Keele, Staffs, UK, http://www.it.hiof.no/!haraldh/misc/2016-08-22-smat/Kitchenham-Systematic-Review-2004.pdf

  50. Kitchenham B, Pfleeger SL (2003) Principles of survey research: Part 6: data analysis. SIGSOFT Softw Eng Notes 28(2):24–27. https://doi.org/10.1145/638750.638758

  51. Kitchenham BA, Pfleeger SL (2002a) Principles of survey research: Part 2: designing a survey. SIGSOFT Softw Eng Notes 27(1):18–20. https://doi.org/10.1145/566493.566495

  52. Kitchenham BA, Pfleeger SL (2002b) Principles of survey research: Part 3: constructing a survey instrument. SIGSOFT Softw Eng Notes 27(2):20–24. https://doi.org/10.1145/511152.511155

  53. Kitchenham BA, Pfleeger SL (2002c) Principles of survey research: Part 4: questionnaire evaluation. SIGSOFT Softw Eng Notes 27(3):20–23. https://doi.org/10.1145/638574.638580

  54. Kitchenham BA, Pfleeger SL (2002d) Principles of survey research: Part 5: populations and samples. SIGSOFT Softw Eng Notes 27(5):17–20. https://doi.org/10.1145/571681.571686

  55. Kitchenham BA, Pfleeger SL, Pickard LM, Jones PW, Hoaglin DC, Emam KE, Rosenberg J (2002) Preliminary guidelines for empirical research in software engineering. IEEE Trans Softw Eng 28(8):721–734. https://doi.org/10.1109/TSE.2002.1027796

  56. Kotti Z, Spinellis D (2019) Standing on shoulders or feet?: The usage of the MSR data papers. In: Proceedings of the 16th international conference on mining software repositories. https://doi.org/10.1109/MSR.2019.00085. IEEE Press, Piscataway, MSR ’19, pp 565–576

  57. von Krogh G, von Hippel E (2006) The promise of research on open source software. Management Science 52(7):975–983. https://doi.org/10.1287/mnsc.1060.0560

  58. Krüger S, Späth J, Ali K, Bodden E, Mezini M (2018) CrySL: an extensible approach to validating the correct usage of cryptographic APIs. In: Millstein T (ed) Proceedings of the 32nd European conference on object-oriented programming. https://doi.org/10.4230/LIPIcs.ECOOP.2018.10, vol 109. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, ECOOP ’18, pp 10:1–10:27

  59. Krüger S, Späth J, Ali K, Bodden E, Mezini M (2019) CrySL: an extensible approach to validating the correct usage of cryptographic APIs. IEEE Transactions on Software Engineering. https://doi.org/10.1109/TSE.2019.2948910

  60. Krutz DE, Le W (2014) A code clone oracle. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597127. ACM, New York, MSR ’14, pp 388–391

  61. Krutz DE, Mirakhorli M, Malachowsky SA, Ruiz A, Peterson J, Filipski A, Smith J (2015) A dataset of open-source android applications. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.79. IEEE Press, Piscataway, MSR ’15, pp 522–525

  62. Kupferschmidt K (2018) More and more scientists are preregistering their studies. Should you? Science. https://doi.org/10.1126/science.aav4786

  63. Lamkanfi A, Pérez J, Demeyer S (2013) The Eclipse and Mozilla defect tracking dataset: a genuine dataset for mining bug information. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624028. IEEE Press, Piscataway, MSR ’13, pp 203–206

  64. Lavazza L, Santillo L (2012) Historical data repositories in software engineering: status and possible improvements. In: Proceedings of the 2012 joint conference of the 22nd international workshop on software measurement and the 2012 seventh international conference on software process and product measurement. https://doi.org/10.1109/IWSM-MENSURA.2012.39, pp 221–225

  65. Lazar A, Ritchey S, Sharif B (2014) Generating duplicate bug datasets. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597128. ACM, New York, MSR ’14, pp 392–395

  66. Liebchen GA, Shepperd M (2008) Data sets and data quality in software engineering. In: Proceedings of the 4th international workshop on predictor models in software engineering. https://doi.org/10.1145/1370788.1370799. ACM, New York, PROMISE ’08, pp 39–44

  67. Lotka AJ (1926) The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences 16(12):317–323. http://www.jstor.org/stable/24529203

  68. MacLean AC, Knutson CD (2013) Apache commits: social network dataset. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624020. IEEE Press, Piscataway, MSR ’13, pp 135–138

  69. Madeyski L, Kawalerowicz M (2017) Continuous defect prediction: the idea and a related dataset. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.46. IEEE Press, Piscataway, MSR ’17, pp 515–518

  70. Markovtsev V, Long W (2018) Public Git archive: a big code dataset for all. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196464. ACM, New York, MSR ’18, pp 34–37

  71. Martins P, Achar R, Lopes CV (2018) 50K-C: a dataset of compilable, and compiled, Java projects. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196450. ACM, New York, MSR ’18, pp 1–5

  72. Mauczka A, Brosch F, Schanes C, Grechenig T (2015) Dataset of developer-labeled commit messages. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.71. IEEE Press, Piscataway, MSR ’15, pp 490–493

  73. Merton RK (1968) The Matthew effect in science. Science 159(3810):56–63

  74. Mierle K, Laven K, Roweis S, Wilson G (2005) Mining student CVS repositories for performance indicators. In: Proceedings of the 2nd international workshop on mining software repositories. https://doi.org/10.1145/1082983.1083150. ACM, New York, MSR ’05, pp 1–5

  75. Mitropoulos D, Karakoidas V, Louridas P, Gousios G, Spinellis D (2014) The bug catalog of the Maven ecosystem. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597123. ACM, New York, MSR ’14, pp 372–375

  76. Mukadam M, Bird C, Rigby PC (2013) Gerrit software code review data from android. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624002. IEEE Press, Piscataway, MSR ’13, pp 45–48

  77. Murakami H, Higo Y, Kusumoto S (2014) A dataset of clone references with gaps. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597133. ACM, New York, MSR ’14, pp 412–415

  78. Noten J, Mengerink JGM, Serebrenik A (2017) A data set of OCL expressions on GitHub. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.52. IEEE Press, Piscataway, MSR ’17, pp 531–534

  79. Novielli N, Calefato F, Lanubile F (2018) A gold standard for emotion annotation in Stack Overflow. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196453. ACM, New York, MSR ’18, pp 14–17

  80. Nussbaum L, Zacchiroli S (2010) The ultimate Debian database: consolidating bazaar metadata for quality assurance and data mining. In: Proceedings of the 7th working conference on mining software repositories. https://doi.org/10.1109/MSR.2010.5463277. IEEE Press, Piscataway, MSR ’10, p 10

  81. Ohira M, Kashiwa Y, Yamatani Y, Yoshiyuki H, Maeda Y, Limsettho N, Fujino K, Hata H, Ihara A, Matsumoto K (2015) A dataset of high impact bugs: manually-classified issue reports. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.78. IEEE Press, Piscataway, MSR ’15, pp 518–521

  82. Ortu M, Murgia A, Destefanis G, Tourani P, Tonelli R, Marchesi M, Adams B (2016) The emotional side of software developers in JIRA. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903505. ACM, New York, MSR ’16, pp 480–483

  83. Paixao M, Krinke J, Han D, Harman M (2018) CROP: linking code reviews to source code changes. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196466. ACM, New York, MSR ’18, pp 46–49

  84. Palomba F, Nucci DD, Tufano M, Bavota G, Oliveto R, Poshyvanyk D, De Lucia A (2015) Landfill: an open dataset of code smells with public evaluation. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.69. IEEE Press, Piscataway, MSR ’15, pp 482–485

  85. Passos L, Czarnecki K (2014) A dataset of feature additions and feature removals from the Linux kernel. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597124. ACM, New York, MSR ’14, pp 376–379

  86. Petersen K, Feldt R, Mujtaba S, Mattsson M (2008) Systematic mapping studies in software engineering. In: Proceedings of the 12th international conference on evaluation and assessment in software engineering. http://dl.acm.org/citation.cfm?id=2227115.2227123. BCS Learning & Development Ltd., Swindon, EASE ’08, pp 68–77

  87. Petersen K, Vakkalanka S, Kuzniarz L (2015) Guidelines for conducting systematic mapping studies in software engineering: An update. Information and Software Technology 64:1–18. https://doi.org/10.1016/j.infsof.2015.03.007

  88. Pfleeger SL, Kitchenham BA (2001) Principles of survey research: Part 1: turning lemons into lemonade. SIGSOFT Softw Eng Notes 26(6):16–18. https://doi.org/10.1145/505532.505535

  89. Piezunka H, Dahlander L (2015) Distant search, narrow attention: how crowding alters organizations’ filtering of suggestions in crowdsourcing. Academy of Management Journal 58(3):856–880. https://doi.org/10.5465/amj.2012.0458

  90. Ponzanelli L, Mocci A, Lanza M (2015) StORMeD: stack overflow ready made data. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.67. IEEE Press, Piscataway, MSR ’15, pp 474–477

  91. Proksch S, Amann S, Nadi S, Mezini M (2016) A dataset of simplified syntax trees for C#. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903507. ACM, New York, MSR ’16, pp 476–479

  92. Raemaekers S, Van Deursen A, Visser J (2013) The Maven repository dataset of metrics, changes, and dependencies. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624031. IEEE Press, Piscataway, MSR ’13, pp 221–224

  93. Robles G (2010) Replicating MSR: a study of the potential replicability of papers published in the mining software repositories proceedings. In: Proceedings of the 7th working conference on mining software repositories. https://doi.org/10.1109/MSR.2010.5463348. IEEE Press, Piscataway, MSR ’10, pp 171–180

  94. Robles G, Arjona Reina L, Serebrenik A, Vasilescu B, González-Barahona JM (2014) FLOSS 2013: a survey dataset about free software contributors: Challenges for curating, sharing, and combining. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597129. ACM, New York, MSR ’14, pp 396–399

  95. Robles G, Ho-Quang T, Hebig R, Chaudron MRV, Fernandez MA (2017) An extensive dataset of UML models in GitHub. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.48. IEEE Press, Piscataway, MSR ’17, pp 519–522

  96. Sadat M, Bener AB, Miranskyy AV (2017) Rediscovery datasets: connecting duplicate reports. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.50. IEEE Press, Piscataway, MSR ’17, pp 527–530

  97. Saha RK, Lyu Y, Lam W, Yoshida H, Prasad MR (2018) Bugs.jar: A large-scale, diverse dataset of real-world Java bugs. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196473. ACM, New York, MSR ’18, pp 10–13

  98. Saini V, Sajnani H, Ossher J, Lopes CV (2014) A dataset for Maven artifacts and bug patterns found in them. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597134. ACM, New York, MSR ’14, pp 416–419

  99. Salinger S, Plonka L, Prechelt L (2008) A coding scheme development methodology using Grounded Theory for qualitative analysis of pair programming. Human Technology: An Interdisciplinary Journal on Humans in ICT Environments 4. https://doi.org/10.17011/ht/urn.200804151350

  100. Sawant AA, Bacchelli A (2015) A dataset for API usage. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.75. IEEE Press, Piscataway, MSR ’15, pp 506–509

  101. Sayyad Shirabad J, Menzies T (2005) The PROMISE repository of software engineering databases, School of Information Technology and Engineering, University of Ottawa, Canada, http://promise.site.uottawa.ca/SERepository

  102. Schermann G, Zumberi S, Cito J (2018) Structured information on state and evolution of Dockerfiles on GitHub. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196456. ACM, New York, MSR ’18, pp 26–29

  103. Shull FJ, Carver JC, Vegas S, Juristo N (2008) The role of replications in Empirical Software Engineering. Empirical Software Engineering 13(2):211–218. https://doi.org/10.1007/s10664-008-9060-1

  104. Sigelman L (1981) Question-order effects on presidential popularity. Public Opinion Quarterly 45(2):199–207. https://academic.oup.com/poq/article-pdf/45/2/199/5432386/45-2-199.pdf

  105. Spacco J, Strecker J, Hovemeyer D, Pugh W (2005) Software repository mining with Marmoset: an automated programming project snapshot and testing system. In: Proceedings of the 2nd international workshop on mining software repositories. https://doi.org/10.1145/1082983.1083149. ACM, New York, MSR ’05, pp 1–5

  106. Spinellis D (2015) A repository with 44 years of Unix evolution. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.64. IEEE Press, Piscataway, MSR ’15, pp 462–465

  107. Spinellis D (2018) Documented Unix facilities over 48 years. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196476. ACM, New York, MSR ’18, pp 58–61

  108. Squire M (2013a) Apache-affiliated twitter screen names: a dataset. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624043. IEEE Press, Piscataway, MSR ’13, pp 305–308

  109. Squire M (2013b) Project roles in the Apache software foundation: a dataset. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624042. IEEE Press, Piscataway, MSR ’13, pp 301–304

  110. Squire M (2016) Data sets: the circle of life in Ruby hosting, 2003–2015. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903509. ACM, New York, MSR ’16, pp 452–459

  111. Squire M (2018) Data sets describing the circle of life in Ruby hosting, 2003–2016. Empirical Software Engineering 23(2):1123–1152. https://doi.org/10.1007/s10664-017-9581-6

  112. Trockman A, Zhou S, Kästner C, Vasilescu B (2018) Adding sparkle to social coding: an empirical study of repository badges in the npm ecosystem. In: Proceedings of the 40th international conference on software engineering. https://doi.org/10.1145/3180155.3180209. ACM, New York, ICSE ’18, pp 511–522

  113. Vasilescu B, Serebrenik A, Mens T (2013) A historical dataset of software engineering conferences. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624051. IEEE Press, Piscataway, MSR ’13, pp 373–376

  114. Vasilescu B, Serebrenik A, Filkov V (2015) A data set for social diversity studies of GitHub teams. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.77. IEEE Press, Piscataway, MSR ’15, pp 514–517

  115. Wagstrom P, Jergensen C, Sarma A (2013) A network of rails: a graph dataset of Ruby on Rails and associated projects. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624033. IEEE Press, Piscataway, MSR ’13, pp 229–232

  116. Wallace D (1998) Enhancing competitiveness via a public fault and failure data repository. In: Proceedings of the third IEEE international high-assurance systems engineering symposium. https://doi.org/10.1109/HASE.1998.731610, pp 178–185

  117. Webster J, Watson RT (2002) Analyzing the past to prepare for the future: writing a literature review. MIS Quarterly 26(2):xiii–xxiii

  118. Wermelinger M, Yu Y (2015) An architectural evolution dataset. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.74. IEEE Press, Piscataway, MSR ’15, pp 502–505

  119. Williams JR, Di Ruscio D, Matragkas N, Di Rocco J, Kolovos DS (2014) Models of OSS project meta-information: a dataset of three forges. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597132. ACM, New York, MSR ’14, pp 408–411

  120. Wong WE, Tse T, Glass RL, Basili VR, Chen TY (2011) An assessment of systems and software engineering scholars and institutions (2003–2007 and 2004–2008). Journal of Systems and Software 84(1):162–168. https://doi.org/10.1016/j.jss.2010.09.036

  121. Xu Y, Zhou M (2018) A multi-level dataset of Linux kernel patchwork. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196475. ACM, New York, MSR ’18, pp 54–57

  122. Yamashita A, Abtahizadeh SA, Khomh F, Guéhéneuc YG (2017) Software evolution and quality data from controlled, multiple, industrial case studies. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.44. IEEE Press, Piscataway, MSR ’17, pp 507–510

  123. Yamashita A, Petrillo F, Khomh F, Guéhéneuc YG (2018) Developer interaction traces backed by IDE screen recordings from Think aloud sessions. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196457. ACM, New York, MSR ’18, pp 50–53

  124. Yang X, Kula RG, Yoshida N, Iida H (2016) Mining the modern code review repositories: a dataset of people, process and product. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903504. ACM, New York, MSR ’16, pp 460–463

  125. Yu Y, Li Z, Yin G, Wang T, Wang H (2018) A dataset of duplicate pull-requests in GitHub. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196455. ACM, New York, MSR ’18, pp 22–25

  126. Zacchiroli S (2015) The Debsources dataset: two decades of Debian source code metadata. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.65. IEEE Press, Piscataway, MSR ’15, pp 466–469

  127. Zhang C, Hindle A (2014) A green miner’s dataset: mining the impact of software change on energy consumption. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597130. ACM, New York, MSR ’14, pp 400–403

  128. Zhu C, Li Y, Rubin J, Chechik M (2017) A dataset for dynamic discovery of semantic changes in version controlled software histories. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.49. IEEE Press, Piscataway, MSR ’17, pp 523–526

  129. Zhu J, Zhou M, Mei H (2016) Multi-extract and multi-level dataset of Mozilla issue tracking history. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903502. ACM, New York, MSR ’16, pp 472–475

  130. Zimmermann T, Di Penta M, Kim S (eds) (2013a) Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, IEEE Computer Society, http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6597024

  131. Zimmermann T, Di Penta M, Kim S, German DM, Bacchelli A (2013b) Welcome from the chairs. In: Proceedings of the 10th working conference on mining software repositories, MSR ’13. https://doi.org/10.1109/MSR.2013.6623995, pp iii–viii

  132. Zogaan W, Sharma P, Mirakhorli M, Arnaoudova V (2017) Datasets from fifteen years of automated requirements traceability research: current state, characteristics, and quality. In: Proceedings of the 25th international requirements engineering conference, IEEE, RE ’17. https://doi.org/10.1109/RE.2017.80, pp 110–121

Acknowledgements

Panos Louridas provided insightful comments on this manuscript. Furthermore, Georgios Gousios’s suggestions for refining the questionnaire were crucial to conducting the survey. This work has received funding from: the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825328; the GSRT 2016–2017 Research Support (EP-2844-01); and the Research Centre of the Athens University of Economics and Business, under the Original Scientific Publications framework 2019.

Author information

Corresponding author

Correspondence to Zoe Kotti.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by: Yasutaka Kamei and Andy Zaidman

About this article

Cite this article

Kotti, Z., Kravvaritis, K., Dritsa, K. et al. Standing on shoulders or feet? An extended study on the usage of the MSR data papers. Empir Software Eng (2020). https://doi.org/10.1007/s10664-020-09834-7

Keywords

  • Software engineering data
  • Bibliometrics
  • Survey study
  • Mining software repositories
  • Data paper
  • Reproducibility