Standing on shoulders or feet? An extended study on the usage of the MSR data papers

Published in: Empirical Software Engineering

Abstract

The establishment of the Mining Software Repositories (MSR) data showcase conference track has encouraged researchers to provide data sets as a basis for further empirical studies. The objective of this study is to examine the usage of data papers published in the MSR proceedings in terms of use frequency, users, and use purpose. Data track papers were collected from the MSR data showcase track and through manual inspection of older MSR proceedings. The use of data papers was established through manual citation searching, followed by reading the citing studies and dividing them into strong and weak citations. Unlike weak citations, strong citations actually use the data set of a data paper. Data papers were then manually clustered based on their content, whereas their strong citations were classified by hand according to the knowledge areas of the Guide to the Software Engineering Body of Knowledge. A survey of 108 authors and users of data papers provided further insights regarding motivation and effort in data paper production, encouraging and discouraging factors in data set use, and desired future directions for data papers. We found that 65% of the data papers have been used in other studies, with a long-tail distribution in the number of strong citations. Weak citations to data papers usually refer to them as an example. In total, MSR data papers are cited less than other MSR papers. A considerable number of the strong citations stem from the teams that authored the data papers. Publications providing Version Control System (VCS) primary and derived data are the most frequent data papers and the most often strongly cited ones. Enhanced developer data papers are the least common ones, and the second least frequently strongly cited. Data paper authors tend to gather data in the context of other research. Users of data sets appreciate high data quality and are discouraged by a lack of replicability of data set construction. Respondents suggested data related to machine learning and data derived from the manufacturing sector as topics for future data papers. Overall, data papers have provided the foundation for a significant number of studies, but there is room for improvement in their utilization. This can be achieved by setting a higher bar for their publication, by encouraging their use, by promoting open science initiatives, and by providing incentives for the enrichment of existing data collections.
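The usage measure described in the abstract (the share of data papers with at least one strong citation, and the ranking of papers by strong citation count) can be illustrated with a minimal Python sketch. The records, paper identifiers, and function names below are hypothetical placeholders chosen for illustration; they are not the study's data or tooling, and the study classified citations manually rather than programmatically.

```python
from collections import Counter

# Hypothetical toy records: (data_paper_id, citation_kind).
# These are illustrative values only, not data from the study.
citations = [
    ("paper_a", "strong"),
    ("paper_a", "strong"),
    ("paper_a", "weak"),
    ("paper_b", "weak"),
    ("paper_c", "strong"),
    ("paper_d", "weak"),
]
# Every data paper under study, including ones that were never cited.
data_papers = ["paper_a", "paper_b", "paper_c", "paper_d", "paper_e"]


def strong_citation_counts(papers, records):
    """Count strong citations per data paper; uncited papers get 0."""
    counts = Counter(pid for pid, kind in records if kind == "strong")
    return {pid: counts.get(pid, 0) for pid in papers}


def usage_share(counts):
    """Fraction of data papers with at least one strong citation."""
    used = sum(1 for n in counts.values() if n > 0)
    return used / len(counts) if counts else 0.0


counts = strong_citation_counts(data_papers, citations)
share = usage_share(counts)
# Sorting by descending count exposes the shape of the distribution.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

In this toy input, only two of the five papers have a strong citation, so weak citations alone do not count as use of the data set.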

Notes

  1. 1968 ACM Turing Award Lecture (Hamming 1969)

  2. https://github.com/acmsigsoft/artifact-evaluation

  3. https://github.com/dspinellis/awesome-msr

  4. https://doi.org/10.5281/zenodo.3709219

  5. https://www.webofknowledge.com

  6. https://scholar.google.com/

  7. https://www.scopus.com/

  8. https://dl.acm.org/

  9. https://dblp.org/

  10. https://arxiv.org/corr

  11. https://ghtorrent.org

  12. https://github.com/ghtorrent/ghtorrent.org

References

  • Aivaloglou E, Hermans F, Moreno-León J, Robles G (2017) A dataset of scratch programs: scraped, shaped and scored. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.45. IEEE Press, Piscataway, MSR ’17, pp 511–514

  • Allix K, Bissyandé TF, Klein J, Le Traon Y (2016) Androzoo: collecting millions of android apps for the research community. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903508. ACM, New York, MSR ’16, pp 468–471

  • Almakadmeh M, Abran A (2017) The ISBSG software project repository: an analysis from six sigma measurement perspective for software defect estimation. Journal of Software Engineering and Applications 10(8):693–720. https://doi.org/10.4236/jsea.2017.108038

  • Altinger H, Siegl S, Dajsuren Y, Wotawa F (2015) A novel industry grade dataset for fault prediction based on model-driven developed automotive embedded software. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.72. IEEE Press, Piscataway, MSR ’15, pp 494–497

  • Amann S, Nadi S, Nguyen HA, Nguyen TN, Mezini M (2016) MUBench: a benchmark for API-misuse detectors. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903506. ACM, New York, MSR ’16, pp 464–467

  • Baldassari B, Preux P (2014) Understanding software evolution: the Maisqual Ant data set. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597136. ACM, New York, MSR ’14, pp 424–427

  • Barik T, Lubick K, Smith J, Slankas J, Murphy-Hill E (2015) FUSE: a reproducible, extendable, internet-scale corpus of spreadsheets. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.70. IEEE Press, Piscataway, MSR ’15, pp 486–489

  • Binkley D, Lawrie D, Pollock L, Hill E, Vijay-Shanker K (2013) A dataset for evaluating identifier splitters. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624055. IEEE Press, Piscataway, MSR ’13, pp 401–404

  • Bloemen R, Amrit C, Kuhlmann S, Ordóñez Matamoros G (2014) Gentoo package dependencies over time. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597131. ACM, New York, MSR ’14, pp 404–407

  • Boisvert RF (2016) Incentivizing reproducibility. Commun ACM 59(10):5. https://doi.org/10.1145/2994031

  • Bourque P, Fair RE (eds) (2014) Guide to the Software Engineering Body of Knowledge, version 3.0 edn. IEEE Computer Society, New York, http://www.swebok.org

  • Bradford SC (1985) Sources of information on specific subjects 1934. Journal of Information Science 10(4):176–180. https://doi.org/10.1177/016555158501000407

  • Brereton P, Kitchenham BA, Budgen D, Turner M, Khalil M (2007) Lessons from applying the systematic literature review process within the software engineering domain. J Syst Softw 80(4):571–583. https://doi.org/10.1016/j.jss.2006.07.009

  • Butler S, Wermelinger M, Yu Y, Sharp H (2013) INVocD: identifier name vocabulary dataset. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624056. IEEE Press, Piscataway, MSR ’13, pp 405–408

  • Chametzky B (2016) Coding in classic grounded theory: I’ve done an interview; now what? Sociology Mind 06:163–172. https://doi.org/10.4236/sm.2016.64014

  • Chatzidimitriou KC, Papamichail MD, Diamantopoulos T, Tsapanos M, Symeonidis AL (2018) npm-miner: an infrastructure for measuring the quality of the npm registry. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196465. ACM, New York, MSR ’18, pp 42–45

  • Cheikhi L, Abran A (2013) PROMISE and ISBSG software engineering data repositories: a survey. In: Proceedings of the joint conference of the 23rd international workshop on software measurement and the 8th international conference on software process and product measurement. https://doi.org/10.1109/IWSM-Mensura.2013.13. IEEE Press, Piscataway, IWSM-Mensura ’13, pp 17–24

  • Conklin M, Howison J, Crowston K (2005) Collaboration using OSSmole: a repository of FLOSS data and analyses. In: Proceedings of the 2nd international workshop on mining software repositories. https://doi.org/10.1145/1082983.1083164. ACM, New York, MSR ’05, pp 1–5

  • Cukic B (2005) Guest editor’s introduction: the promise of public software engineering data repositories. IEEE Software 22(6):20–22. https://doi.org/10.1109/MS.2005.153

  • Dit B, Holtzhauer A, Poshyvanyk D, Kagdi H (2013) A dataset from change history to support evaluation of software maintenance tasks. In: Proceedings of the 10th working conference on mining software repositories. IEEE Press, Piscataway, MSR ’13, pp 131–134. https://doi.org/10.1109/MSR.2013.6624019

  • Efstathiou V, Chatzilenas C, Spinellis D (2018) Word embeddings for the software engineering domain. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196448. ACM, New York, MSR ’18, pp 38–41

  • Farah G, Tejada JS, Correal D (2014) OpenHub: a scalable architecture for the analysis of software quality attributes. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597135. ACM, New York, MSR ’14, pp 420–423

  • Ferenc R, Tóth Z, Ladányi G, Siket I, Gyimóthy T (2018) A public unified bug dataset for Java. In: Proceedings of the 14th international conference on predictive models and data analytics in software engineering. https://doi.org/10.1145/3273934.3273936. ACM, New York, PROMISE ’18, pp 12–21

  • de Freitas FG, de Souza JT (2011) Ten years of search based software engineering: a bibliometric analysis. In: Cohen MB, Ó Cinnéide M (eds) Proceedings of the 3rd international symposium on search based software engineering. https://link.springer.com/chapter/10.1007/978-3-642-23716-4_5. Springer, Berlin, SSBSE ’11, pp 18–32

  • Fujiwara K, Hata H, Makihara E, Fujihara Y, Nakayama N, Iida H, Matsumoto K (2014) Kataribe: a hosting service of historage repositories. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597125. ACM, New York, MSR ’14, pp 380–383

  • Furnham A (1986) Response bias, social desirability and dissimulation. Personality and Individual Differences 7(3):385–400. https://doi.org/10.1016/0191-8869(86)90014-0

  • Gao J, Yang X, Jiang Y, Liu H, Ying W, Zhang X (2018) JBench: a dataset of data races for concurrency testing. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196451. ACM, New York, MSR ’18, pp 6–9

  • Geiger FX, Malavolta I, Pascarella L, Palomba F, Di Nucci D, Bacchelli A (2018) A graph-based dataset of commit history of real-world android apps. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196460. ACM, New York, MSR ’18, pp 30–33

  • German DM, Adams B, Hassan AE (2015) A dataset of the activity of the Git super-repository of Linux in 2012. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.66. IEEE Press, Piscataway, MSR ’15, pp 470–473

  • Gkortzis A, Mitropoulos D, Spinellis D (2018) VulinOSS: a dataset of security vulnerabilities in open-source systems. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196454. ACM, New York, MSR ’18, pp 18–21

  • Glaser B, Strauss A (1967) The discovery of grounded theory: strategies for qualitative research. Observations (Chicago, Ill.). Aldine Publishing

  • Glass RL (1994) An assessment of systems and software engineering scholars and institutions. Journal of Systems and Software 27(1):63–67. https://doi.org/10.1016/0164-1212(94)90115-5

  • Goeminne M, Claes M, Mens T (2013) A historical dataset for the Gnome ecosystem. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624032. IEEE Press, Piscataway, MSR ’13, pp 225–228

  • Gonzalez-Barahona JM, Robles G, Izquierdo-Cortazar D (2015) The MetricsGrimoire database collection. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.68. IEEE Press, Piscataway, MSR ’15, pp 478–481

  • Gousios G (2013) The GHTorrent dataset and tool suite. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624034. IEEE Press, Piscataway, MSR ’13, pp 233–236

  • Gousios G, Zaidman A (2014) A dataset for pull-based development research. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597122. ACM, New York, MSR ’14, pp 368–371

  • Gousios G, Vasilescu B, Serebrenik A, Zaidman A (2014) Lean GHTorrent: GitHub data on demand. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597126. ACM, New York, MSR ’14, pp 384–387

  • Gousios G, Storey MA, Bacchelli A (2016) Work practices and challenges in pull-based development: the contributor’s perspective. In: Proceedings of the 38th international conference on software engineering. https://doi.org/10.1145/2884781.2884826. Association for Computing Machinery, New York, ICSE ’16, pp 285–296

  • Gu Y (2004) Global knowledge management research: a bibliometric analysis. Scientometrics 61(2):171–190. https://doi.org/10.1023/B:SCIE.0000041647.01086.f4

  • Habayeb M, Miranskyy A, Murtaza SS, Buchanan L, Bener AB (2015) The Firefox temporal defect dataset. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.73. IEEE Press, Piscataway, MSR ’15, pp 498–501

  • Hamasaki K, Kula RG, Yoshida N, Cruz AEC, Fujiwara K, Iida H (2013) Who does what during a code review? Datasets of OSS peer review repositories. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624003. IEEE Press, Piscataway, MSR ’13, pp 49–52

  • Hamming R (1969) One man’s view of computer science. Journal of the ACM 16(1):3–12. https://doi.org/10.1145/321495.321497

  • Hardwicke TE, Ioannidis JP (2018) Mapping the universe of registered reports. Nature Human Behaviour 2(11):793–796

  • Harman M, Mansouri SA, Zhang Y (2009) Search based software engineering: a comprehensive analysis and review of trends techniques and applications. Tech. Rep. TR-09-03, Department of Computer Science, King’s College London, and Brunel Business School, Brunel University, London, UK, https://www.researchgate.net/profile/Yuanyuan_Zhang12/publication/228671024_Search_Based_Software_Engineering_A_Comprehensive_Analysis_and_Review_of_Trends_Techniques_and_Applications/links/00b4951811ba6a40eb000000.pdf

  • Janjic W, Hummel O, Schumacher M, Atkinson C (2013) An unabridged source code dataset for research in software reuse. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624047. IEEE Press, Piscataway, MSR ’13, pp 339–342

  • Karakoidas V, Mitropoulos D, Louridas P, Gousios G, Spinellis D (2015) Generating the blueprints of the Java ecosystem. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.76. IEEE Press, Piscataway, MSR ’15, pp 510–513

  • Keivanloo I, Forbes C, Hmood A, Erfani M, Neal C, Peristerakis G, Rilling J (2012) A linked data platform for mining software repositories. In: Proceedings of the 9th working conference on mining software repositories. https://doi.org/10.1109/MSR.2012.6224296. IEEE Press, Piscataway, MSR ’12, pp 32–35

  • Kim S, Zimmermann T, Kim M, Hassan A, Mockus A, Girba T, Pinzger M, Whitehead EJ Jr, Zeller A (2006) TA-RE: an exchange language for mining software repositories. In: Proceedings of the 3rd international workshop on mining software repositories. https://doi.org/10.1145/1137983.1137990. ACM, New York, MSR ’06, pp 22–25

  • Kitchenham B (2004) Procedures for performing systematic reviews. Tech. Rep. TR/SE-0401, Department of Computer Science, Keele University, Keele, Staffs, UK, http://www.it.hiof.no/!haraldh/misc/2016-08-22-smat/Kitchenham-Systematic-Review-2004.pdf

  • Kitchenham B, Pfleeger SL (2003) Principles of survey research: Part 6: data analysis. SIGSOFT Softw Eng Notes 28(2):24–27. https://doi.org/10.1145/638750.638758

  • Kitchenham BA, Pfleeger SL (2002a) Principles of survey research: Part 2: designing a survey. SIGSOFT Softw Eng Notes 27(1):18–20. https://doi.org/10.1145/566493.566495

  • Kitchenham BA, Pfleeger SL (2002b) Principles of survey research: Part 3: constructing a survey instrument. SIGSOFT Softw Eng Notes 27(2):20–24. https://doi.org/10.1145/511152.511155

  • Kitchenham BA, Pfleeger SL (2002c) Principles of survey research: Part 4: questionnaire evaluation. SIGSOFT Softw Eng Notes 27(3):20–23. https://doi.org/10.1145/638574.638580

  • Kitchenham BA, Pfleeger SL (2002d) Principles of survey research: Part 5: populations and samples. SIGSOFT Softw Eng Notes 27(5):17–20. https://doi.org/10.1145/571681.571686

  • Kitchenham BA, Pfleeger SL, Pickard LM, Jones PW, Hoaglin DC, Emam KE, Rosenberg J (2002) Preliminary guidelines for empirical research in software engineering. IEEE Trans Softw Eng 28(8):721–734. https://doi.org/10.1109/TSE.2002.1027796

  • Kotti Z, Spinellis D (2019) Standing on shoulders or feet?: The usage of the MSR data papers. In: Proceedings of the 16th international conference on mining software repositories. https://doi.org/10.1109/MSR.2019.00085. IEEE Press, Piscataway, MSR ’19, pp 565–576

  • von Krogh G, von Hippel E (2006) The promise of research on open source software. Management Science 52(7):975–983. https://doi.org/10.1287/mnsc.1060.0560

  • Krüger S, Späth J, Ali K, Bodden E, Mezini M (2018) CrySL: an extensible approach to validating the correct usage of cryptographic APIs. In: Millstein T (ed) Proceedings of the 32nd European conference on object-oriented programming. https://doi.org/10.4230/LIPIcs.ECOOP.2018.10, vol 109. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, ECOOP ’18, pp 10:1–10:27

  • Krüger S, Späth J, Ali K, Bodden E, Mezini M (2019) CrySL: an extensible approach to validating the correct usage of cryptographic APIs. IEEE Transactions on Software Engineering. https://doi.org/10.1109/TSE.2019.2948910

  • Krutz DE, Le W (2014) A code clone oracle. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597127. ACM, New York, MSR ’14, pp 388–391

  • Krutz DE, Mirakhorli M, Malachowsky SA, Ruiz A, Peterson J, Filipski A, Smith J (2015) A dataset of open-source android applications. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.79. IEEE Press, Piscataway, MSR ’15, pp 522–525

  • Kupferschmidt K (2018) More and more scientists are preregistering their studies. should you? Science. https://doi.org/10.1126/science.aav4786

  • Lamkanfi A, Pérez J, Demeyer S (2013) The Eclipse and Mozilla defect tracking dataset: a genuine dataset for mining bug information. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624028. IEEE Press, Piscataway, MSR ’13, pp 203–206

  • Lavazza L, Santillo L (2012) Historical data repositories in software engineering: status and possible improvements. In: Proceedings of the 2012 joint conference of the 22nd international workshop on software measurement and the 2012 seventh international conference on software process and product measurement. https://doi.org/10.1109/IWSM-MENSURA.2012.39, pp 221–225

  • Lazar A, Ritchey S, Sharif B (2014) Generating duplicate bug datasets. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597128. ACM, New York, MSR ’14, pp 392–395

  • Liebchen GA, Shepperd M (2008) Data sets and data quality in software engineering. In: Proceedings of the 4th international workshop on predictor models in software engineering. https://doi.org/10.1145/1370788.1370799. ACM, New York, PROMISE ’08, pp 39–44

  • Lotka AJ (1926) The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences 16(12):317–323. http://www.jstor.org/stable/24529203

  • MacLean AC, Knutson CD (2013) Apache commits: social network dataset. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624020. IEEE Press, Piscataway, MSR ’13, pp 135–138

  • Madeyski L, Kawalerowicz M (2017) Continuous defect prediction: the idea and a related dataset. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.46. IEEE Press, Piscataway, MSR ’17, pp 515–518

  • Markovtsev V, Long W (2018) Public Git archive: a big code dataset for all. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196464. ACM, New York, MSR ’18, pp 34–37

  • Martins P, Achar R, Lopes CV (2018) 50K-C: a dataset of compilable, and compiled, Java projects. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196450. ACM, New York, MSR ’18, pp 1–5

  • Mauczka A, Brosch F, Schanes C, Grechenig T (2015) Dataset of developer-labeled commit messages. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.71. IEEE Press, Piscataway, MSR ’15, pp 490–493

  • Merton RK (1968) The Matthew effect in science. Science 159(3810):56–63

  • Mierle K, Laven K, Roweis S, Wilson G (2005) Mining student CVS repositories for performance indicators. In: Proceedings of the 2nd international workshop on mining software repositories. https://doi.org/10.1145/1082983.1083150. ACM, New York, MSR ’05, pp 1–5

  • Mitropoulos D, Karakoidas V, Louridas P, Gousios G, Spinellis D (2014) The bug catalog of the Maven ecosystem. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597123. ACM, New York, MSR ’14, pp 372–375

  • Mukadam M, Bird C, Rigby PC (2013) Gerrit software code review data from android. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624002. IEEE Press, Piscataway, MSR ’13, pp 45–48

  • Murakami H, Higo Y, Kusumoto S (2014) A dataset of clone references with gaps. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597133. ACM, New York, MSR ’14, pp 412–415

  • Noten J, Mengerink JGM, Serebrenik A (2017) A data set of OCL expressions on GitHub. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.52. IEEE Press, Piscataway, MSR ’17, pp 531–534

  • Novielli N, Calefato F, Lanubile F (2018) A gold standard for emotion annotation in Stack Overflow. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196453. ACM, New York, MSR ’18, pp 14–17

  • Nussbaum L, Zacchiroli S (2010) The ultimate debian database: consolidating bazaar metadata for quality assurance and data mining. In: Proceedings of the 7th working conference on mining software repositories. https://doi.org/10.1109/MSR.2010.5463277. IEEE Press, Piscataway, MSR ’10, p 10

  • Ohira M, Kashiwa Y, Yamatani Y, Yoshiyuki H, Maeda Y, Limsettho N, Fujino K, Hata H, Ihara A, Matsumoto K (2015) A dataset of high impact bugs: manually-classified issue reports. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.78. IEEE Press, Piscataway, MSR ’15, pp 518–521

  • Ortu M, Murgia A, Destefanis G, Tourani P, Tonelli R, Marchesi M, Adams B (2016) The emotional side of software developers in JIRA. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903505. ACM, New York, MSR ’16, pp 480–483

  • Paixao M, Krinke J, Han D, Harman M (2018) CROP: linking code reviews to source code changes. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196466. ACM, New York, MSR ’18, pp 46–49

  • Palomba F, Nucci DD, Tufano M, Bavota G, Oliveto R, Poshyvanyk D, De Lucia A (2015) Landfill: an open dataset of code smells with public evaluation. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.69. IEEE Press, Piscataway, MSR ’15, pp 482–485

  • Passos L, Czarnecki K (2014) A dataset of feature additions and feature removals from the Linux kernel. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597124. ACM, New York, MSR ’14, pp 376–379

  • Petersen K, Feldt R, Mujtaba S, Mattsson M (2008) Systematic mapping studies in software engineering. In: Proceedings of the 12th international conference on evaluation and assessment in software engineering. http://dl.acm.org/citation.cfm?id=2227115.2227123. BCS Learning & Development Ltd., Swindon, EASE ’08, pp 68–77

  • Petersen K, Vakkalanka S, Kuzniarz L (2015) Guidelines for conducting systematic mapping studies in software engineering: An update. Information and Software Technology 64:1–18. https://doi.org/10.1016/j.infsof.2015.03.007

  • Pfleeger SL, Kitchenham BA (2001) Principles of survey research: Part 1: turning lemons into lemonade. SIGSOFT Softw Eng Notes 26(6):16–18. https://doi.org/10.1145/505532.505535

  • Piezunka H, Dahlander L (2015) Distant search, narrow attention: how crowding alters organizations’ filtering of suggestions in crowdsourcing. Academy of Management Journal 58(3):856–880. https://doi.org/10.5465/amj.2012.0458

  • Ponzanelli L, Mocci A, Lanza M (2015) StORMeD: stack overflow ready made data. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.67. IEEE Press, Piscataway, MSR ’15, pp 474–477

  • Proksch S, Amann S, Nadi S, Mezini M (2016) A dataset of simplified syntax trees for C#. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903507. ACM, New York, MSR ’16, pp 476–479

  • Raemaekers S, Van Deursen A, Visser J (2013) The Maven repository dataset of metrics, changes, and dependencies. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624031. IEEE Press, Piscataway, MSR ’13, pp 221–224

  • Robles G (2010) Replicating MSR: a study of the potential replicability of papers published in the mining software repositories proceedings. In: Proceedings of the 7th working conference on mining software repositories. https://doi.org/10.1109/MSR.2010.5463348. IEEE Press, Piscataway, MSR ’10, pp 171–180

  • Robles G, Arjona Reina L, Serebrenik A, Vasilescu B, González-Barahona JM (2014) FLOSS 2013: a survey dataset about free software contributors: Challenges for curating, sharing, and combining. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597129. ACM, New York, MSR ’14, pp 396–399

  • Robles G, Ho-Quang T, Hebig R, Chaudron MRV, Fernandez MA (2017) An extensive dataset of UML models in GitHub. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.48. IEEE Press, Piscataway, MSR ’17, pp 519–522

  • Sadat M, Bener AB, Miranskyy AV (2017) Rediscovery datasets: connecting duplicate reports. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.50. IEEE Press, Piscataway, MSR ’17, pp 527–530

  • Saha RK, Lyu Y, Lam W, Yoshida H, Prasad MR (2018) Bugs.jar: A large-scale, diverse dataset of real-world Java bugs. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196473. ACM, New York, MSR ’18, pp 10–13

  • Saini V, Sajnani H, Ossher J, Lopes CV (2014) A dataset for Maven artifacts and bug patterns found in them. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597134. ACM, New York, MSR ’14, pp 416–419

  • Salinger S, Plonka L, Prechelt L (2008) A coding scheme development methodology using Grounded Theory for qualitative analysis of pair programming. Human Technology: An Interdisciplinary Journal on Humans in ICT Environments 4. https://doi.org/10.17011/ht/urn.200804151350

  • Sawant AA, Bacchelli A (2015) A dataset for API usage. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.75. IEEE Press, Piscataway, MSR ’15, pp 506–509

  • Sayyad Shirabad J, Menzies T (2005) The PROMISE repository of software engineering databases, School of Information Technology and Engineering, University of Ottawa, Canada, http://promise.site.uottawa.ca/SERepository

  • Schermann G, Zumberi S, Cito J (2018) Structured information on state and evolution of Dockerfiles on GitHub. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196456. ACM, New York, MSR ’18, pp 26–29

  • Shull FJ, Carver JC, Vegas S, Juristo N (2008) The role of replications in Empirical Software Engineering. Empirical Software Engineering 13(2):211–218. https://doi.org/10.1007/s10664-008-9060-1

  • Sigelman L (1981) Question-order effects on presidential popularity. Public Opinion Quarterly 45(2):199–207. https://academic.oup.com/poq/article-pdf/45/2/199/5432386/45-2-199.pdf

  • Spacco J, Strecker J, Hovemeyer D, Pugh W (2005) Software repository mining with Marmoset: an automated programming project snapshot and testing system. In: Proceedings of the 2nd international workshop on mining software repositories. https://doi.org/10.1145/1082983.1083149. ACM, New York, MSR ’05, pp 1–5

  • Spinellis D (2015) A repository with 44 years of Unix evolution. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.64. IEEE Press, Piscataway, MSR ’15, pp 462–465

  • Spinellis D (2018) Documented Unix facilities over 48 years. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196476. ACM, New York, MSR ’18, pp 58–61

  • Squire M (2013a) Apache-affiliated twitter screen names: a dataset. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624043. IEEE Press, Piscataway, MSR ’13, pp 305–308

  • Squire M (2013b) Project roles in the Apache software foundation: a dataset. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624042. IEEE Press, Piscataway, MSR ’13, pp 301–304

  • Squire M (2016) Data sets: the circle of life in Ruby hosting, 2003-2015. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903509. ACM, New York, MSR ’16, pp 452–459

  • Squire M (2018) Data sets describing the circle of life in Ruby hosting, 2003–2016. Empirical Software Engineering 23(2):1123–1152. https://doi.org/10.1007/s10664-017-9581-6

  • Trockman A, Zhou S, Kästner C, Vasilescu B (2018) Adding sparkle to social coding: an empirical study of repository badges in the npm ecosystem. In: Proceedings of the 40th international conference on software engineering. https://doi.org/10.1145/3180155.3180209. ACM, New York, ICSE ’18, pp 511–522

  • Vasilescu B, Serebrenik A, Mens T (2013) A historical dataset of software engineering conferences. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624051. IEEE Press, Piscataway, MSR ’13, pp 373–376

  • Vasilescu B, Serebrenik A, Filkov V (2015) A data set for social diversity studies of GitHub teams. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.77. IEEE Press, Piscataway, MSR ’15, pp 514–517

  • Wagstrom P, Jergensen C, Sarma A (2013) A network of rails: a graph dataset of Ruby on rails and associated projects. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624033. IEEE Press, Piscataway, MSR ’13, pp 229–232

  • Wallace D (1998) Enhancing competitiveness via a public fault and failure data repository. In: Proceedings of the third IEEE international high-assurance systems engineering symposium. https://doi.org/10.1109/HASE.1998.731610, pp 178–185

  • Webster J, Watson RT (2002) Analyzing the past to prepare for the future: writing a literature review. MIS Quarterly 26(2):xiii–xxiii

  • Wermelinger M, Yu Y (2015) An architectural evolution dataset. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.74. IEEE Press, Piscataway, MSR ’15, pp 502–505

  • Williams JR, Di Ruscio D, Matragkas N, Di Rocco J, Kolovos DS (2014) Models of OSS project meta-information: a dataset of three forges. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597132. ACM, New York, MSR ’14, pp 408–411

  • Wong WE, Tse T, Glass RL, Basili VR, Chen TY (2011) An assessment of systems and software engineering scholars and institutions (2003–2007 and 2004–2008). Journal of Systems and Software 84(1):162–168. https://doi.org/10.1016/j.jss.2010.09.036

  • Xu Y, Zhou M (2018) A multi-level dataset of Linux kernel patchwork. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196475. ACM, New York, MSR ’18, pp 54–57

  • Yamashita A, Abtahizadeh SA, Khomh F, Guéhéneuc YG (2017) Software evolution and quality data from controlled, multiple, industrial case studies. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.44. IEEE Press, Piscataway, MSR ’17, pp 507–510

  • Yamashita A, Petrillo F, Khomh F, Guéhéneuc YG (2018) Developer interaction traces backed by IDE screen recordings from Think aloud sessions. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196457. ACM, New York, MSR ’18, pp 50–53

  • Yang X, Kula RG, Yoshida N, Iida H (2016) Mining the modern code review repositories: a dataset of people, process and product. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903504. ACM, New York, MSR ’16, pp 460–463

  • Yu Y, Li Z, Yin G, Wang T, Wang H (2018) A dataset of duplicate pull-requests in GitHub. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196455. ACM, New York, MSR ’18, pp 22–25

  • Zacchiroli S (2015) The Debsources dataset: two decades of Debian source code metadata. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.65. IEEE Press, Piscataway, MSR ’15, pp 466–469

  • Zhang C, Hindle A (2014) A green miner’s dataset: mining the impact of software change on energy consumption. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597130. ACM, New York, MSR ’14, pp 400–403

  • Zhu C, Li Y, Rubin J, Chechik M (2017) A dataset for dynamic discovery of semantic changes in version controlled software histories. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.49. IEEE Press, Piscataway, MSR ’17, pp 523–526

  • Zhu J, Zhou M, Mei H (2016) Multi-extract and multi-level dataset of Mozilla issue tracking history. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903502. ACM, New York, MSR ’16, pp 472–475

  • Zimmermann T, Di Penta M, Kim S (eds) (2013a) Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, IEEE Computer Society, http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6597024

  • Zimmermann T, Di Penta M, Kim S, German DM, Bacchelli A (2013b) Welcome from the chairs. In: Proceedings of the 10th working conference on mining software repositories, MSR ’13. https://doi.org/10.1109/MSR.2013.6623995, pp iii–viii

  • Zogaan W, Sharma P, Mirakhorli M, Arnaoudova V (2017) Datasets from fifteen years of automated requirements traceability research: current state, characteristics, and quality. In: Proceedings of the 25th international requirements engineering conference, IEEE, RE ’17. https://doi.org/10.1109/RE.2017.80, pp 110–121

Acknowledgements

Panos Louridas provided insightful comments on this manuscript. Furthermore, Georgios Gousios’s suggestions for refining the questionnaire were crucial to carrying out the survey. This work has received funding from: the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825328; the GSRT 2016–2017 Research Support (EP-2844-01); and the Research Centre of the Athens University of Economics and Business, under the Original Scientific Publications framework 2019.

Author information

Corresponding author

Correspondence to Zoe Kotti.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Yasutaka Kamei and Andy Zaidman

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kotti, Z., Kravvaritis, K., Dritsa, K. et al. Standing on shoulders or feet? An extended study on the usage of the MSR data papers. Empir Software Eng 25, 3288–3322 (2020). https://doi.org/10.1007/s10664-020-09834-7

  • DOI: https://doi.org/10.1007/s10664-020-09834-7
