Abstract
The establishment of the Mining Software Repositories (MSR) data showcase conference track has encouraged researchers to provide data sets as a basis for further empirical studies. The objective of this study is to examine the usage of data papers published in the MSR proceedings in terms of use frequency, users, and use purpose. Data track papers were collected from the MSR data showcase track and through the manual inspection of older MSR proceedings. The use of data papers was established through manual citation searching, followed by reading the citing studies and dividing their citations into strong and weak ones. Unlike weak citations, strong citations actually use the data set of a data paper. Data papers were then manually clustered based on their content, whereas their strong citations were classified by hand according to the knowledge areas of the Guide to the Software Engineering Body of Knowledge. A survey study of 108 authors and users of data papers provided further insights regarding motivation and effort in data paper production, encouraging and discouraging factors in data set use, and desired future directions for data papers. We found that 65% of the data papers have been used in other studies, with a long-tail distribution in the number of strong citations. Weak citations to data papers usually refer to them as an example. In total, MSR data papers are cited less than other MSR papers. A considerable number of the strong citations stem from the teams that authored the data papers. Publications providing Version Control System (VCS) primary and derived data are the most frequent data papers and the most often strongly cited ones. Enhanced developer data papers are the least common ones, and the second least frequently strongly cited. Data paper authors tend to gather data in the context of other research. Users of data sets appreciate high data quality and are discouraged by lack of replicability of data set construction. Data related to machine learning or derived from the manufacturing sector are two suggestions of the respondents for future data papers. Overall, data papers have provided the foundation for a significant number of studies, but there is room for improvement in their utilization. This can be done by setting a higher bar for their publication, by encouraging their use, by promoting open science initiatives, and by providing incentives for the enrichment of existing data collections.
Notes
1968 ACM Turing Award Lecture (Hamming 1969)
References
Aivaloglou E, Hermans F, Moreno-León J, Robles G (2017) A dataset of Scratch programs: scraped, shaped and scored. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.45. IEEE Press, Piscataway, MSR ’17, pp 511–514
Allix K, Bissyandé TF, Klein J, Le Traon Y (2016) Androzoo: collecting millions of Android apps for the research community. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903508. ACM, New York, MSR ’16, pp 468–471
Almakadmeh M, Abran A (2017) The ISBSG software project repository: an analysis from six sigma measurement perspective for software defect estimation. Journal of Software Engineering and Applications 10(8):693–720. https://doi.org/10.4236/jsea.2017.108038
Altinger H, Siegl S, Dajsuren Y, Wotawa F (2015) A novel industry grade dataset for fault prediction based on model-driven developed automotive embedded software. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.72. IEEE Press, Piscataway, MSR ’15, pp 494–497
Amann S, Nadi S, Nguyen HA, Nguyen TN, Mezini M (2016) MUBench: a benchmark for API-misuse detectors. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903506. ACM, New York, MSR ’16, pp 464–467
Baldassari B, Preux P (2014) Understanding software evolution: the Maisqual Ant data set. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597136. ACM, New York, MSR ’14, pp 424–427
Barik T, Lubick K, Smith J, Slankas J, Murphy-Hill E (2015) FUSE: a reproducible, extendable, internet-scale corpus of spreadsheets. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.70. IEEE Press, Piscataway, MSR ’15, pp 486–489
Binkley D, Lawrie D, Pollock L, Hill E, Vijay-Shanker K (2013) A dataset for evaluating identifier splitters. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624055. IEEE Press, Piscataway, MSR ’13, pp 401–404
Bloemen R, Amrit C, Kuhlmann S, Ordóñez Matamoros G (2014) Gentoo package dependencies over time. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597131. ACM, New York, MSR ’14, pp 404–407
Boisvert RF (2016) Incentivizing reproducibility. Commun ACM 59(10):5–5. https://doi.org/10.1145/2994031
Bourque P, Fair RE (eds) (2014) Guide to the Software Engineering Body of Knowledge, version 3.0 edn. IEEE Computer Society, New York, http://www.swebok.org
Bradford SC (1985) Sources of information on specific subjects 1934. Journal of Information Science 10(4):176–180. https://doi.org/10.1177/016555158501000407
Brereton P, Kitchenham BA, Budgen D, Turner M, Khalil M (2007) Lessons from applying the systematic literature review process within the software engineering domain. J Syst Softw 80(4):571–583. https://doi.org/10.1016/j.jss.2006.07.009
Butler S, Wermelinger M, Yu Y, Sharp H (2013) INVocD: identifier name vocabulary dataset. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624056. IEEE Press, Piscataway, MSR ’13, pp 405–408
Chametzky B (2016) Coding in classic grounded theory: I’ve done an interview; now what? Sociology Mind 06:163–172. https://doi.org/10.4236/sm.2016.64014
Chatzidimitriou KC, Papamichail MD, Diamantopoulos T, Tsapanos M, Symeonidis AL (2018) npm-miner: an infrastructure for measuring the quality of the npm registry. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196465. ACM, New York, MSR ’18, pp 42–45
Cheikhi L, Abran A (2013) PROMISE and ISBSG software engineering data repositories: a survey. In: Proceedings of the joint conference of the 23rd international workshop on software measurement and the 8th international conference on software process and product measurement. https://doi.org/10.1109/IWSM-Mensura.2013.13. IEEE Press, Piscataway, IWSM-Mensura ’13, pp 17–24
Conklin M, Howison J, Crowston K (2005) Collaboration using OSSmole: a repository of FLOSS data and analyses. In: Proceedings of the 2nd international workshop on mining software repositories. https://doi.org/10.1145/1082983.1083164. ACM, New York, MSR ’05, pp 1–5
Cukic B (2005) Guest editor’s introduction: the promise of public software engineering data repositories. IEEE Software 22(6):20–22. https://doi.org/10.1109/MS.2005.153
Dit B, Holtzhauer A, Poshyvanyk D, Kagdi H (2013) A dataset from change history to support evaluation of software maintenance tasks. In: Proceedings of the 10th working conference on mining software repositories. IEEE Press, Piscataway, MSR ’13, pp 131–134. https://doi.org/10.1109/MSR.2013.6624019
Efstathiou V, Chatzilenas C, Spinellis D (2018) Word embeddings for the software engineering domain. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196448. ACM, New York, MSR ’18, pp 38–41
Farah G, Tejada JS, Correal D (2014) OpenHub: a scalable architecture for the analysis of software quality attributes. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597135. ACM, New York, MSR ’14, pp 420–423
Ferenc R, Tóth Z, Ladányi G, Siket I, Gyimóthy T (2018) A public unified bug dataset for Java. In: Proceedings of the 14th international conference on predictive models and data analytics in software engineering. https://doi.org/10.1145/3273934.3273936. ACM, New York, PROMISE ’18, pp 12–21
de Freitas FG, de Souza JT (2011) Ten years of search based software engineering: a bibliometric analysis. In: Cohen MB, Ó Cinnéide M (eds) Proceedings of the 3rd international symposium on search based software engineering. https://link.springer.com/chapter/10.1007/978-3-642-23716-4_5. Springer, Berlin, SSBSE ’11, pp 18–32
Fujiwara K, Hata H, Makihara E, Fujihara Y, Nakayama N, Iida H, Matsumoto K (2014) Kataribe: a hosting service of historage repositories. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597125. ACM, New York, MSR ’14, pp 380–383
Furnham A (1986) Response bias, social desirability and dissimulation. Personality and Individual Differences 7(3):385–400. https://doi.org/10.1016/0191-8869(86)90014-0
Gao J, Yang X, Jiang Y, Liu H, Ying W, Zhang X (2018) JBench: a dataset of data races for concurrency testing. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196451. ACM, New York, MSR ’18, pp 6–9
Geiger FX, Malavolta I, Pascarella L, Palomba F, Di Nucci D, Bacchelli A (2018) A graph-based dataset of commit history of real-world Android apps. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196460. ACM, New York, MSR ’18, pp 30–33
German DM, Adams B, Hassan AE (2015) A dataset of the activity of the Git super-repository of Linux in 2012. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.66. IEEE Press, Piscataway, MSR ’15, pp 470–473
Gkortzis A, Mitropoulos D, Spinellis D (2018) VulinOSS: a dataset of security vulnerabilities in open-source systems. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196454. ACM, New York, MSR ’18, pp 18–21
Glaser B, Strauss A (1967) The Discovery of Grounded Theory: Strategies for Qualitative Research. Observations, Aldine Publishing, Chicago
Glass RL (1994) An assessment of systems and software engineering scholars and institutions. Journal of Systems and Software 27(1):63–67. https://doi.org/10.1016/0164-1212(94)90115-5
Goeminne M, Claes M, Mens T (2013) A historical dataset for the Gnome ecosystem. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624032. IEEE Press, Piscataway, MSR ’13, pp 225–228
Gonzalez-Barahona JM, Robles G, Izquierdo-Cortazar D (2015) The MetricsGrimoire database collection. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.68. IEEE Press, Piscataway, MSR ’15, pp 478–481
Gousios G (2013) The GHTorrent dataset and tool suite. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624034. IEEE Press, Piscataway, MSR ’13, pp 233–236
Gousios G, Zaidman A (2014) A dataset for pull-based development research. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597122. ACM, New York, MSR ’14, pp 368–371
Gousios G, Vasilescu B, Serebrenik A, Zaidman A (2014) Lean GHTorrent: GitHub data on demand. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597126. ACM, New York, MSR ’14, pp 384–387
Gousios G, Storey MA, Bacchelli A (2016) Work practices and challenges in pull-based development: the contributor’s perspective. In: Proceedings of the 38th international conference on software engineering. https://doi.org/10.1145/2884781.2884826. Association for Computing Machinery, New York, ICSE ’16, pp 285–296
Gu Y (2004) Global knowledge management research: a bibliometric analysis. Scientometrics 61(2):171–190. https://doi.org/10.1023/B:SCIE.0000041647.01086.f4
Habayeb M, Miranskyy A, Murtaza SS, Buchanan L, Bener AB (2015) The Firefox temporal defect dataset. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.73. IEEE Press, Piscataway, MSR ’15, pp 498–501
Hamasaki K, Kula RG, Yoshida N, Cruz AEC, Fujiwara K, Iida H (2013) Who does what during a code review? Datasets of OSS peer review repositories. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624003. IEEE Press, Piscataway, MSR ’13, pp 49–52
Hamming R (1969) One man’s view of computer science. Journal of the ACM 16(1):3–12. https://doi.org/10.1145/321495.321497
Hardwicke TE, Ioannidis JP (2018) Mapping the universe of registered reports. Nature Human Behaviour 2(11):793–796
Harman M, Mansouri SA, Zhang Y (2009) Search based software engineering: a comprehensive analysis and review of trends techniques and applications. Tech. Rep. TR-09-03, Department of Computer Science, King’s College London, and Brunel Business School, Brunel University, London, UK, https://www.researchgate.net/profile/Yuanyuan_Zhang12/publication/228671024_Search_Based_Software_Engineering_A_Comprehensive_Analysis_and_Review_of_Trends_Techniques_and_Applications/links/00b4951811ba6a40eb000000.pdf
Janjic W, Hummel O, Schumacher M, Atkinson C (2013) An unabridged source code dataset for research in software reuse. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624047. IEEE Press, Piscataway, MSR ’13, pp 339–342
Karakoidas V, Mitropoulos D, Louridas P, Gousios G, Spinellis D (2015) Generating the blueprints of the Java ecosystem. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.76. IEEE Press, Piscataway, MSR ’15, pp 510–513
Keivanloo I, Forbes C, Hmood A, Erfani M, Neal C, Peristerakis G, Rilling J (2012) A linked data platform for mining software repositories. In: Proceedings of the 9th working conference on mining software repositories. https://doi.org/10.1109/MSR.2012.6224296. IEEE Press, Piscataway, MSR ’12, pp 32–35
Kim S, Zimmermann T, Kim M, Hassan A, Mockus A, Girba T, Pinzger M, Whitehead EJ Jr, Zeller A (2006) TA-RE: an exchange language for mining software repositories. In: Proceedings of the 3rd international workshop on mining software repositories. https://doi.org/10.1145/1137983.1137990. ACM, New York, MSR ’06, pp 22–25
Kitchenham B (2004) Procedures for performing systematic reviews. Tech. Rep. TR/SE-0401, Department of Computer Science, Keele University, Keele, Staffs, UK, http://www.it.hiof.no/!haraldh/misc/2016-08-22-smat/Kitchenham-Systematic-Review-2004.pdf
Kitchenham B, Pfleeger SL (2003) Principles of survey research: Part 6: data analysis. SIGSOFT Softw Eng Notes 28(2):24–27. https://doi.org/10.1145/638750.638758
Kitchenham BA, Pfleeger SL (2002a) Principles of survey research: Part 2: designing a survey. SIGSOFT Softw Eng Notes 27(1):18–20. https://doi.org/10.1145/566493.566495
Kitchenham BA, Pfleeger SL (2002b) Principles of survey research: Part 3: constructing a survey instrument. SIGSOFT Softw Eng Notes 27(2):20–24. https://doi.org/10.1145/511152.511155
Kitchenham BA, Pfleeger SL (2002c) Principles of survey research: Part 4: questionnaire evaluation. SIGSOFT Softw Eng Notes 27(3):20–23. https://doi.org/10.1145/638574.638580
Kitchenham BA, Pfleeger SL (2002d) Principles of survey research: Part 5: populations and samples. SIGSOFT Softw Eng Notes 27(5):17–20. https://doi.org/10.1145/571681.571686
Kitchenham BA, Pfleeger SL, Pickard LM, Jones PW, Hoaglin DC, Emam KE, Rosenberg J (2002) Preliminary guidelines for empirical research in software engineering. IEEE Trans Softw Eng 28(8):721–734. https://doi.org/10.1109/TSE.2002.1027796
Kotti Z, Spinellis D (2019) Standing on shoulders or feet?: The usage of the MSR data papers. In: Proceedings of the 16th international conference on mining software repositories. https://doi.org/10.1109/MSR.2019.00085. IEEE Press, Piscataway, MSR ’19, pp 565–576
von Krogh G, von Hippel E (2006) The promise of research on open source software. Management Science 52(7):975–983. https://doi.org/10.1287/mnsc.1060.0560
Krüger S, Späth J, Ali K, Bodden E, Mezini M (2018) CrySL: an extensible approach to validating the correct usage of cryptographic APIs. In: Millstein T (ed) Proceedings of the 32nd European conference on object-oriented programming. https://doi.org/10.4230/LIPIcs.ECOOP.2018.10, vol 109. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, ECOOP ’18, pp 10:1–10:27
Krüger S, Späth J, Ali K, Bodden E, Mezini M (2019) CrySL: an extensible approach to validating the correct usage of cryptographic APIs. IEEE Transactions on Software Engineering. https://doi.org/10.1109/TSE.2019.2948910
Krutz DE, Le W (2014) A code clone oracle. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597127. ACM, New York, MSR ’14, pp 388–391
Krutz DE, Mirakhorli M, Malachowsky SA, Ruiz A, Peterson J, Filipski A, Smith J (2015) A dataset of open-source Android applications. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.79. IEEE Press, Piscataway, MSR ’15, pp 522–525
Kupferschmidt K (2018) More and more scientists are preregistering their studies. Should you? Science. https://doi.org/10.1126/science.aav4786
Lamkanfi A, Pérez J, Demeyer S (2013) The Eclipse and Mozilla defect tracking dataset: a genuine dataset for mining bug information. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624028. IEEE Press, Piscataway, MSR ’13, pp 203–206
Lavazza L, Santillo L (2012) Historical data repositories in software engineering: status and possible improvements. In: Proceedings of the 2012 joint conference of the 22nd international workshop on software measurement and the 2012 seventh international conference on software process and product measurement. https://doi.org/10.1109/IWSM-MENSURA.2012.39, pp 221–225
Lazar A, Ritchey S, Sharif B (2014) Generating duplicate bug datasets. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597128. ACM, New York, MSR ’14, pp 392–395
Liebchen GA, Shepperd M (2008) Data sets and data quality in software engineering. In: Proceedings of the 4th international workshop on predictor models in software engineering. https://doi.org/10.1145/1370788.1370799. ACM, New York, PROMISE ’08, pp 39–44
Lotka AJ (1926) The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences 16(12):317–323. http://www.jstor.org/stable/24529203
MacLean AC, Knutson CD (2013) Apache commits: social network dataset. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624020. IEEE Press, Piscataway, MSR ’13, pp 135–138
Madeyski L, Kawalerowicz M (2017) Continuous defect prediction: the idea and a related dataset. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.46. IEEE Press, Piscataway, MSR ’17, pp 515–518
Markovtsev V, Long W (2018) Public Git archive: a big code dataset for all. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196464. ACM, New York, MSR ’18, pp 34–37
Martins P, Achar R, Lopes CV (2018) 50K-C: a dataset of compilable, and compiled, Java projects. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196450. ACM, New York, MSR ’18, pp 1–5
Mauczka A, Brosch F, Schanes C, Grechenig T (2015) Dataset of developer-labeled commit messages. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.71. IEEE Press, Piscataway, MSR ’15, pp 490–493
Merton RK (1968) The Matthew effect in science. Science 159(3810):56–63
Mierle K, Laven K, Roweis S, Wilson G (2005) Mining student CVS repositories for performance indicators. In: Proceedings of the 2nd international workshop on mining software repositories. https://doi.org/10.1145/1082983.1083150. ACM, New York, MSR ’05, pp 1–5
Mitropoulos D, Karakoidas V, Louridas P, Gousios G, Spinellis D (2014) The bug catalog of the Maven ecosystem. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597123. ACM, New York, MSR ’14, pp 372–375
Mukadam M, Bird C, Rigby PC (2013) Gerrit software code review data from Android. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624002. IEEE Press, Piscataway, MSR ’13, pp 45–48
Murakami H, Higo Y, Kusumoto S (2014) A dataset of clone references with gaps. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597133. ACM, New York, MSR ’14, pp 412–415
Noten J, Mengerink JGM, Serebrenik A (2017) A data set of OCL expressions on GitHub. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.52. IEEE Press, Piscataway, MSR ’17, pp 531–534
Novielli N, Calefato F, Lanubile F (2018) A gold standard for emotion annotation in Stack Overflow. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196453. ACM, New York, MSR ’18, pp 14–17
Nussbaum L, Zacchiroli S (2010) The ultimate Debian database: consolidating bazaar metadata for quality assurance and data mining. In: Proceedings of the 7th working conference on mining software repositories. https://doi.org/10.1109/MSR.2010.5463277. IEEE Press, Piscataway, MSR ’10, p 10
Ohira M, Kashiwa Y, Yamatani Y, Yoshiyuki H, Maeda Y, Limsettho N, Fujino K, Hata H, Ihara A, Matsumoto K (2015) A dataset of high impact bugs: manually-classified issue reports. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.78. IEEE Press, Piscataway, MSR ’15, pp 518–521
Ortu M, Murgia A, Destefanis G, Tourani P, Tonelli R, Marchesi M, Adams B (2016) The emotional side of software developers in JIRA. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903505. ACM, New York, MSR ’16, pp 480–483
Paixao M, Krinke J, Han D, Harman M (2018) CROP: linking code reviews to source code changes. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196466. ACM, New York, MSR ’18, pp 46–49
Palomba F, Nucci DD, Tufano M, Bavota G, Oliveto R, Poshyvanyk D, De Lucia A (2015) Landfill: an open dataset of code smells with public evaluation. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.69. IEEE Press, Piscataway, MSR ’15, pp 482–485
Passos L, Czarnecki K (2014) A dataset of feature additions and feature removals from the Linux kernel. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597124. ACM, New York, MSR ’14, pp 376–379
Petersen K, Feldt R, Mujtaba S, Mattsson M (2008) Systematic mapping studies in software engineering. In: Proceedings of the 12th international conference on evaluation and assessment in software engineering. http://dl.acm.org/citation.cfm?id=2227115.2227123. BCS Learning & Development Ltd., Swindon, EASE ’08, pp 68–77
Petersen K, Vakkalanka S, Kuzniarz L (2015) Guidelines for conducting systematic mapping studies in software engineering: An update. Information and Software Technology 64:1–18. https://doi.org/10.1016/j.infsof.2015.03.007
Pfleeger SL, Kitchenham BA (2001) Principles of survey research: Part 1: turning lemons into lemonade. SIGSOFT Softw Eng Notes 26(6):16–18. https://doi.org/10.1145/505532.505535
Piezunka H, Dahlander L (2015) Distant search, narrow attention: how crowding alters organizations’ filtering of suggestions in crowdsourcing. Academy of Management Journal 58(3):856–880. https://doi.org/10.5465/amj.2012.0458
Ponzanelli L, Mocci A, Lanza M (2015) StORMeD: stack overflow ready made data. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.67. IEEE Press, Piscataway, MSR ’15, pp 474–477
Proksch S, Amann S, Nadi S, Mezini M (2016) A dataset of simplified syntax trees for C#. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903507. ACM, New York, MSR ’16, pp 476–479
Raemaekers S, Van Deursen A, Visser J (2013) The Maven repository dataset of metrics, changes, and dependencies. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624031. IEEE Press, Piscataway, MSR ’13, pp 221–224
Robles G (2010) Replicating MSR: a study of the potential replicability of papers published in the mining software repositories proceedings. In: Proceedings of the 7th working conference on mining software repositories. https://doi.org/10.1109/MSR.2010.5463348. IEEE Press, Piscataway, MSR ’10, pp 171–180
Robles G, Arjona Reina L, Serebrenik A, Vasilescu B, González-Barahona JM (2014) FLOSS 2013: a survey dataset about free software contributors: Challenges for curating, sharing, and combining. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597129. ACM, New York, MSR ’14, pp 396–399
Robles G, Ho-Quang T, Hebig R, Chaudron MRV, Fernandez MA (2017) An extensive dataset of UML models in GitHub. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.48. IEEE Press, Piscataway, MSR ’17, pp 519–522
Sadat M, Bener AB, Miranskyy AV (2017) Rediscovery datasets: connecting duplicate reports. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.50. IEEE Press, Piscataway, MSR ’17, pp 527–530
Saha RK, Lyu Y, Lam W, Yoshida H, Prasad MR (2018) Bugs.jar: A large-scale, diverse dataset of real-world Java bugs. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196473. ACM, New York, MSR ’18, pp 10–13
Saini V, Sajnani H, Ossher J, Lopes CV (2014) A dataset for Maven artifacts and bug patterns found in them. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597134. ACM, New York, MSR ’14, pp 416–419
Salinger S, Plonka L, Prechelt L (2008) A coding scheme development methodology using Grounded Theory for qualitative analysis of pair programming. Human Technology: An Interdisciplinary Journal on Humans in ICT Environments 4. https://doi.org/10.17011/ht/urn.200804151350
Sawant AA, Bacchelli A (2015) A dataset for API usage. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.75. IEEE Press, Piscataway, MSR ’15, pp 506–509
Sayyad Shirabad J, Menzies T (2005) The PROMISE repository of software engineering databases, School of Information Technology and Engineering, University of Ottawa, Canada, http://promise.site.uottawa.ca/SERepository
Schermann G, Zumberi S, Cito J (2018) Structured information on state and evolution of Dockerfiles on GitHub. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196456. ACM, New York, MSR ’18, pp 26–29
Shull FJ, Carver JC, Vegas S, Juristo N (2008) The role of replications in Empirical Software Engineering. Empirical Software Engineering 13(2):211–218. https://doi.org/10.1007/s10664-008-9060-1
Sigelman L (1981) Question-order effects on presidential popularity. Public Opinion Quarterly 45(2):199–207. https://academic.oup.com/poq/article-pdf/45/2/199/5432386/45-2-199.pdf
Spacco J, Strecker J, Hovemeyer D, Pugh W (2005) Software repository mining with Marmoset: an automated programming project snapshot and testing system. In: Proceedings of the 2nd international workshop on mining software repositories. https://doi.org/10.1145/1082983.1083149. ACM, New York, MSR ’05, pp 1–5
Spinellis D (2015) A repository with 44 years of Unix evolution. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.64. IEEE Press, Piscataway, MSR ’15, pp 462–465
Spinellis D (2018) Documented Unix facilities over 48 years. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196476. ACM, New York, MSR ’18, pp 58–61
Squire M (2013a) Apache-affiliated Twitter screen names: a dataset. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624043. IEEE Press, Piscataway, MSR ’13, pp 305–308
Squire M (2013b) Project roles in the Apache software foundation: a dataset. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624042. IEEE Press, Piscataway, MSR ’13, pp 301–304
Squire M (2016) Data sets: the circle of life in Ruby hosting, 2003-2015. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903509. ACM, New York, MSR ’16, pp 452–459
Squire M (2018) Data sets describing the circle of life in Ruby hosting, 2003–2016. Empirical Software Engineering 23(2):1123–1152. https://doi.org/10.1007/s10664-017-9581-6
Trockman A, Zhou S, Kästner C, Vasilescu B (2018) Adding sparkle to social coding: an empirical study of repository badges in the npm ecosystem. In: Proceedings of the 40th international conference on software engineering. https://doi.org/10.1145/3180155.3180209. ACM, New York, ICSE ’18, pp 511–522
Vasilescu B, Serebrenik A, Mens T (2013) A historical dataset of software engineering conferences. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624051. IEEE Press, Piscataway, MSR ’13, pp 373–376
Vasilescu B, Serebrenik A, Filkov V (2015) A data set for social diversity studies of GitHub teams. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.77. IEEE Press, Piscataway, MSR ’15, pp 514–517
Wagstrom P, Jergensen C, Sarma A (2013) A network of rails: a graph dataset of Ruby on Rails and associated projects. In: Proceedings of the 10th working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624033. IEEE Press, Piscataway, MSR ’13, pp 229–232
Wallace D (1998) Enhancing competitiveness via a public fault and failure data repository. In: Proceedings of the third IEEE international high-assurance systems engineering symposium. https://doi.org/10.1109/HASE.1998.731610, pp 178–185
Webster J, Watson RT (2002) Analyzing the past to prepare for the future: writing a literature review. MIS Quarterly 26(2):xiii–xxiii
Wermelinger M, Yu Y (2015) An architectural evolution dataset. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.74. IEEE Press, Piscataway, MSR ’15, pp 502–505
Williams JR, Di Ruscio D, Matragkas N, Di Rocco J, Kolovos DS (2014) Models of OSS project meta-information: a dataset of three forges. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597132. ACM, New York, MSR ’14, pp 408–411
Wong WE, Tse T, Glass RL, Basili VR, Chen TY (2011) An assessment of systems and software engineering scholars and institutions (2003–2007 and 2004–2008). Journal of Systems and Software 84(1):162–168. https://doi.org/10.1016/j.jss.2010.09.036
Xu Y, Zhou M (2018) A multi-level dataset of Linux kernel patchwork. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196475. ACM, New York, MSR ’18, pp 54–57
Yamashita A, Abtahizadeh SA, Khomh F, Guéhéneuc YG (2017) Software evolution and quality data from controlled, multiple, industrial case studies. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.44. IEEE Press, Piscataway, MSR ’17, pp 507–510
Yamashita A, Petrillo F, Khomh F, Guéhéneuc YG (2018) Developer interaction traces backed by IDE screen recordings from think aloud sessions. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196457. ACM, New York, MSR ’18, pp 50–53
Yang X, Kula RG, Yoshida N, Iida H (2016) Mining the modern code review repositories: a dataset of people, process and product. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903504. ACM, New York, MSR ’16, pp 460–463
Yu Y, Li Z, Yin G, Wang T, Wang H (2018) A dataset of duplicate pull-requests in GitHub. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196455. ACM, New York, MSR ’18, pp 22–25
Zacchiroli S (2015) The Debsources dataset: two decades of Debian source code metadata. In: Proceedings of the 12th working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.65. IEEE Press, Piscataway, MSR ’15, pp 466–469
Zhang C, Hindle A (2014) A green miner’s dataset: mining the impact of software change on energy consumption. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597130. ACM, New York, MSR ’14, pp 400–403
Zhu C, Li Y, Rubin J, Chechik M (2017) A dataset for dynamic discovery of semantic changes in version controlled software histories. In: Proceedings of the 14th international conference on mining software repositories. https://doi.org/10.1109/MSR.2017.49. IEEE Press, Piscataway, MSR ’17, pp 523–526
Zhu J, Zhou M, Mei H (2016) Multi-extract and multi-level dataset of Mozilla issue tracking history. In: Proceedings of the 13th international conference on mining software repositories. https://doi.org/10.1145/2901739.2903502. ACM, New York, MSR ’16, pp 472–475
Zimmermann T, Di Penta M, Kim S (eds) (2013a) Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, IEEE Computer Society, http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6597024
Zimmermann T, Di Penta M, Kim S, German DM, Bacchelli A (2013b) Welcome from the chairs. In: Proceedings of the 10th working conference on mining software repositories, MSR ’13. https://doi.org/10.1109/MSR.2013.6623995, pp iii–viii
Zogaan W, Sharma P, Mirakhorli M, Arnaoudova V (2017) Datasets from fifteen years of automated requirements traceability research: current state, characteristics, and quality. In: Proceedings of the 25th international requirements engineering conference, IEEE, RE ’17. https://doi.org/10.1109/RE.2017.80, pp 110–121
Acknowledgements
Panos Louridas provided insightful comments on this manuscript. Furthermore, Georgios Gousios’s suggestions on refining the questionnaire were crucial to carrying out the survey. This work has received funding from: the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825328; the GSRT 2016–2017 Research Support (EP-2844-01); and the Research Centre of the Athens University of Economics and Business, under the Original Scientific Publications framework 2019.
Ethics declarations
Conflict of interests
The authors declare that they have no conflict of interest.
Additional information
Communicated by: Yasutaka Kamei and Andy Zaidman
Cite this article
Kotti, Z., Kravvaritis, K., Dritsa, K. et al. Standing on shoulders or feet? An extended study on the usage of the MSR data papers. Empir Software Eng 25, 3288–3322 (2020). https://doi.org/10.1007/s10664-020-09834-7
DOI: https://doi.org/10.1007/s10664-020-09834-7