World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data

Ma, Yuxing; Dey, Tapajit; Bogart, Chris; Amreen, Sadika; Valiev, Marat; Tutko, Adam; Kennard, David; Zaretzki, Russell; Mockus, Audris

doi:10.1007/s10664-020-09905-9

Published: 25 February 2021

Empirical Software Engineering, Volume 26, article number 22 (2021)

Yuxing Ma¹ (ORCID: 0000-0002-3642-8012),
Tapajit Dey¹,
Chris Bogart²,
Sadika Amreen¹,
Marat Valiev²,
Adam Tutko¹,
David Kennard¹,
Russell Zaretzki³ &
Audris Mockus¹

Abstract

Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are the tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flow? To answer such questions, we (a) create a very large and frequently updated collection of version control data covering the entire FLOSS ecosystem, named World of Code (WoC), which can completely cross-reference the authors, projects, commits, blobs, dependencies, and history of that ecosystem, and (b) provide capabilities to efficiently correct, augment, query, and analyze that data. Our current WoC implementation can be updated on a monthly basis and contains over 18B Git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development, leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.

Notes

blackducksoftware.com, fossid.com
https://www.gharchive.org/
https://github.com/ssc-oscar/gather
https://libgit2.org/
CPU: E5-2670, number of nodes: 36, number of cores: 16, memory size: 256 GB
The exclusion only happens when all the conditions are met: a GitHub project has not been seen in the project list in a previous update, GitHub marked this project as a fork, and all the heads already exist in our database.
“git fetch” downloads only new objects from the remote repository
a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data
bidirectional maps between the commit and five objects/attributes and between file and blob
http://www.isthe.com/chongo/tech/comp/fnv/index.html#FNV-1a
https://en.wikipedia.org/wiki/Unix_philosophy
https://github.com/ssc-oscar/oscar.py
https://bitbucket.org/swsc/lookup/src/master/
‘File’ in this figure refers to ‘File name’
By file, we refer to the file name (including folder path) in the rest of our paper.
https://libraries.io/
Section 5.2 shows how WoC can also be used to improve them
https://github.com/ssc-oscar/lookup
The size information for Software Heritage, GHTorrent, and WoC is obtained directly from their official websites. GHArchive, on the other hand, did not provide such detailed information, so we examined its dataset and checked the author ID and project ID fields.
http://boa.cs.iastate.edu/examples/index.php#scheme-use
https://www.gharchive.org/
https://github.com/harishvc/githubanalytics
https://githut.info/
https://github.com/ssc-oscar/BIMAN_bot_detection
https://github.com/woc-hack/thebridge
https://github.com/woc-hack/Workers-Comprehension
https://github.com/woc-hack/TAP
https://github.com/ssc-oscar/Analytics
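
The exclusion rule stated in the note above (a GitHub project is skipped only when all three conditions hold) can be sketched as a small predicate. This is an illustrative sketch only, not the WoC implementation: the `DiscoveredProject` record, the `should_exclude` function, and the set-based lookups are hypothetical names introduced here, and the real pipeline presumably checks heads against its own object database.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class DiscoveredProject:
    """Hypothetical record for one GitHub project found during an update."""
    name: str
    marked_as_fork: bool  # GitHub marks this project as a fork
    head_shas: Set[str] = field(default_factory=set)  # commit hashes of all heads


def should_exclude(project: DiscoveredProject,
                   previously_seen: Set[str],
                   known_commits: Set[str]) -> bool:
    """Return True only when all three conditions from the note are met."""
    # 1. The project was not in the project list of a previous update.
    is_new = project.name not in previously_seen
    # 2. GitHub marked the project as a fork (stored on the record).
    # 3. Every head already exists in the database (subset test).
    all_heads_known = project.head_shas <= known_commits
    return is_new and project.marked_as_fork and all_heads_known
```

Under these assumptions, a newly discovered fork whose heads are all already stored would be skipped, while a fork carrying even one unseen head would still be retrieved.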

Acknowledgments

This work was supported by National Science Foundation (NSF) Awards 1633437, 1901102, and 1925615.

Author information

Authors and Affiliations

Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, Min H. Kao Building, Room 619, 1520 Middle Drive, Knoxville, TN 37996, USA
Yuxing Ma, Tapajit Dey, Sadika Amreen, Adam Tutko, David Kennard & Audris Mockus
Institute for Software Research, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Chris Bogart & Marat Valiev
Department of Business Analytics and Statistics, University of Tennessee, Knoxville, Stokely Management Center, 916 Volunteer Blvd, Knoxville, TN 37916, USA
Russell Zaretzki

Corresponding author

Correspondence to Yuxing Ma.

Additional information

Communicated by: Yasutaka Kamei and Andy Zaidman

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Mining Software Repositories (MSR)

Appendices

Appendix A: Source Code for Cmt2ATShow.perl

Appendix B: Source Code for the custom lsort command in tutorial

About this article

Cite this article

Ma, Y., Dey, T., Bogart, C. et al. World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data. Empir Software Eng 26, 22 (2021). https://doi.org/10.1007/s10664-020-09905-9

Accepted: 30 October 2020
Published: 25 February 2021
DOI: https://doi.org/10.1007/s10664-020-09905-9
