World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data

Empirical Software Engineering

Abstract

Open source software (OSS) is essential for modern society. While substantial research has been done on individual (typically central) projects, only a limited understanding exists of the periphery of the entire OSS ecosystem. For example, how are the tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flow? To answer such questions, we: a) create a very large and frequently updated collection of version control data from the entire FLOSS ecosystem, named World of Code (WoC), that can completely cross-reference authors, projects, commits, blobs, dependencies, and history of the FLOSS ecosystem; and b) provide capabilities to efficiently correct, augment, query, and analyze that data. Our current WoC implementation can be updated on a monthly basis and contains over 18B Git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in several research tasks. In particular, we find that it can support trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development, leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis for determining the structure and evolution of the relationships that drive FLOSS activities and innovation.

Notes

  1. blackducksoftware.com, fossid.com

  2. https://www.gharchive.org/

  3. https://github.com/ssc-oscar/gather

  4. https://libgit2.org/

  5. CPU: E5-2670, number of nodes: 36, number of cores: 16, memory size: 256 GB

  6. Exclusion happens only when all of the following conditions are met: the GitHub project was not seen in the project list of a previous update, GitHub marks the project as a fork, and all of its heads already exist in our database.

  7. “git fetch” downloads only new objects from the remote repository

  8. a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data

  9. bidirectional maps between the commit and five objects/attributes and between file and blob

  10. http://www.isthe.com/chongo/tech/comp/fnv/index.html#FNV-1a

  11. https://en.wikipedia.org/wiki/Unix_philosophy

  12. https://github.com/ssc-oscar/oscar.py

  13. https://bitbucket.org/swsc/lookup/src/master/

  14. ‘File’ in this figure refers to ‘File name’

  15. By file, we refer to the file name (including folder path) in the rest of our paper.

  16. https://libraries.io/

  17. Section 5.2 shows how WoC can also be used to improve them

  18. https://github.com/ssc-oscar/lookup

  19. The size information for Software Heritage, GHTorrent, and WoC was obtained directly from their official websites. GHArchive, on the other hand, did not provide such detailed information, so we examined its dataset and checked the author ID and project ID fields.

  20. http://boa.cs.iastate.edu/examples/index.php#scheme-use

  21. https://www.gharchive.org/

  22. https://github.com/harishvc/githubanalytics

  23. https://githut.info/

  24. https://github.com/ssc-oscar/BIMAN_bot_detection

  25. https://github.com/woc-hack/thebridge

  26. https://github.com/woc-hack/Workers-Comprehension

  27. https://github.com/woc-hack/TAP

  28. https://github.com/ssc-oscar/Analytics
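
Two of the notes above are concrete enough to sketch in code: the fork-exclusion rule of note 6 and the FNV-1a hash referenced in note 10. The following is a minimal Python sketch, not WoC's actual implementation. The function and parameter names (`should_exclude_project`, `fnv1a_32`, `known_heads`, and so on) are hypothetical, and the choice of the 32-bit FNV-1a variant is an assumption, since this excerpt does not say which width WoC uses or what it hashes.

```python
from typing import Iterable, Set

FNV32_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV32_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Return the 32-bit FNV-1a hash of data (width is an assumption)."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte                           # FNV-1a: XOR the byte in first,
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # then multiply, keeping 32 bits
    return h


def should_exclude_project(seen_in_previous_update: bool,
                           marked_as_fork: bool,
                           heads: Iterable[str],
                           known_heads: Set[str]) -> bool:
    """Hypothetical predicate for the rule in note 6.

    A project is excluded only when all three conditions hold: it was not
    seen in a previous update, GitHub marks it as a fork, and every one of
    its heads is already stored.
    """
    return (not seen_in_previous_update
            and marked_as_fork
            and all(head in known_heads for head in heads))


if __name__ == "__main__":
    known = {"c0ffee", "deadbeef"}
    # New fork whose heads are all known: excluded.
    print(should_exclude_project(False, True, ["c0ffee"], known))
    # New fork with one unseen head: kept.
    print(should_exclude_project(False, True, ["c0ffee", "abc123"], known))
    print(hex(fnv1a_32(b"foobar")))
```

Because the three conditions are joined with `and`, a fork that carries even one head not yet in the database is still retrieved, which matches the "all the conditions" wording of the note.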

Acknowledgments

This work was supported by National Science Foundation (NSF) Awards 1633437, 1901102, and 1925615.

Author information

Corresponding author

Correspondence to Yuxing Ma.

Additional information

Communicated by: Yasutaka Kamei and Andy Zaidman

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Mining Software Repositories (MSR)

Appendices

Appendix A: Source Code for Cmt2ATShow.perl

[Figure j: source code listing for Cmt2ATShow.perl, not reproduced in this copy]

Appendix B: Source Code for the custom lsort command in tutorial

[Figure k: source code listing for the custom lsort command, not reproduced in this copy]

Cite this article

Ma, Y., Dey, T., Bogart, C. et al. World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data. Empir Software Eng 26, 22 (2021). https://doi.org/10.1007/s10664-020-09905-9

  • DOI: https://doi.org/10.1007/s10664-020-09905-9
