The Debsources Dataset: two decades of free and open source software

Abstract

We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset spans more than 3 billion lines of source code, together with metadata about them: size metrics (lines of code, disk usage), developer-defined symbols (ctags), file-level checksums (SHA1, SHA256, TLSH), file media types (MIME), release information (which version of which package, containing which source code files, was released when), and license information (GPL, BSD, etc.). The Debsources Dataset comes as a set of tarballs containing deduplicated source code files organized by their SHA1 checksums (the source code), plus a portable PostgreSQL database dump (the metadata). We run a case study to show how the Debsources Dataset can be used to easily and efficiently instrument very long-term analyses of the evolution of Debian from various angles (size, granularity, licensing, etc.), capturing major FOSS trends of the past two decades. The Debsources Dataset is Open Data, released under the terms of the CC BY-SA 4.0 license, and available for download from Zenodo with DOI reference 10.5281/zenodo.61089.
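
To illustrate what organizing deduplicated files by checksum means in practice, the following is a minimal Python sketch, not the dataset's actual tooling. It computes SHA1 and SHA256 digests of files and groups paths that share the same content. The function names `file_checksums` and `deduplicate` are hypothetical, and TLSH is omitted because it is not part of the Python standard library.

```python
import hashlib
from pathlib import Path


def file_checksums(path, chunk_size=1 << 20):
    """Return the (sha1, sha256) hex digests of a file, reading it in chunks."""
    sha1 = hashlib.sha1()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
            sha256.update(chunk)
    return sha1.hexdigest(), sha256.hexdigest()


def deduplicate(paths):
    """Group paths by SHA1 digest: each key stands for one unique file content."""
    by_sha1 = {}
    for p in paths:
        sha1, _ = file_checksums(p)
        by_sha1.setdefault(sha1, []).append(Path(p))
    return by_sha1
```

Under this scheme, files with identical content found in different packages or releases map to a single SHA1 key, so each content only needs to be stored once. How the real tarballs lay out files on disk is not specified here and is not assumed by the sketch.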

Notes

  1. https://www.debian.org.

  2. http://www.dwheeler.com/sloccount/.

  3. https://github.com/AlDanial/cloc.

  4. http://www.darwinsys.com/file/.

  5. http://ctags.sourceforge.net/.

  6. http://www.postgresql.org.

  7. https://zenodo.org/.

  8. https://packages.debian.org/sid/debmirror.

  9. A list of Debian mirrors organized by geographical location is available at https://www.debian.org/mirror/list.

  10. https://packages.debian.org/sid/python-debian.

  11. https://en.wikipedia.org/wiki/List_of_Debian_releases.

  12. https://github.com/AlDanial/cloc.

  13. http://www.dwheeler.com/sloccount/.

  14. http://ctags.sourceforge.net/.

  15. https://www.westgrid.ca/.

  16. Note that two different SLOC metrics are available in the dataset: as computed by sloccount and by cloc. Each tool has its strengths and weaknesses. For this case study we use sloccount numbers.

  17. https://www.blackducksoftware.com/top-open-source-licenses.

  18. http://bugs.debian.org/740883.

  19. http://www.ubuntu.com/.

Author information

Corresponding author

Correspondence to Stefano Zacchiroli.

Additional information

This work has been partially performed at IRILL, center for Free Software Research and Innovation in Paris, France (http://www.irill.org). Unless noted otherwise, all URLs in the text were retrieved on September 1st, 2016. Authors are listed alphabetically.

Communicated by: Romain Robbes, Martin Pinzger and Yasutaka Kamei

About this article

Cite this article

Caneill, M., Germán, D.M. & Zacchiroli, S. The Debsources Dataset: two decades of free and open source software. Empir Software Eng 22, 1405–1437 (2017). https://doi.org/10.1007/s10664-016-9461-5

Keywords

  • Software evolution
  • Source code
  • Free software
  • Open source
  • Debian
  • Dataset