Empirical Software Engineering, Volume 22, Issue 3, pp 1405–1437

The Debsources Dataset: two decades of free and open source software

  • Matthieu Caneill
  • Daniel M. Germán
  • Stefano Zacchiroli


We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset covers more than 3 billion lines of source code, together with metadata about them, including size metrics (lines of code, disk usage), developer-defined symbols (ctags), file-level checksums (SHA1, SHA256, TLSH), file media types (MIME), release information (which version of which package, containing which source code files, was released when), and license information (GPL, BSD, etc.). The Debsources Dataset comes as a set of tarballs containing deduplicated source code files organized by their SHA1 checksums (the source code), plus a portable PostgreSQL database dump (the metadata). A case study shows how the Debsources Dataset can be used to easily and efficiently instrument very long-term analyses of the evolution of Debian from various angles (size, granularity, licensing, etc.), giving a view of major FOSS trends of the past two decades. The Debsources Dataset is Open Data, released under the terms of the CC BY-SA 4.0 license, and available for download from Zenodo with DOI reference 10.5281/zenodo.61089.
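
To make the idea of deduplicated files organized by SHA1 checksum concrete, the following is a minimal Python sketch of a content-addressed store. It is an illustration only, not the dataset's actual tooling: the abstract does not state the directory layout inside the tarballs, so the two-character sharding scheme and the names `store_path` and `add_file` are assumptions made for this example.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def sha1_of_bytes(data: bytes) -> str:
    """Return the hexadecimal SHA1 digest of a file's content."""
    return hashlib.sha1(data).hexdigest()


def store_path(root: Path, digest: str) -> Path:
    # Hypothetical layout: shard by the first two hex characters of the
    # digest. The real layout of the dataset tarballs may differ.
    return root / digest[:2] / digest


def add_file(root: Path, data: bytes) -> Tuple[str, bool]:
    """Store content under its SHA1 digest.

    Returns (digest, newly_stored). Identical content always maps to the
    same path, so a second copy is detected and not written again.
    """
    digest = sha1_of_bytes(data)
    target = store_path(root, digest)
    if target.exists():
        return digest, False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return digest, True
```

Under this scheme, two packages shipping a byte-identical file contribute a single stored object, and the metadata database only needs to record the checksum to refer to it.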


Software evolution, Source code, Free software, Open source, Debian, Dataset



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Matthieu Caneill (1)
  • Daniel M. Germán (2)
  • Stefano Zacchiroli (3, 4)

  1. Université Grenoble Alpes, Grenoble, France
  2. University of Victoria, Victoria, Canada
  3. Sorbonne Paris Cité, IRIF, UMR 8243, CNRS, Université Paris Diderot, Paris, France
  4. Inria, Paris, France
