Macro-level software evolution: a case study of a large software compilation
- Cite this article as:
- Gonzalez-Barahona, J.M., Robles, G., Michlmayr, M. et al. Empir Software Eng (2009) 14: 262. doi:10.1007/s10664-008-9100-x
Software evolution studies have traditionally focused on individual products. In this study we scale up the idea of software evolution by considering software compilations composed of a large quantity of independently developed products, engineered to work together. With the success of libre (free, open source) software, these compilations have become common in the form of ‘software distributions’, which group hundreds or thousands of software applications and libraries into an integrated system. We have performed an exploratory case study on one of them, Debian GNU/Linux, finding some significant results. First, Debian has been doubling in size every 2 years, totalling about 300 million lines of code as of 2007. Second, the mean size of packages has remained stable over time. Third, the number of dependencies between packages has been growing quickly. Finally, while C is still by far the most commonly used programming language for applications, use of the C++, Java, and Python languages has significantly increased. The study not only helps to understand the evolution of Debian, but also yields insights into the evolution of mature libre software systems in general.
Keywords: Mining software repositories; Large software collections; Software evolution; Software integrators
1 Introduction
Software evolution studies usually consider single products developed by a coordinated team. However, software systems are commonly composed of a large set of applications and libraries, many of them coming from unrelated parties, and developed by different teams with their own goals. The evolution of those systems presents some specific aspects and characteristics that are worth studying. However, finding all the elements needed for such a study, and especially the source code for the whole system at certain points in time, is not easy. This is probably the reason for the little attention paid to them by researchers in the area of software evolution.
Fortunately, the opportunity of performing such studies has become real with the advent of libre software distributions: collections of software packages engineered to work in coordination, providing the user with a large operating system with many, maybe thousands, of applications. Each package is actually developed by a different group, usually called ‘project’, in relative isolation from the others. The job of creating and maintaining a distribution is mainly about making all packages fit together, and producing installers, package managers, some common look and feel, etc. Examples of libre software distributions are Fedora (Red Hat) Linux, FreeBSD, Ubuntu, and Debian GNU/Linux.
Although each package appears to be developed in isolation, there are relationships and interactions that become apparent when the whole system is considered. As a result, a dichotomy can be identified, similar to the one found in economics: software evolution in the small (the evolution of a single application) versus software evolution in the large (the evolution of compilations of software, composed of many different individual software applications that are combined together to form a system).
For this paper, we have selected one of the most popular libre software distributions, Debian GNU/Linux, and have examined it from a macro point of view. We have studied the stable releases of Debian over a period of nine years. For each release the source code of all applications was downloaded, and their evolution analyzed in terms of number of packages, size of each of them, programming languages used, and interdependencies among packages.
As a result of this analysis, we have found that Debian is an interesting collection composed of applications of varying sizes, with a large proportion of small applications and a few huge ones. Some of them evolve rapidly, while others change at a slower pace. Some applications have not changed throughout the period considered, while others have been removed from the distribution. We have also discovered that, despite being developed by different groups, applications are hardly isolated: they are subject to complex interdependencies that have to be satisfied for the whole system to work. The number of these dependencies tends to explode as the system grows, rendering it more difficult to maintain.
In Debian, developers must show confidence in the interest, usability and maturity of each package they select for the distribution. Given these selection criteria, a large share of all mature libre software ever available for Linux-based systems is present in it. Therefore, Debian can be considered a good proxy for all mature libre software ever developed for such systems. This permits the interpretation of the results of this study in a larger framework, as an overview of the evolution of the landscape of libre software for Linux systems.
The rest of this paper is organized as follows. The next section introduces some of the main characteristics of libre software distributions, also showing previous research related to this study. Section 3 introduces and explains the main research questions addressed. Then, Section 4 details the methodology used for the collection and analysis of the data, with the intention of clarifying the results shown later, in Section 5, which is organized in six subsections: total size, size of packages, maintenance of packages, languages, file sizes, and dependencies. The paper ends with a section on conclusions and further research.
2 Libre Software Distributions and Related Research
Large distributions based on libre software are created in a manner that is quite different from traditional software development. In large non-libre software systems most of the work is performed in-house, with only some pieces licensed from other companies, and some work outsourced to third parties. Even in the case of intense outsourcing, the work is usually performed in close coordination under tightly defined requirements. Libre software, on the contrary, is typically written by small, independent teams of volunteers, sometimes collaborating with paid staff from one or more companies. While projects may interact with each other, in particular where dependencies between the software they produce exist, there is often no central coordination, nor common goals or guidelines.
Therefore, people building and maintaining software distributions are those who have to adapt each package to work in coordination with the rest. Usually, the required modifications are contributed back to the groups performing the actual development, in a continuous provision of feedback. Because of this, although they usually do not develop much code themselves, they have to know in some detail not only the general architecture of the software they are integrating, but also the development process used by the original team.
One of the most visible tasks performed by distributions is the automatic installation and management of packages. Manually installing and upgrading a libre software application is time-consuming and requires certain technical skills that not all users have, such as compiling or configuring the installation of the software. Doing that for the hundreds, if not thousands, of applications that are typically installed in a GNU/Linux system is out of the question, even for experienced users, since it would require a significant effort to manually download and install (or upgrade, when a new release is available) each package. This is precisely the main role that distributions play: to select, test, and prepare applications so that they become packages that are easy to install, upgrade or remove. Unsurprisingly, a number of companies have found this to be a business opportunity, offering a distribution plus some related services, such as support. There are also various community distributions that operate on a non-profit basis like many other libre software projects.
In fact, public availability of the source code of libre software programs, and the possibility of freely redistributing them, has resulted in a large number of libre software distributions. Both characteristics also facilitate their study: several such studies have been published, mainly for the well-known Red Hat and Debian systems. Those studies detail several parameters of the packages contained, their size, and some statistics on the programming languages present, among other issues (Wheeler 2001; Gonzalez-Barahona et al. 2001; Amor et al. 2005).
Among the different libre software distributions, Debian GNU/Linux has been selected for this study because it is one of the most popular, accessible, complete (in terms of number of packages maintained) and best established. Debian is a community effort that has provided a software distribution based on the Linux kernel for well over 12 years. The work of the members of the Debian project is similar to that carried out in other distributions: software integration. Unlike many other distributions, Debian is mostly composed of volunteers who are spread all around the world. As a side-effect, all development infrastructure, including mailing lists, bug tracking, and the source code itself, is publicly available. In addition to integrating and maintaining software packages, members of the Debian project oversee the maintenance of a number of services, such as web sites and wikis.
Software evolution has been a matter of study for more than 30 years (Lehman and Belady 1985; Lehman and Ramil 2001). So far, the scope of software evolution analysis has always been single applications, such as the “classical” analysis of the OS/360 operating system (Lehman and Belady 1985) or, more recently, those on the Linux kernel (Godfrey and Tu 2000) or other well-known libre software applications, including Apache and GCC (Succi et al. 2001). Noteworthy is the proposal of studying the evolution of applications at the subsystem level (Gall et al. 1997), as this introduces the issue of granularity. Our approach considers a complete software compilation to be a system, with the constituent applications and libraries serving the role of subsystems.
The authors are not aware of a study on the evolution of a system integrating many independent software applications. In fact, software compilations have rarely been studied in software engineering, probably because of the constraints found when integrating software from different vendors, such as the restrictions imposed by the license of each piece. It is noticeable that even if one of the most promising steps of software engineering has been to create reusable components or modules, in a similar way as bricks and mortar, little attention has been paid to how the integration of these components evolves. A promising path in this direction has been the study of integration of COTS from a software evolution perspective (Lehman and Ramil 1998).
3 Research Questions
Software compilations are composed of heterogeneous pieces of software from many different sources. Some are developed by large, organized, well funded groups, such as the Mozilla Foundation, while some others are developed by a handful of volunteers. Therefore, they provide a diverse and comprehensive view of the libre software landscape. Furthermore, distributions provide a way to understand how different applications are interrelated (how each depends upon one or several other applications, and vice versa).
In this context, a compilation of the size of Debian can be considered a good proxy of libre software in general, thus offering a macroscopic view of the libre software landscape. Therefore, this paper can be considered to present a holistic study of libre software, analyzing how it is in the large, and drawing some conclusions about the phenomenon itself. Because of that, it is important to characterize the evolution of the main parameters of the distribution: total size (in lines of code), and total number of packages. The specific study of the distribution of the size provides some additional insight into this evolution.
Additionally, the changing demographics over time of programming language use presents itself as an avenue of exploration in this study. We examine the changes in popularity of various languages with respect to Debian applications, and discuss possible reasons for the various shifts and long term trends.
From another point of view, this paper goes a step beyond the single-release analysis of software distributions by considering their evolution over time. In this respect, the main goals differ slightly from those commonly found in software evolution studies. This is due to the different type of work involved in the creation and maintenance of software compilations, which is mainly integration and maintenance, with only some marginal true development. For example, a distribution might require an installer or some other software to perform administration tasks that require development effort, but these cases are few in the context of all the effort needed to produce a new release. There are also some aspects that are common to classical software evolution analysis, such as how the size of the software it includes evolves.
Of course, software compilations must be maintained, but the practice of compilation maintenance differs from that of (single) software system maintenance. For example, Swanson’s well known categories of corrective, adaptive, and perfective software maintenance (Swanson 1976) have little bearing on software compilations. Instead, software compilation maintenance focuses on the integration of new versions of software that have been released. In other words, package maintainers keep track of each software application, and update the distribution with the newer versions. They also check that the package keeps working when the libraries and other programs it uses are updated to new releases. It is not uncommon for package maintainers to become bug-reporters of the applications they maintain. This raises interesting issues worth investigating, such as when packages are included in or removed from the compilation, and the tracking of packages across several releases of the compilation, to learn about their evolution patterns.
From these observations, another important research question emerges: how much code is changing between Debian releases? This can be refined by studying both the size of the packages that remained unchanged between releases, and that of packages that were already present in previous releases, but have changed, at least in part.
The importance of the relationships between packages has already been stressed. Although they are developed and maintained by independent teams, with little or no coordination, in the end all of them have to work together. To perform its job, any application has a large set of requirements on a usually long list of packages that it uses in one form or another. Therefore, an important research question is also how those dependencies evolve, both from a macro (for the distribution as a whole) and a micro (for a specific application) point of view.
In the end, by understanding how all these aspects evolve, we address the general question of how the Debian software distribution is evolving. Since it is a good proxy of mature libre software available for Linux, some insight on its characterization is also obtained.
4 Methodology
Distributions are organized as a set of packages, each one usually corresponding to an application or a library, although they can also correspond to other products, such as documentation. As most libre software distributions do, Debian defines two different types of packages: source and binary. A source package contains the source code needed to produce a binary, installable, package. Once built, a source package results in one or more binary packages.
Debian maintains a Sources file for each release, describing the source packages that it contains. For each package, it contains the name and version, the list of binary packages built from it, the name and e-mail address of the maintainer, and some other information not relevant for this study. A package can be maintained by an individual or a team.
Binary: mozilla, mozilla-dev, libnspr4, libnspr4-dev
Maintainer: Frank Belew (Myth) <firstname.lastname@example.org>
57ee230[...]c66908a 719 mozilla_M18-3.dsc
5329346[...]bad03c8 28642415 mozilla_M18.orig.tar.gz
3adf83d[...]ca20372 18277 mozilla_M18-3.diff.gz

Maintainer: Frank Belew (Myth) <email@example.com>
Replaces: mozilla-dmotif, mozilla-smotif
Depends: libc6 (>= 2.1.2), libglib1.2 (>= 1.2.0),
 libgtk1.2 (>=1.2.7-1), libjpeg62, libpng2, libstdc++2.10,
 libz1, xlib6g (>= 3.3.6-4), libnspr4 (= M18-3), xcontrib
Suggests: postscript-viewer, pdf-viewer, eeyes |
 imagemagick | netpbm | xli | xloadimage | xv, xanim |
 ucbmpeg-play, freeamp | amp | splay | maplay | mpg123 | xmms
Conflicts: mozilla-dmotif, mozilla-smotif
Description: An Open Source WWW browser for X and GTK+
 Mozilla is a sophisticated graphical World-Wide-Web browser,
The Depends field of a package description lists other binary packages needed for it to run successfully. Therefore, packages that satisfy those dependencies should be installed before, or at the same time as, the described package. In the above case, each of the packages in the Depends field (in some cases specific versions of packages, such as a version of libc6 greater than or equal to 2.1.2, or libnspr4 version M18-3) should be installed before, or at the same time as, mozilla. Each of these dependencies is either explicit (one and only one package is specified), one-of-many (a list of packages separated by | is specified, of which only one is required to be installed, for example eeyes | imagemagick | netpbm |... |xv), or an abstract dependency (an identifier for a common one-of-many dependency; e.g., emacsen is commonly used to indicate a choice of a version of either emacs or xemacs to be installed). Pre-Depends is a similar field used by some packages, listing dependencies that should be installed before the installation of the package can proceed.
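As an illustration of the distinction between explicit and one-of-many dependencies, the following minimal Python sketch splits the value of a Depends-style field into its components. The function names and the regular expression are assumptions made for this example and cover only the syntax visible in the excerpt above; this is not the tooling used in the study. Telling an abstract dependency apart from an explicit one requires information beyond the field itself, so the sketch does not attempt it.

```python
import re

# Hypothetical pattern: "name" optionally followed by "(operator version)".
# Only the forms that appear in the excerpt above are handled.
_ITEM = re.compile(
    r"^\s*(?P<name>[^\s(]+)\s*"
    r"(?:\(\s*(?P<op>>=|<=|<<|>>|=|<|>)\s*(?P<version>[^)\s]+)\s*\))?\s*$"
)


def parse_relation(field_value):
    """Split a Depends-style value into a list of dependencies.

    Each dependency is a list of alternatives, and each alternative is a
    (name, operator, version) tuple, with None where no version is given.
    """
    dependencies = []
    for clause in field_value.split(","):
        if not clause.strip():
            continue
        alternatives = []
        for item in clause.split("|"):
            match = _ITEM.match(item)
            if match is None:
                raise ValueError(f"unrecognized dependency item: {item!r}")
            alternatives.append((match["name"], match["op"], match["version"]))
        dependencies.append(alternatives)
    return dependencies


def kind(alternatives):
    """Classify a dependency by its number of alternatives."""
    return "explicit" if len(alternatives) == 1 else "one-of-many"


if __name__ == "__main__":
    for dep in parse_relation("libc6 (>= 2.1.2), libjpeg62, freeamp | amp | splay"):
        print(kind(dep), dep)
```

Representing every dependency as a list of alternatives keeps explicit dependencies as the one-element case, so later processing does not need two code paths.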
A Debian binary package may also have some optional requirements, listed in the Recommends and Suggests fields. The Debian Policy Manual defines packages listed in Recommends as strong but not required dependencies, and as those that would “be found together with this package in all but unusual installations.” Suggests is used to declare optional dependencies that would enhance the original package but are not as common as those listed in the Recommends field. For a detailed formalization of Debian dependencies, and the method used to resolve them, see Mancinelli et al. (2006).
The study presented in this paper started by retrieving the files describing each Debian GNU/Linux stable release between 2.0 and 4.0, that is: 2.0, 2.1, 2.2, 3.0, 3.1, and 4.0. For each of them, the corresponding Sources and Packages files of the i386 binary distribution were considered.
Once retrieved, Sources files were parsed, and the resulting data stored in a database. Then, each source package was retrieved, the programming languages used in it were identified, and the number of source lines of code (SLOC) in each file it contained was counted. The counting and language identification were performed with the SLOCCount tool. This tool analyzes a directory with source code (in our case corresponding to a source package), identifies (by a series of heuristics) the files that contain source code, identifies for each of them (also by means of heuristics) the programming language, and finally counts the number of source lines of code they contain. SLOCCount counts “physical SLOC”, defined as follows: “a physical source line of code (SLOC) is a line ending in a newline or end-of-file marker, and which contains at least one non-whitespace non-comment character.”
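To make the quoted definition of physical SLOC concrete, the following Python sketch counts such lines for a language with line comments. It is a deliberately simplified illustration, not a reimplementation of SLOCCount: the function name and the `line_comment` parameter are assumptions, block comments are not handled, and a comment marker inside a string literal is wrongly treated as the start of a comment.

```python
def count_physical_sloc(text, line_comment="#"):
    """Count lines with at least one non-whitespace, non-comment character.

    Simplification: everything from the first occurrence of `line_comment`
    to the end of the line is treated as a comment, so a first line such as
    "#!/bin/sh" is not counted either.
    """
    count = 0
    for line in text.splitlines():
        code = line.split(line_comment, 1)[0]  # drop the comment part
        if code.strip():                        # something other than whitespace remains
            count += 1
    return count


if __name__ == "__main__":
    script = "#!/bin/sh\n# a comment\n\necho hi  # trailing comment\n   \nexit 0"
    print(count_physical_sloc(script))
```

In this example only the `echo hi` and `exit 0` lines count; the last one ends at the end-of-file marker rather than at a newline, which the definition explicitly allows.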
SLOCCount also identifies identical files using MD5 hashes, and includes heuristics to detect, and avoid counting, automatically generated code. These mechanisms are helpful when analyzing the code, but have some deficiencies. MD5 detects identical files, but not those that have been slightly modified. With respect to automatically generated code, heuristics detect well-known or common cases, but may fail in some scenarios. Nevertheless, SLOCCount is a proven tool and it has been used in studies of Red Hat (Wheeler 2001) and Debian (Gonzalez-Barahona et al. 2001).
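The limitation of hash-based duplicate detection mentioned above can be seen in a short Python sketch. It groups files by the MD5 digest of their contents using the standard `hashlib` module; the function name and the in-memory mapping of paths to bytes are assumptions for illustration, and the sketch does not reproduce how SLOCCount itself organizes this step.

```python
import hashlib
from collections import defaultdict


def group_identical(files):
    """Group paths whose contents are byte-for-byte identical.

    `files` maps a path to the file contents as bytes. Only groups with
    more than one member are returned.
    """
    by_digest = defaultdict(list)
    for path, data in files.items():
        by_digest[hashlib.md5(data).hexdigest()].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    sample = {
        "pkg-a/x.c": b"int x;\n",
        "pkg-b/x.c": b"int x;\n",   # exact copy: detected
        "pkg-b/y.c": b"int x; \n",  # one extra space: not detected
    }
    print(group_identical(sample))
```

A single added space changes the digest completely, which is why slightly modified copies go unnoticed.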
The results of the SLOCCount analysis were then converted into other formats, including both SQL and XML, which were used for further analysis, and for publishing most of the data.
For creating the dependency graphs of each release, the corresponding Packages files were parsed, searching for Depends, Pre-Depends, Suggests and Recommends fields.
For IDGs, the following notation has been used (German et al. 2007): the starting package is depicted as a circle; binary packages are depicted as rectangles; abstract dependencies are depicted as diamonds; and the packages that are always installed in a Debian system are colored in orange (darker). These graphs are similar to those defined by Mancinelli et al. (2006), the main difference being that the latter do not contain nodes for abstract dependencies, which those authors call disjunctive dependencies. Instead, such information is stored as logic predicates. In addition, the graphs of Mancinelli et al. include information about conflicting binary packages, and the nodes are annotated with the version of the package that they require.
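The parsing step can be sketched as follows, assuming the usual Debian control-file layout in which stanzas are separated by blank lines and a line that begins with whitespace continues the previous field. The function names and the sample input are hypothetical, and the sketch is not the parser used in the study.

```python
RELATION_FIELDS = ("Depends", "Pre-Depends", "Recommends", "Suggests")


def parse_stanzas(text):
    """Parse control-file text into a list of dicts, one per stanza."""
    stanzas, current, last_field = [], {}, None
    for line in text.splitlines():
        if not line.strip():                 # blank line: end of stanza
            if current:
                stanzas.append(current)
            current, last_field = {}, None
        elif line[0] in " \t":               # continuation of previous field
            if last_field is None:
                raise ValueError("continuation line without a field")
            joined = current[last_field] + " " + line.strip()
            current[last_field] = joined.strip()
        else:                                # "Field: value"
            name, _, value = line.partition(":")
            last_field = name.strip()
            current[last_field] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


def relation_fields(stanza):
    """Keep only the four fields searched for when building the graphs."""
    return {f: stanza[f] for f in RELATION_FIELDS if f in stanza}


if __name__ == "__main__":
    # Hypothetical, shortened input in the style of a Packages file.
    sample = (
        "Package: mozilla\n"
        "Depends: libc6 (>= 2.1.2),\n"
        " libjpeg62\n"
        "Suggests: eeyes |\n"
        " xv\n"
        "\n"
        "Package: libnspr4\n"
    )
    for stanza in parse_stanzas(sample):
        print(stanza["Package"], relation_fields(stanza))
```

Joining continuation lines into a single value first means that the dependency syntax can afterwards be processed one field at a time.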
Direct dependencies of package p is the set of nodes (binary packages and abstract dependencies) in V that are directly connected to p.
Direct subordinates of package p correspond to those nodes in V (binary packages and abstract dependencies) from which there is an edge to p. The Direct subordinates of p have p as one of their Direct dependencies.
Abstract dependencies represent a choice of one of many packages; only one of them needs to be installed to satisfy the dependency. This implies that there might be multiple ways in which the dependencies of a package can be satisfied (Tucker et al. 2007). We define the Instance of the IDG of a package as a subset of its IDG where each abstract dependency points only to one package (the one that solved that abstract dependency). A specific instance of an IDG of a package p represents how p can be installed in a specific Debian system, with specific packages solving each abstract dependency.
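The notion of an instance can be illustrated with a small Python sketch that enumerates every way of resolving a set of abstract dependencies. It is a simplification under stated assumptions: the choices are treated as independent, conflicts and version constraints are ignored, and the candidate lists (including the `viewer` identifier) are hypothetical.

```python
from itertools import product


def instances(abstract_choices):
    """Yield one resolution per combination of choices.

    `abstract_choices` maps each abstract dependency to the list of
    packages that can satisfy it. Each yielded dict maps every abstract
    dependency to the single package chosen for it.
    """
    names = sorted(abstract_choices)
    for combo in product(*(abstract_choices[name] for name in names)):
        yield dict(zip(names, combo))


if __name__ == "__main__":
    choices = {
        "emacsen": ["emacs", "xemacs"],  # example named in the text
        "viewer": ["xli", "xv"],         # hypothetical abstract dependency
    }
    for resolution in instances(choices):
        print(resolution)
```

With two abstract dependencies of two candidates each there are four possible instances, which hints at how quickly the number of installation alternatives can grow.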
All dependencies of package p is the set of binary packages in its pIDG.
All potential subordinates of package p is the set of all binary packages in a Debian distribution that include p in their IDG.
The set of all dependencies of p corresponds to the most common set of applications that need to be installed before p can function (each abstract dependency—one-of-many, or disjunctive—is resolved to exactly one binary package). All potential subordinates, on the other hand, include any binary package that might require p. For example, in Debian 2.2 mozilla lists xlib6g as one of its direct dependencies; and xlib6g lists xfree86common as one of its direct dependencies. xlib6g and xfree86common are members of all dependencies of mozilla. At the same time, mozilla and xlib6g are members of the set of all potential subordinates of xfree86common. Figure 6 shows the pIDG of mozilla under Debian 2.2.
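Both sets can be computed by a graph traversal, as in the following Python sketch. The graph is given as a mapping from each package to its direct dependencies; the function names are assumptions, and the sample graph is reduced to the three packages named in the example above, omitting mozilla's other dependencies. For all dependencies the graph would hold one resolved instance, whereas for potential subordinates it would hold every alternative.

```python
from collections import deque


def all_dependencies(graph, package):
    """Return every package reachable from `package` via dependency edges."""
    seen = set()
    queue = deque(graph.get(package, ()))
    while queue:
        node = queue.popleft()
        if node in seen or node == package:  # skip repeats and cycles
            continue
        seen.add(node)
        queue.extend(graph.get(node, ()))
    return seen


def all_potential_subordinates(graph, package):
    """Return every package that has `package` among its dependencies."""
    return {
        other for other in graph
        if other != package and package in all_dependencies(graph, other)
    }


if __name__ == "__main__":
    graph = {
        "mozilla": ["xlib6g"],
        "xlib6g": ["xfree86common"],
        "xfree86common": [],
    }
    print(all_dependencies(graph, "mozilla"))
    print(all_potential_subordinates(graph, "xfree86common"))
```

Recomputing the closure for every package keeps the sketch short; on a full distribution, traversing the reversed graph once per package would be the more economical choice.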
5 Results and Observations
In the following subsections, the main results obtained from the study presented in this paper are shown and discussed.
5.1 Total Size
Table 1 Size, in number of source packages and total lines of code, and mean package size (SLOC) of the Debian releases studied
In general terms, using time on the horizontal axis, a smooth growth of the software compilation can be observed, which is compatible with that described by Turski (1996). However, if we considered only releases, which is the methodology preferred by Lehman, the growth would be super-linear. The main reason for this is that the time interval between subsequent releases has been growing for the most recent ones. However, given that the Debian project has not been actively seeking to increase the release interval, Turski’s model seems more appropriate in this case.
5.2 Size of Packages
A small number of large packages (over 100 KSLOC) exist, with their size increasing over time, as the sixth law of software evolution predicts (Lehman et al. 1997). Perhaps the most significant fact is that the average size of packages is relatively stable, around 30 KSLOC for Debian 4.0 and 23 KSLOC for other releases (see Table 1). Currently, we lack an authoritative explanation for this phenomenon, but we have several hypotheses. One of them is that libre software production tends to grow mainly by creating new, more specialized, smaller packages (that can be developed by a handful of developers), rather than large, complex ones (that require a large software development team). With time, some of the most successful small packages may attract more interest and developers, and start to grow. Perhaps the total mixture in Debian is so rich that, while many packages grow in size, smaller ones are included, causing the average to stay approximately constant.
5.3 Maintenance of Packages
Packages in Debian are identified by a name, a version (which should match the version of the package as defined by its original developers) and a Debian package revision number with the following format: ⟨package name⟩-⟨version number⟩-⟨revision number⟩. For example, in Debian 4.0 the package for Mozilla’s Firefox is identified as mozilla-firefox-18.104.22.168-1, which corresponds to version 22.214.171.124, first revision of the Debian package (the revisions of the package are changes to the package specification, as described in Section 4). Except for dynamic libraries, Debian package names are rarely changed. This allows us to track packages from one release of Debian to another.
One of the main tasks of Debian maintainers is to track new versions of software packages, re-package them, and update the package descriptions accordingly. Whenever a new version of a package is released (either a major release, or a minor one) it is updated, and its identifier changed. This allows the assumption that if the version of a package in Debian has not changed, then the original package has not changed enough to warrant a new package version. It is also possible that the package is no longer maintained, but still useful enough to warrant its inclusion in a distribution.
For any given pair of Debian releases we can classify packages into three sets: common (those that appear in both distributions), removed (those that are in the older distribution, but not in the newer one), and new (those that appear in the newer distribution but not in the older one). Common packages include unchanged packages, those with the same version number in both distributions.
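This classification maps directly onto set operations, as the following Python sketch shows. Each release is represented as a mapping from source package name to version string; the function name, the package names, and the versions in the usage example are hypothetical, and comparing the complete version string is an assumption about what counts as "the same version number".

```python
def classify(older, newer):
    """Classify packages across two releases.

    `older` and `newer` map a source package name to its version string.
    """
    common = older.keys() & newer.keys()
    return {
        "common": set(common),
        "removed": older.keys() - newer.keys(),
        "new": newer.keys() - older.keys(),
        # Unchanged packages are the subset of common ones whose version
        # string is identical in both releases.
        "unchanged": {name for name in common if older[name] == newer[name]},
    }


if __name__ == "__main__":
    # Hypothetical package names and versions, for illustration only.
    older = {"alpha": "1.0-1", "beta": "2.3-4", "gamma": "0.9-2"}
    newer = {"alpha": "1.0-1", "beta": "3.0-1", "delta": "1.1-1"}
    for label, names in classify(older, newer).items():
        print(label, sorted(names))
```

Because unchanged packages are computed from the common set, they are always a subset of it, matching the statement that common packages include the unchanged ones.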
Table 2 Number of packages (and SLOC of those packages) common and unchanged, for each release of Debian, with respect to release 2.0, and number of files in unchanged packages
Table 3 Number of packages (and SLOC of those packages) common and unchanged, for each release of Debian, with respect to release 4.0, and number of files in unchanged packages
Out of the 1,096 packages included in Debian 2.0, 721 can be found in 4.0 (common packages). This means that only around 30% of the packages in Debian 2.0 were removed by the time Debian 4.0 was released, nine years later. For comparison, the number of packages of the 3.1 release that are still present in 4.0 is 7,300, out of a total of 10,106, which gives a similar percentage of removed packages.
With respect to unchanged packages, release 4.0 includes 132 with the same version number as they had in Debian 2.0. In other words, no less than 15% of the source packages included in Debian 2.0 are still the same in Debian 4.0, 9 years later.
Table 3 compares 4.0 with the previous releases. Even though a large percentage of Debian 2.0 remains unchanged in 4.0, such code is very small with respect to the current size of 4.0.
It is also important to notice that the number of files in unchanged packages, as presented in Tables 2 and 3, does not reflect the total number of files unchanged, which is higher: there are many files that do not change between Debian releases even when the version number of their package changes. Something similar can be said for the unchanged number of SLOC in those tables: it refers only to the size of packages that did not change. But outside those packages, many other files also did not change.
5.4 Programming Languages
Top programming languages in Debian 4.0, in MSLOC, for each Debian release studied, sorted by their importance in Debian 4.0
The most used language in all Debian releases is C, by a wide margin over the second, C++. However, the evolution of their shares between the first and last releases analyzed, falling from 77% to 51% in the case of C and rising from 6% to 19% for C++, tells different stories. While the relative importance of C is diminishing gradually, that of C++, and other languages, is growing from release to release. It can also be noticed that despite these trends, the absolute size of the code written in C has been growing in all releases, from about 19 MSLOC in Debian 2.0 to more than 147 MSLOC in Debian 4.0. It just happens that it is not growing as quickly as other languages.
The case of Shell, in a solid third place, has mainly to do with its presence in almost any package, of any kind. With the entry of increasingly smaller packages in the latest Debian releases, all with some Shell code, the total share of Shell is growing accordingly.
The most rapid entry in this top-8 of the languages in Debian is certainly Java, which grows from a marginal 0.5% in Debian 3.0 to 1.7% in 3.1 and 3.1% in Debian 4.0. Although it is still far from the top-three languages, it is currently in a strong fourth position. The main reason for this is the availability of large applications, such as Eclipse or Azureus. It is important to notice that the releases under study include neither the Sun Java Runtime Environment nor the Sun Java Development Kit, due to licensing issues. Although there are other Java runtimes and development kits, it is quite possible that this causes an underrepresentation of Java, since for most other languages, Debian includes at least one of the usual development (compiling or scripting) environments.
Yet in absolute terms C has grown threefold during this period, while the total number of SLOC has grown tenfold. At the same time, some scripting languages (Shell, Python and Perl) have undergone an extraordinary growth, all of them multiplying their presence by factors greater than seven.
In terms of SLOC, some programming languages that could be considered uncommon account for a significant share of the distribution. This is because, even though they are present in a small number of packages, these packages are large. For example, Ada accounts for a total of 576 KSLOC in Debian 3.0. But 430 KSLOC come from three packages (Gnat, an Ada compiler; libgtkada, a language binding for the GTK+ library; and Asis, a system to manage sources of Ada programs). LISP follows a similar pattern: it accounts for approximately 4 MSLOC in Debian 3.0, but 1.2 MSLOC of these come from just two packages: GNU Emacs and XEmacs.
5.5 File Sizes
Mean file size for some programming languages
This is especially noteworthy taking into account the large differences in SLOC for those languages in each release. For example, for C the mean length lies between 260 and 295 SLOC per file, whereas in C++ this value is between 140 and 196. An exception to this behavior can be observed for the Shell language, which has tripled its size from Debian 2.0 to Debian 4.0. This may be because the Shell language is peculiar: almost all packages include something in Shell for their installation, configuration or as glue code. It is likely that what happens is that these scripts get more complex over time, and thus grow over the years. This adds up to the fact that Shell programs are seldom divided into several files: if there is more functionality, usually they just get longer.
It is also remarkable that procedural languages usually have larger average file lengths than object-oriented languages. For example, the files in C or YACC are usually larger, on average, than those in C++. This suggests that class inheritance or other characteristics of object-oriented languages are somehow reflected in shorter file sizes.
Libre software, just like any other type of software, is designed to be modular. Software reuse is particularly easy in libre software, as there are no economic constraints: most libre projects can use the results of other libre projects without having to pay for that privilege. The only requirement is that the license of the module to be used be compatible with the license of the software that wants to use it. For example, GPL-licensed software is able to use a BSD-licensed library without any extra arrangements. See Rosen (2004) for a discussion of the main libre software licenses and their compatibility.
Number of dependencies and subordinates of binary applications in different Debian releases
In Debian 2.0 the packages with the most dependencies had 19 (python-gdk-imlib, boot-floppies and libhdf4-altdev). In Debian 4.0 the package with the largest number of dependencies is kde, with 561, followed by gnome, with 486. kde and gnome are sets of GUI applications for the Unix desktop, neither of which is present in Debian 2.0.
Both kde and gnome are bundles of packages. In practical terms this means that they do not have any source code associated: when these packages are installed, the bundle is installed. This raises three noteworthy issues: first, from the point of view of the user installing such bundles, these collections of packages operate as a single software product; second, it can be argued that these packages inflate the average number of dependencies without adding any new source code themselves; and third, they can be considered a great demonstration of the power of component-oriented software engineering, where a “new” application, the bundle, can be created from many components without writing a single line of code.
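The way a bundle with no source code of its own can accumulate a large dependency count can be illustrated with a small graph traversal. This is a minimal sketch, assuming that "all dependencies" means the transitive closure of the dependency relation and that "subordinates" are the packages that depend, directly or indirectly, on a given package; the package names and the toy graph are hypothetical, not data from the study.

```python
from collections import deque


def all_dependencies(package, depends):
    """Return every package reachable from `package` via dependency edges."""
    seen = set()
    queue = deque(depends.get(package, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(depends.get(current, ()))
    seen.discard(package)  # a cycle must not make a package depend on itself
    return seen


def all_subordinates(package, depends):
    """Return every package that depends, directly or not, on `package`."""
    reverse = {}
    for pkg, deps in depends.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(pkg)
    return all_dependencies(package, reverse)


# Hypothetical toy graph: "desktop-bundle" ships no code of its own.
depends = {
    "desktop-bundle": {"editor", "browser"},
    "editor": {"gui-lib"},
    "browser": {"gui-lib", "net-lib"},
    "gui-lib": {"libc"},
    "net-lib": {"libc"},
    "libc": set(),
}

print(len(all_dependencies("desktop-bundle", depends)))  # size of the closure
print(len(all_subordinates("libc", depends)))            # packages relying on libc
```

In this toy graph the bundle contributes no SLOC but sits at the top of the dependency count, while the base library sits at the top of the subordinate count, mirroring the roles of kde or gnome and of a core library in the real distribution.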
Evolution of the number of all dependencies for some selected binary packages, for the studied Debian releases
Evolution of the number of all potential subordinates for selected binary packages
In Subsection 5.2 it was highlighted that many of the newer applications are very small. It is now possible to argue that applications can be smaller because there are more packages available, including libraries, which they can depend upon and reuse. In other words, applications can be smaller, but at the same time they can be more powerful.
6 Conclusions and Further Research
In this paper we have shown the results of a study on the evolution of the stable releases of Debian from 1998 to 2007. We have analyzed and presented the evolution of the size of their source code (measured in lines of source code), the number and size of their packages, the changed and unchanged packages, the use of programming languages, and the dependencies between packages.
Of the many findings from this study, one observation in particular stands out: stable releases double in size (measured by number of packages or by lines of code) approximately every two years. This, when combined with the huge size of the system (about 300 MSLOC and 10,000 packages in 2007) may pose significant problems for the management of its future evolution, something that has probably influenced the delays experienced for the last stable releases.
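The notion of a doubling period corresponds to a simple exponential growth model. The sketch below is only illustrative: the function names are hypothetical, the numeric inputs in the usage lines are round example values rather than measurements from the study, and the projection is an extrapolation under the assumption that the doubling period stays constant, not a forecast.

```python
import math


def doubling_time(size_start, size_end, years):
    """Doubling period implied by exponential growth between two sizes."""
    if size_start <= 0 or size_end <= size_start or years <= 0:
        raise ValueError("need positive sizes, actual growth and elapsed time")
    return years * math.log(2) / math.log(size_end / size_start)


def projected_size(size_now, years_ahead, doubling_years):
    """Size after `years_ahead` years if the doubling period holds."""
    return size_now * 2 ** (years_ahead / doubling_years)


# Illustrative values only: a system that quadruples in four years
# has doubled twice, i.e. once every two years.
print(doubling_time(100.0, 400.0, 4.0))

# Extrapolating about 300 MSLOC with a two-year doubling period.
print(projected_size(300.0, 4.0, 2.0))
```

Such a model makes explicit why sustained doubling is hard to manage: every additional doubling period adds as much code as the whole history of the system up to that point.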
During the period under study, the mean size of packages has remained almost constant, which means that the number of packages has grown linearly with the size of the system in SLOC. Debian 4.0 has 10 times more packages than Debian 2.0. In order to cope with this growth, Debian must increase its number of package maintainers, the number of packages under the responsibility of each maintainer, or both. Such growth, however, is not easy to cope with, and causes problems of its own, especially in the area of coordination.
With respect to the absolute figures, it can be noted that Debian 4.0 is probably one of the largest coordinated software collections in history, and almost certainly the largest one in the domain of general-purpose software for desktops and servers. This means that the human team maintaining it, which also has the peculiarity of consisting entirely of volunteers, is exploring the limits of how to assemble and coordinate such a huge quantity of software. Therefore, the techniques and processes they employ to maintain a certain level of quality, a reasonable speed of updating, and a release process that delivers usable stable versions are worth studying, and can certainly be of use in other domains that have to deal with large, complex collections of software.
As far as programming languages are concerned, C is the most commonly used, although it is gradually losing its dominance. Scripting languages (Perl, Python, Shell), C++ and Java are those with the highest growth in the newer releases, whereas most other compiled languages have even lower growth rates than C. These variations also imply that the Debian team has to include developers with skills in new (for Debian) programming languages in order to maintain the evolving shares. Although Debian maintainers do not develop the packages themselves, they must have a detailed understanding of their internal workings. Consequently, proficiency in the native programming language is a de facto necessity for them. By looking at the trends in languages used within the distribution, the project could estimate how many developers fluent in a given language will be needed. In addition, the evolution of the different languages can also be considered as an estimate of how libre software is evolving in terms of languages used, although some of them, such as Java, are certainly misrepresented.
One of the most surprising results has been the high number of packages that are present in the latest release exactly as they were in Debian 2.0, 9 years before. In general, the presence of unchanged packages between any two releases has been studied in detail, finding that there is a large share of them (with respect to the common packages, that is, those present with the same name in both releases). This indicates that a large share of the code in Debian did not require maintenance for long periods of time, or that maintenance was not performed on it, but the package was still found to have sufficient quality to be included in the distribution.
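The share of unchanged packages relative to the common packages can be expressed with basic set operations. This is a minimal sketch, assuming (as a simplification that may differ from the exact criterion applied in the study) that a package counts as unchanged when its version string is identical in both releases; the function name and the sample data are hypothetical.

```python
def unchanged_share(old_release, new_release):
    """Fraction of common packages whose version is identical in both releases.

    Each release is a dict mapping package name -> version string.
    Common packages are those present with the same name in both releases.
    """
    common = old_release.keys() & new_release.keys()
    if not common:
        return 0.0
    unchanged = {name for name in common
                 if old_release[name] == new_release[name]}
    return len(unchanged) / len(common)


# Hypothetical sample data, not taken from any Debian release.
old = {"pkg-a": "1.0", "pkg-b": "2.1", "pkg-c": "0.9"}
new = {"pkg-a": "1.0", "pkg-b": "3.0", "pkg-d": "1.2"}

print(unchanged_share(old, new))
```

Here pkg-c and pkg-d are ignored because they are not common to both releases, so the share is computed only over pkg-a and pkg-b.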
Using dependency information, we have shown that packages are highly interrelated, and as Debian evolves, the total number of dependencies grows quickly. We have also seen that the packages with the interpreters for some scripting languages, Perl and Python, are among those used by the most packages, and that the C run time library, libc6, is required by almost every package.
From a combination of the dependency information and the study of the size of the packages, we have learned that the growing number of small packages is possible because they can use many other components in the distribution. That is, the modularity of Debian, understood as a large collection of components, is allowing developers to build powerful yet small applications that take advantage of tens of other packages.
In summary, the study of large libre software distributions such as Debian has proven to be revealing, not only of how they are evolving over time, but also of how individual applications interact among themselves. The latter finding shows how these distributions, where applications and libraries are really ready to be used by any other application, foster the composition and code reuse at a new level. This kind of result emerges only after studying the system as a whole, although it mainly impacts how individual applications are built.
As further work, several research lines have been opened by this study. For example, the evolution of code artifacts shown in it could be put in the context of the activity of the volunteers doing all the packaging work. While some work has been done in this area (Michlmayr and Hill 2003), more research needs to be performed before a link can be established between the evolution of the skills and size of the developer population, the complexity and size of the distribution, the processes and activities performed by the project, and the quality of the resulting product. Only by understanding the relationships between all these parameters can reasonable measures be proposed to improve the quality of the software distribution, or to shorten the release cycle without harming the reliability and stability of the releases.
Another promising line is related to the further study of the evolution of dependencies. The trustability of an application depends not only on its own characteristics, but also on those of all the components (packages, in our case) that it is using. Therefore, there should be a balance between the convenience of using more and more external packages, for reasons of functionality, modularity and code reuse, and the convenience of not using too many. At the very least, developers should consider carefully how these packages affect the trustability of the whole application. This balance could be studied over time, relating the different packages in the dependency set to bug reports and their relevance.
In general, all studies that relate the kind of analysis shown in this paper to other sources of information, such as the issue tracking databases of the projects, the mailing lists used for maintenance of the packages, the usage information available from the Debian Popularity Contest, etc., will allow for more interesting results. In the end, if Debian and other distributions are to be conceived as a rich ecosystem, more research is needed before we can model the relationship between their more important parameters.
Throughout this paper we will use the term “libre software” to refer to any code that conforms either to the definition of “free software” (according to the Free Software Foundation) or “open source software” (according to the Open Source Initiative).
The original Sources file in which this entry can be found is in http://www.debian.org/mirror/list.
The Debian policy requires the shared object name of a library to be part of the package name. This permits different versions of the library to coexist in the same computer.
It is possible for a package to be in active development and yet not have released a new version in time to be included in a new Debian distribution. But this is unlikely, given that Debian distributions are released several years apart, while libre software projects tend to ‘release early, release often’.
We thank the anonymous reviewers for their helpful comments and suggestions.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.