In the following subsections, the main results obtained from the study presented in this paper are shown and discussed.
The total size of the six studied releases of Debian is shown in Table 1. It presents, for each release, the date of publication, the total number of SLOC (sum of the SLOC of all packages in the distribution), the number of packages it contains, and the mean package size in SLOC. In nine years the number of packages in Debian and the total number of lines of code have grown by an order of magnitude, while the average size of a package has remained relatively stable.
Figure 2 shows the size of each distribution with respect to time. Although the number of points is insufficient to obtain a statistically significant model, we can infer from the current data that the Debian distribution has doubled in size, in terms of both source lines of code and number of packages, around every 2 years. This growth was fastest at the beginning of the period: from July 1998 to August 2000 we observe an increase of 135%. In later releases this pace has slowed: for example, between July 2002 and June 2005, a period of almost 3 years, the source code base did not reach a 100% increase.
In general terms, with time on the horizontal axis, a smooth growth of the software compilation can be observed, which is compatible with that described by Turski (1996). However, if we considered only releases, which is the methodology preferred by Lehman, the growth would be super-linear. The main reason for this is that the time interval between subsequent releases has been growing for the most recent ones. However, given that the Debian project has not been actively seeking to increase the release interval, Turski’s model seems more appropriate in this case.
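The doubling-time reading of these figures can be checked with a short calculation. The following is a minimal sketch, assuming constant exponential growth between two releases; the function name is hypothetical, and the day of the month used for each release date is a placeholder, since only the month is given here.

```python
import math
from datetime import date


def doubling_time_years(start: date, end: date, growth_factor: float) -> float:
    """Doubling time implied by constant exponential growth between two dates."""
    if growth_factor <= 1.0:
        raise ValueError("growth_factor must exceed 1 for a doubling time to exist")
    elapsed_years = (end - start).days / 365.25
    return elapsed_years * math.log(2) / math.log(growth_factor)


# Months are taken from the text; day 1 is an assumed placeholder.
# A 135% increase means the code base was multiplied by 2.35.
early_doubling = doubling_time_years(date(1998, 7, 1), date(2000, 8, 1), 2.35)
print(f"Implied doubling time: {early_doubling:.2f} years")
```

Under these assumptions the early period implies a doubling time somewhat below two years, which is in line with the "around every 2 years" estimate and with the observation that growth was fastest at the beginning.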
Size of Packages
Histograms in Fig. 3 display package sizes for Debian 2.0 and Debian 3.0 (measured in SLOC). It can be observed that the largest packages are getting larger and larger, while at the same time more and more small packages enter the distribution. It is surprising how many packages are very small (less than one thousand lines of code), small (between one and ten thousand lines) and medium-sized (between ten and fifty thousand lines of code).
A small number of large packages (over 100 KSLOC) exist, with their size increasing over time, as the sixth law of software evolution predicts (Lehman et al. 1997). Perhaps the most significant fact is that the average size of packages is relatively stable: around 30 KSLOC for Debian 4.0 and 23 KSLOC for other releases (see Table 1). Currently, we lack an authoritative explanation for this phenomenon, but we have several hypotheses. One of them is that libre software production tends to grow mainly by creating new, more specialized, smaller packages (that can be developed by a handful of developers), rather than large, complex ones (that require a large software development team). With time, some of the most successful small packages may attract more interest and developers, and start to grow. Perhaps the total mixture in Debian is so rich that, while many packages grow in size, smaller ones are included, causing the average to stay approximately constant.
Maintenance of Packages
Packages in Debian are identified by a name, a version (which should match the version of the package as defined by its original developers) and a Debian package revision number with the following format: \(\langle\)package name\(\rangle\)-\(\langle\)version number\(\rangle\)-\(\langle\)revision number\(\rangle\). For example, in Debian 4.0 the package for Mozilla’s Firefox is identified as mozilla-firefox-22.214.171.124-1, which corresponds to version 126.96.36.199, first revision of the Debian package (the revisions of the package are changes to the package specification, as described in Section 4). Except for dynamic libraries, Debian package names are rarely changed (see Footnote 6). This allows us to track packages from one release of Debian to another.
One of the main tasks of Debian maintainers is to track new versions of software packages, re-package them, and update the package descriptions accordingly. Whenever a new version of a package is released (either a major release or a minor one), it is updated and its identifier changed. This allows the assumption that if the version of a package in Debian has not changed, then the original package has not changed enough to warrant a new package version (see Footnote 7). It is also possible that the package is no longer maintained, but is still useful enough to warrant its inclusion in a distribution.
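The identifier format described above can be split into its three parts mechanically. The sketch below is illustrative only: `PackageId` and `parse_package_id` are hypothetical names, the sample identifier is invented, and the code assumes that neither the version nor the revision contains a hyphen (the package name may).

```python
from typing import NamedTuple


class PackageId(NamedTuple):
    name: str
    version: str
    revision: str


def parse_package_id(identifier: str) -> PackageId:
    """Split '<name>-<version>-<revision>' into its three components."""
    # Split from the right, because package names may themselves contain hyphens.
    parts = identifier.rsplit("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a name-version-revision identifier: {identifier!r}")
    return PackageId(*parts)


# Invented identifier, used only to illustrate the format.
example = parse_package_id("example-tool-1.4.2-3")
print(example)
```

Splitting from the right keeps hyphenated names intact, which matters when tracking a package by name from one release to the next.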
For any given pair of Debian releases we can classify packages into three sets: common (those that appear in both distributions), removed (those that are in the older distribution, but not in the newer one), and new (those that appear in the newer distribution but not in the older one). Common packages include unchanged packages, those with the same version number in both distributions.
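This classification maps directly onto set operations. The following is a minimal sketch, assuming each release is represented as a mapping from source package name to version string; the function name and the toy data are hypothetical and do not come from the studied releases.

```python
def classify_packages(older: dict[str, str], newer: dict[str, str]) -> dict[str, set[str]]:
    """Classify package names for a pair of releases (name -> version mappings)."""
    common = older.keys() & newer.keys()
    return {
        "common": set(common),
        "removed": set(older.keys() - newer.keys()),
        "new": set(newer.keys() - older.keys()),
        # Unchanged packages are the subset of common ones with identical versions.
        "unchanged": {name for name in common if older[name] == newer[name]},
    }


# Toy data, invented for illustration.
older_release = {"alpha": "1.0", "beta": "2.1", "gamma": "0.9"}
newer_release = {"alpha": "1.0", "beta": "3.0", "delta": "1.2"}
result = classify_packages(older_release, newer_release)
print(result)
```

Note that "unchanged" is computed as a subset of "common", mirroring the definition given in the text.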
Tables 2 and 3 contain some statistics of common and unchanged packages in the different distributions. To facilitate the comparison in relative and absolute terms, the Debian release that is compared is also included. For instance, Debian 2.0 has in common with itself all its (1,096) source packages.
Out of the 1,096 packages included in Debian 2.0, 721 can be found in 4.0 (common packages). This means that only around 30% of the packages in Debian 2.0 were removed by the time Debian 4.0 was released, nine years later. For comparison, the number of packages of the 3.1 release that are still present in 4.0 is 7,300, out of a total of 10,106, which gives a similar percentage of removed packages.
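The two removal percentages can be reproduced from the package counts quoted above. This is a small worked check; the helper name `removed_share` is hypothetical.

```python
def removed_share(total_in_older: int, common_with_newer: int) -> float:
    """Fraction of the older release's packages that are absent from the newer one."""
    if not 0 <= common_with_newer <= total_in_older or total_in_older == 0:
        raise ValueError("counts must satisfy 0 <= common <= total and total > 0")
    return (total_in_older - common_with_newer) / total_in_older


# Counts as quoted in the text for the comparisons against Debian 4.0.
removed_from_2_0 = removed_share(1096, 721)
removed_from_3_1 = removed_share(10106, 7300)
print(f"2.0 -> 4.0: {removed_from_2_0:.1%}, 3.1 -> 4.0: {removed_from_3_1:.1%}")
```

Both shares land close to the "around 30%" figure, even though the two comparisons span very different time intervals.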
With respect to unchanged packages, release 4.0 includes 132 with the same version number as they had in Debian 2.0. In other words, no less than 15% of the source packages included in Debian 2.0 are still the same in Debian 4.0, 9 years later.
Table 3 compares 4.0 with the previous releases. Even though a large percentage of Debian 2.0 remains unchanged in 4.0, such code is very small with respect to the current size of 4.0.
It is also important to notice that the number of files in unchanged packages, as presented in Tables 2 and 3, does not reflect the total number of unchanged files, which is higher: there are many files that do not change between Debian releases even when the version number of their package changes. Something similar can be said for the unchanged number of SLOC in those tables: it refers only to the size of packages that did not change. Outside those packages, many other files also did not change.
Table 4 shows the evolution of the most significant languages, those that account for at least 1% of code in Debian 4.0 (C, C++, Shell, Java, Perl, LISP, Python, PHP). Below that 1% mark we find, in order of their relative shares, also for Debian 4.0: Fortran, Tcl, Ada, Ruby, ML, Objective C, YACC, C#, Haskell, Expect, Awk and Modula3. The aggregation of all the Assembler code found would make it the 8th language in Debian 4.0, but has been omitted from the table.
The most used language in all Debian releases is C, with a large difference over the second, C++. However, the evolution of their shares between the first and last releases analyzed, falling from 77% to 51% in the case of C and rising from 6% to 19% for C++, tells different stories. While the relative importance of C is diminishing gradually, that of C++ and other languages is growing from release to release. It can also be noticed that, despite these trends, the absolute size of the code written in C has been growing in all releases, from about 19 MSLOC in Debian 2.0 to more than 147 MSLOC in Debian 4.0. It just happens that it is not growing as quickly as other languages.
The case of Shell, in a solid third place, has mainly to do with its presence in almost any package, of any kind. With the entry of increasingly smaller packages in the latest Debian releases, all with some Shell code, the total share of Shell is growing accordingly.
The most rapid entry in this top-8 of the languages in Debian is certainly Java, which grows from a marginal 0.5% in Debian 3.0 to 1.7% in 3.1 and 3.1% in Debian 4.0. Although it is still far from the top-three languages, it is currently in a strong fourth position. The main reason for this is the availability of large applications, such as Eclipse or Azureus. It is important to notice that the releases under study include neither the Sun Java Runtime Environment nor the Sun Java Development Kit, due to licensing issues. Although there are other Java runtimes and development kits, it is quite possible that this causes an underrepresentation of Java, since for most other languages Debian includes at least one of the usual development (compiling or scripting) environments.
To better understand the evolution of some of the top languages, Fig. 4 shows the lines of code for four of them, for each studied Debian release. In it, the decline in relative terms of C, but its growth in absolute terms, is clearly visible. A noteworthy similar case is LISP, which is the third most used language in Debian 2.0 and becomes fourth in Debian 3.1 and fifth in Debian 4.0. In contrast to these, both Shell and C++ grow significantly, accounting for a large share of all the code in Debian 4.0.
Figure 5 provides a view of the relative evolution of some programming languages, normalizing to their respective situation in Debian 2.0. The relative SLOC (vertical axis) is computed by dividing the number of SLOC in a given distribution by the number of SLOC in Debian 2.0. For example, Python has 60 times more SLOC in 4.0 than in 2.0, but C only seven times more. This plot is useful to identify some of the languages that have become more popular in the last nine years: Python, Shell, and C++ (Java is not in the figure). When this information is combined with Table 4, it is found that the growth of these languages is mainly at the expense of C and Perl.
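The normalization used for this figure is a simple division by the baseline value. The sketch below is an illustration rather than the procedure actually used to draw the plot: the function name is hypothetical, and the two C values are the rounded figures quoted earlier in the text ("about 19 MSLOC" and "more than 147 MSLOC"), so the resulting ratio is only approximate.

```python
def relative_sloc(sloc_by_release: dict[str, float], baseline: str = "2.0") -> dict[str, float]:
    """Normalize per-release SLOC counts to the count in the baseline release."""
    base = sloc_by_release[baseline]
    if base <= 0:
        raise ValueError("baseline SLOC must be positive")
    return {release: sloc / base for release, sloc in sloc_by_release.items()}


# Rounded MSLOC values for C as quoted in the text; intermediate releases omitted.
c_msloc = {"2.0": 19.0, "4.0": 147.0}
c_relative = relative_sloc(c_msloc)
print(c_relative)
```

With these rounded inputs the ratio for Debian 4.0 falls between seven and eight, consistent with the statement that C has about seven times more SLOC in 4.0 than in 2.0.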
Yet in absolute terms C has grown roughly sevenfold during this period, while the total number of SLOC has grown about tenfold. At the same time, some scripting languages (Shell, Python and Perl) have undergone extraordinary growth, all of them multiplying their presence by factors greater than seven.
In terms of SLOC, some programming languages that could be considered uncommon account for a significant share of the distribution. This is because, even though they are present in a small number of packages, these packages are large. For example, Ada accounts for a total of 576 KSLOC in Debian 3.0, but 430 KSLOC come from three packages (Gnat, an Ada compiler; libgtkada, a language binding for the GTK+ library; and Asis, a system to manage sources of Ada programs). LISP follows a similar pattern: it accounts for approximately 4 MSLOC in Debian 3.0, but 1.2 MSLOC of these come from just two packages: GNU Emacs and XEmacs.
The mean file size for most of the languages, including those with the largest shares, shows a remarkable stability from release to release (see Table 5).
This is especially noteworthy taking into account the large differences in SLOC for those languages in each release. For example, for C the mean length lies between 260 and 295 SLOC per file, whereas in C++ this value is between 140 and 196. An exception to this behavior can be observed for the Shell language, whose mean file size has tripled from Debian 2.0 to Debian 4.0. This may be because the Shell language is peculiar: almost all packages include something in Shell for their installation, configuration or as glue code. It is likely that these scripts get more complex over time, and thus grow over the years. This adds to the fact that Shell programs are seldom divided into several files: if there is more functionality, they usually just get longer.
It is also remarkable how procedural languages usually have larger average file lengths than object-oriented languages. For example, the files in C or YACC are usually larger, on average, than those in C++. This suggests that class inheritance or other characteristics of object-oriented languages are somehow reflected in shorter file sizes.
Libre software, just like any other type of software, is designed to be modular. Software reuse is particularly easy in libre software, as there are no economic constraints: most libre projects can use the results of other libre projects without having to pay for that privilege. The only requirement is that the license of the module to be used be compatible with the license of the software that wants to use it. For example, GPL-licensed software is able to use a BSD-licensed library without any extra arrangements. See Rosen (2004) for a discussion of the main libre source licenses and their compatibility.
As was explained in Section 4, these relationships can be found, in the form of dependencies, in the Debian distribution. Table 6 summarizes the sizes of the dependents, one-level dependents, dependencies, and one-level dependencies for the packages in the different releases of Debian. The number of binary packages in Debian has grown by an order of magnitude from version 2.0 to 4.0. At the same time the mean number of all dependencies has grown at a similar rate: binary packages are becoming more interrelated.
In Debian 2.0 the packages with the most dependencies had 19 (python-gdk-imlib, boot-floppies and libhdf4-altdev). In Debian 4.0 the package with the largest number of dependencies is kde, with 561, followed by gnome, with 486. kde and gnome are sets of GUI applications for the Unix desktop, neither of which is present in Debian 2.0.
Both kde and gnome are bundles of packages. In practical terms this means that they do not have any source code associated: when these packages are installed, the bundle is installed. This raises three noteworthy issues: first, from the point of view of the user installing such bundles, these collections of packages operate as a single software product; second, it can be argued that these packages inflate the average number of dependencies without adding any new source code themselves; and third, they can be considered a great demonstration of the power of component-oriented software engineering, where a “new” application, the bundle, can be created from many components without writing a single line of code.
As the number of dependencies of packages evolves, their dependency graphs are likely to change too. For example, Fig. 6 shows the pIDG of mozilla in Debian 4.0, which can be compared to its dependency graph in Debian 2.2, depicted in Fig. 7. Mozilla required 13 packages in 2.2 (the first version of Debian to include it), and 72 in 4.0. This growth is expected as applications evolve and grow to satisfy newer requirements.
With respect to the number of subordinates of a package, the story is different. In this case, the median is zero, meaning that most packages do not have any subordinates. Yet their average number keeps growing at a rate similar to the growth of the number of packages in the distribution. This implies that the number of subordinates of some packages is growing very fast (a small portion of packages are being used by a very large number of packages). For example, in Debian 2.0, perl has a total of 118 subordinates, but in Debian 4.0 it has 11,459. It is also not surprising that the packages with the largest number of subordinates are libraries (such as libc6, the GNU C library, which has the largest number of subordinates in every release of Debian), interpreters (such as perl) or utilities (such as binutils and sed). The number of potential subordinates can serve as a good indicator of the success of a library: the more binary packages that depend on it, the more successful it is. Tables 7 and 8 show the evolution of the size of the dependencies and subordinates of selected applications.
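The difference between the dependencies and the subordinates of a package can be made concrete with a small graph traversal. This is a minimal sketch under stated assumptions: the dependency data is a toy graph with invented package names, the helper names are hypothetical, and the traversal simply follows declared dependencies transitively, which may differ in detail from the procedure described in Section 4.

```python
from collections import defaultdict


def reachable(graph: dict[str, set[str]], start: str) -> set[str]:
    """All nodes reachable from start by following edges (start itself excluded)."""
    seen: set[str] = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    seen.discard(start)  # ignore cycles that lead back to the starting package
    return seen


def invert(graph: dict[str, set[str]]) -> dict[str, set[str]]:
    """Reverse every edge: maps each package to the packages that depend on it."""
    inverted: dict[str, set[str]] = defaultdict(set)
    for package, deps in graph.items():
        for dep in deps:
            inverted[dep].add(package)
    return dict(inverted)


# Toy graph with invented names: package -> packages it directly depends on.
depends = {
    "app-a": {"libfoo", "interp"},
    "app-b": {"libfoo"},
    "libfoo": {"libbase"},
    "interp": {"libbase"},
    "libbase": set(),
}

dependencies_of_app_a = reachable(depends, "app-a")
subordinates_of_libbase = reachable(invert(depends), "libbase")
print(sorted(dependencies_of_app_a), sorted(subordinates_of_libbase))
```

In this toy graph the base library has no dependencies of its own but is a (transitive) prerequisite of every other package, which mirrors the skew described above: most packages have no subordinates, while a few have very many.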
In Subsection 5.2 it was highlighted how many of the newer applications are very small. It is now possible to argue that applications can be smaller because there are more packages, including libraries, available, upon which they can depend and reuse. In other words, applications can be smaller, but at the same time they can be more powerful.