Keywords

1 Introduction

Two different researchers in genetic epidemiology (more on that field later) write two equally good manuscripts that describe an experiment with computational steps. Both manuscripts are accepted by an equally prestigious journal after peer review. One researcher, however, does not supply the computer code (from now on: ‘code’) that was used to generate the results, whereas the other does. Are the conclusions of these papers to be trusted equally? Is this difference relevant and worth the effort? How common is it to share code, and, if shared, how can it be preserved? This chapter discusses why code, a type of paradata, should be supplied and what features it needs to have for it to be useful.

The concept of paradata (although not named as such yet) was introduced by Couper and colleagues, who developed a computer-assisted interview program that, among other things, records all keystrokes, measures the time used to answer each question, and even the time the monitor is turned off (Couper, 1998). The goal in that context was to assess and improve survey quality.

There exists no standard definition of paradata (Nicolaas, 2011; Sköld et al., 2022; Huvila, 2022). Without declaring it a formal definition, however, paradata can be understood in a general sense as ‘data about processes’ (with e.g. survey paradata alternatively termed records and audit trails) (Nicolaas, 2011), or in a more specific sense, in relation to data collection, as ‘data about the data collection process’ (Choumert-Nkolo et al., 2019). This chapter uses the latter as a working definition.

The code used in computational experiments is exactly that ‘data about the data collection process’, as it commonly downloads data, selects relevant subsets of those data, performs statistical tests, and generates figures. In each case, code answers the question: ‘Where does it (i.e. the result) come from?’. Code, hence, is paradata, and this chapter explores and illustrates the consequences of that premise.

A recent paper explores the use of paradata to increase the impact of data, stating that a lack of paradata can be seen as ‘a drastic constraint’ on the use of data, and offers some suggestions to make paradata useful for data re(use): Paradata should be comprehensive, documented in a useful way, the documentation and data should have co-evolved, and the paradata should be computer-friendly (Huvila, 2022). At first glance, these suggestions appear to work well for code and will be discussed in detail below.

There are multiple reasons why useful paradata matters. Most obvious is that having useful paradata gives an understanding of how data is produced. This knowledge helps researchers from different fields to understand each other and collaborate. Additionally, useful open data is needed for Open Science to convincingly show its benefits. Finally, in computational fields, it can help understand how scholarly knowledge is produced (Huvila, 2022).

This chapter discusses these general reasons applied to code and its implications for knowledge management, including recommendations on how to make code useful paradata in Sect. 6 (Fig. 1).

Fig. 1

Left: general relation between data, paradata, and metadata. Right: the same relations specified for genetic epidemiology. Input data: genetic data, phenotypic data, and associations found in earlier studies. Results: associations between genetic data and phenotypic data. Code: the computer code used in a computational experiment. Paper: the scholarly paper describing the (code of the) experiment

2 Code Availability

To determine how useful code is, it needs to be available. However, it is not common to publish the code of an experiment or analysis (Stodden, 2011; Read et al., 2015) (with a pleasant exception being Conesa & Beck, 2019). For example, in computer graphics, a field intimately familiar with computer code, 42% of 454 SIGGRAPH papers supply computer code (Bonneel et al., 2020). Another study analysed the reproducibility of registered reports in the field of psychology, where 60% of 62 studies supplied the code to redo the analysis (Obels et al., 2020). For articles in Science magazine, 12% of 204 studies published the code (Stodden et al., 2018) (note that these were from the years 2011–2012). Unpublished code has led to some disheartening examples, such as an algorithm that detects breast cancer from images better than a human expert, yet has never been reproduced (Haibe-Kains et al., 2020).

When the code of an academic paper is not published, one could contact the corresponding author and request it. However, the response rate of corresponding authors to a request for data in computational fields is around 50% (Manca et al., 2018; Stodden et al., 2018; Teunis et al., 2015) (the field of emergency medicine seems to be a pleasant outlier, with 73% of 118 emails receiving a reply (O’Leary, 2003)). When a corresponding author does reply, only 48% of the 134 replied emails actually result in the code being shared (Stodden et al., 2018). The responses of unwilling authors (see Stodden et al., 2018 for some real examples) can come across as so caustic that one may be excused for not contacting a corresponding author at all.

3 Genetic Epidemiology

This chapter uses genetic epidemiology as a specific example, to illustrate in which way code is paradata, and why this type of paradata is relevant. However, any field that uses computation and sensitive data in its experiments could be used as an example.

Genetic epidemiology is a field within biology that tries to determine the spread of heritable traits and their underlying biological mechanisms. For example, we know that lactose intolerance in adults is caused by a decline in the production of lactose-degrading enzymes and is most commonly found in South-East Asia and South Africa (Storhaug et al., 2017). The trait is caused by the genetic make-up, or ‘genotype’, of a person. The trait, also called ‘phenotype’, in this example is lactose intolerance at adult age, yet any human property, such as weight or height, can be studied. When an association between genotype and phenotype is found and is relevant enough, it is used to create a so-called gene panel, in which the location of the gene causing the association is measured specifically, to detect people at risk for the associated phenotype. The rest of this section describes a genetic epidemiology study in more detail, with a special focus on the computational experiment.

The example study followed here is a pseudorandomly selected paper by Ahsan et al. (2017). The primary data used by that paper come from a population study called the Northern Swedish Population Health Study (NSPHS) that started in 2010 (Igl et al., 2010). The approximately 1000 participants were initially mostly surveyed about lifestyle (Igl et al., 2010), and follow-up studies provided the types of data relevant for this paper, which are (1) the genotypes (Johansson et al., 2013) and (2) the phenotypes, in this case concentrations of certain proteins in the blood (Enroth et al., 2014, 2015).

The first type of primary data, the genotypes, consists of single nucleotide polymorphisms (SNPs, pronounced ‘snips’). An SNP has a name and a location within the genome. At the SNP’s location in the DNA, there will be two nucleotides. One of these nucleotides is inherited from the mother, the other from the father. The DNA consists of billions of nucleotides. There are four types of nucleotides, containing the bases adenine, cytosine, guanine, and thymine, commonly abbreviated as A, C, G, and T, respectively.

One SNP example is rs12133641, which is an SNP located at position 154,428,283 (that is, around the 154 millionth nucleotide), where 67% of the people within this study have an A, and 33% have a G (also from Ahsan et al., 2017, Table S3). From this it follows (assuming the nucleotides are inherited independently) that 45% of subjects have the genotype AA, 44% have AG, and 11% have GG.
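As an illustration, these genotype percentages follow directly from the allele frequencies under the independence assumption. A minimal sketch in R, using only the numbers quoted above:

```r
# Allele frequencies of SNP rs12133641 as reported in the study
freq_a <- 0.67 # frequency of the A nucleotide
freq_g <- 0.33 # frequency of the G nucleotide

# Assuming independent inheritance, the expected genotype frequencies are:
freq_aa <- freq_a * freq_a     # about 0.45, i.e. 45% of subjects have AA
freq_ag <- 2 * freq_a * freq_g # about 0.44, i.e. 44% have AG
freq_gg <- freq_g * freq_g     # about 0.11, i.e. 11% have GG

c(AA = freq_aa, AG = freq_ag, GG = freq_gg)
```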

The second type of data, the phenotypes, are concentrations of proteins in the blood. The nucleotides of the DNA contain the code for building proteins, as well as the rate at which a protein is created (for the sake of simplicity, it is assumed that such a rate is constant, yet, in practice, there are complex regulation mechanisms). Some proteins end up in the blood, and their presence can be used to assess the health of an individual. IL6RA is one such protein, and its concentration may (and will, see below) be associated with the SNP mentioned earlier.

The field of genetic epidemiology looks, among others, for correlations between genetic data and biological traits. For example, Ahsan and colleagues show that SNP rs12133641 is strongly associated (p-value of \(3.0 \times 10^{-73}\), for n = 961 individuals) with protein IL6RA (Ahsan et al., 2017). The direction of the association can also be concluded: The more guanines are present at that SNP’s location, the higher the concentration of IL6RA found in a person’s blood. The amount of variance that can be explained by an association (i.e. the \(R^2\)) is rarely 100%, which means that a trait (in this case, the concentration of IL6RA) cannot be perfectly explained from the genotype (in this case, SNP rs12133641) alone. In this example, as much as 43% of the variance can be attributed to an individual’s genotype. Additional factors, such as the effect of the environment (e.g. geographic location, time of day the measurement was done), lifestyle (e.g. smoking yes/no), or having a disease (e.g. diabetes), are needed to explain the additional variation.
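Such an association is typically estimated by regressing the phenotype on the genotype. The sketch below is not the code of Ahsan et al. (2017); it only illustrates, on made-up data, what such a model looks like in R (all variable names and effect sizes are hypothetical):

```r
set.seed(42)
n <- 961 # number of individuals, as in the example above

# Hypothetical data: the genotype coded as the number of G nucleotides (0, 1, or 2)
# and a protein concentration that partly depends on it
genotype <- sample(0:2, size = n, replace = TRUE, prob = c(0.45, 0.44, 0.11))
concentration <- 2.0 + 0.8 * genotype + rnorm(n)

# Linear model: does the genotype explain the protein concentration?
fit <- lm(concentration ~ genotype)
summary(fit) # reports, among others, the p-value and the R^2 of the association
```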

The conclusions drawn from such a study may end up in the clinic. For the sake of having a clear (yet fictitious) example, let us assume that a high level of IL6RA is associated with a disease that develops later in life, yet is preventable by lifestyle changes (see Pope et al., 2003 for an example in Alzheimer’s disease). Were this the case, we could create a tailored experiment, called a gene panel, that specifically measures SNP rs12133641. If the gene panel shows an individual has two guanines, we know that this person is likelier to develop higher levels of IL6RA and is likelier to benefit from the lifestyle changes.

In this simple example, it would be easier to measure the level of IL6RA in the blood directly than to use a gene panel, as blood tests are simpler and cheaper. However, associations have been published for many diseases in which one SNP (e.g. phenylketonuria) or many SNPs (e.g. Bruce & Byrne, 2009) contribute to the likelihood of developing a disease in the future. Here, the phenotype (having a disease in the future) is impossible to detect at present, and associations found in earlier studies are used to create a gene panel. As creating a gene panel is costly, those associations had better be correct.

Additionally, there is an interdependency of scholarly findings here: The SNP received its name based on an earlier computational experiment, which concluded the usefulness of that SNP based on assumed DNA sequences. DNA sequences, however, are (nowadays) not simply read; instead, they are the result of a complex computational analysis, with its own dedicated field of research. Both studies assumed a correctly calculated DNA sequence, which means that if the DNA sequence analysis contained a software bug, this study may be invalidated. Additionally, the result of this study may be used in follow-up studies that assume the result to be correctly calculated: One paper’s conclusion is the next paper’s assumption.

4 Why Code Is Useful Paradata

The experiment described above is run by code. It was code that detected the relationship between the genotype (in this case, SNP rs12133641) and the phenotype (in this case, the concentration of IL6RA). To be more precise, it was code that read the data, subsetted the data, removed outliers, performed the statistics, and generated the plots. For the rest of the discussion, we assume that the code is available to us (if not, see Sect. 2 for a glum estimate of the chance of obtaining the code).
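To make this concrete, such an analysis often boils down to a short script. The sketch below is not the actual code of the study; it only illustrates, with hypothetical file and column names, the steps just listed: reading the data, subsetting it, removing outliers, performing the statistics, and generating a plot.

```r
# Hypothetical file and column names, for illustration only
genotypes  <- read.csv("genotypes.csv")  # one row per individual, one column per SNP
phenotypes <- read.csv("phenotypes.csv") # protein concentrations per individual

# Subset: keep only the individuals present in both datasets
data <- merge(genotypes, phenotypes, by = "individual_id")

# Remove outliers: here, concentrations more than 3 standard deviations from the mean
z <- as.vector(scale(data$il6ra_concentration))
data <- data[abs(z) < 3, ]

# Perform the statistics
fit <- lm(il6ra_concentration ~ rs12133641_dosage, data = data)
print(summary(fit))

# Generate a figure
plot(
  data$rs12133641_dosage, data$il6ra_concentration,
  xlab = "Number of G nucleotides (rs12133641)",
  ylab = "IL6RA concentration"
)
abline(fit)
```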

There are multiple reasons why (useful) code matters, and these are the same reasons why useful paradata matters: Code gives an understanding of how the raw results and the subsequent scholarly knowledge are obtained from an experiment. Additionally, code helps researchers from different fields to understand each other and collaborate. Furthermore, code helps Open Science reach its goals of openness and transparency. At the core of these reasons lies reproducible science: That any person in any field can redo a computational experiment and see exactly what happened.

For computational science, it may appear to be relatively easy to reproduce an experiment, as all it takes is a computer, electricity, an optional Internet connection, the code, and the data. In practice, however, only 18% of 180 computational studies are easily reproducible (Stodden et al., 2018). To some, it appears that the academic culture of reproducing results has been lost over time (Peng, 2011), with labs that embrace reproducibility (for example Barba, 2016) being the exception. One suggested way forward is to make the reproduction of research a minimal requirement for publication (Peng, 2011).

A genetic epidemiologist works with sensitive data as well: The genetic sequences of participants are private (Clayton et al., 2019). For research to be reproducible, one needs both the code and the data to reproduce the (hopefully) same results. This problem is discussed in Sect. 7.

Code holds the ground truth of an experiment; it does the actual work. The more complex the computation pipeline is, the easier it is to have a mismatch between the article (that describes what the code does) and the code (that actually does the work). The moment these two disagree, it is the code that is true.

5 Preserving Code

Code is rarely preserved (Barnes, 2010). This section discusses the preservation of code for a short, medium, and long term.

5.1 Code Hosting

To preserve code for the short term, a code hosting website is a good first step. A code hosting website is a website where its users can create dedicated pages (called ‘repositories’) for a project, upload code, and interact with that code. There are multiple code hosting websites, with GitHub being the most popular one (Cosentino et al., 2017). The use of code hosting websites has increased strongly (Russell et al., 2018); they facilitate collaboration (Perez-Riverol et al., 2016) and improve transparency (Gorgolewski & Poldrack, 2016), due to their inherent computer-friendliness. See Cosentino et al. (2017) for an extensive overview of research conducted on GitHub.

Hosted code commonly keeps a history of file changes. This means that when a change is made to the code, a new version is created. If a change turns out to be harmful, one can go back to an earlier version and continue from there. The version control system keeps track of who did what, transparently. It is a general recommendation to put version control on all human-produced data (Wilson et al., 2014), as well as to work openly on the code from the start (Jiménez et al., 2017). Half of all published code has such a version control system (Stodden et al., 2018).
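For completeness, this is roughly what building such a history looks like with Git, the version control system behind GitHub (the file name and commit messages are made up):

```sh
git init                            # start tracking the project
git add analysis.R                  # stage the analysis script
git commit -m "Add first version of the analysis"
# ... edit analysis.R ...
git add analysis.R
git commit -m "Remove outliers before fitting the model"
git log --oneline                   # shows who changed what, and when
```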

The degradation of software has been a known phenomenon for nearly four decades. It is called ‘bit rot’ by Steele Jr. et al. (1983) and ‘software collapse’ by Hinsen (2019): Software fails because the other software it depends on changes. Using a code hosting website, which only passively stores code, does not address this problem.

5.2 Continuous Integration

To preserve code for a longer timespan, it needs to be embraced that software degrades (Beck, 2000). Continuous integration (‘CI’) allows one to verify whether code still works and to be notified if it does not.

Some code hosts allow a user to trigger specialized code upon uploading a change, called a CI script. Such a CI script typically builds and tests the code. This practice is known to significantly increase the number of bugs exposed and the speed at which new features are added (Vasilescu et al., 2015). CI can be scheduled to run on a regular basis and notify the user directly when the code has broken down.
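As an illustration, a minimal GitHub Actions configuration that runs an analysis script on every change and once a week could look as follows (the file path, script name, and schedule are assumptions for this sketch, not taken from any study discussed here):

```yaml
# .github/workflows/check.yaml
name: check
on:
  push:
  schedule:
    - cron: "0 0 * * 1" # also run every Monday, to detect software decay
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: r-lib/actions/setup-r@v2
      - name: Run the analysis
        run: Rscript analysis.R
```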

5.3 Containerization

To preserve code for the longest time, both code and its dependencies can be put into a so-called container. The most reproducible way of submitting the code of an experiment is to put all code, with all its (software) dependencies, in a file that acts as a virtual computational environment, called a ‘virtual container’ (from now on: ‘container’). Such a container is close to the gold standard of reproducible research as suggested by Peng (2011).
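As a sketch, such a container for an R-based experiment can be described in a Dockerfile like the one below (the R version, package, and script name are illustrative assumptions):

```dockerfile
# Pin an exact R version, so the computational environment does not change over time
FROM rocker/r-ver:4.2.2

# Install the packages the analysis depends on
RUN Rscript -e 'install.packages("testthat")'

# Copy the code of the experiment into the container
COPY analysis.R /experiment/analysis.R

# Run the experiment when the container is started
CMD ["Rscript", "/experiment/analysis.R"]
```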

6 Making Code Useful Paradata

Useful paradata, in general, is (1) comprehensive, (2) documented appropriately, (3) documented in co-evolution with the data, and (4) friendly to computers (Huvila, 2022). In this section, these ideal properties of paradata are applied to code. However, code has multiple uses, as code can be used to (1) reproduce, (2) replicate, or (3) extend a computational experiment (Benureau & Rougier, 2018). Depending on the intended use of the code, there are different requirements for code being useful paradata.

6.1 Code Must Be Usefully Documented

For code to be ideal paradata, it must be usefully documented (Huvila, 2022). For the purpose of reproducing an experiment, it should at least be documented how to run the code and what it ought to do. Although this may seem obvious, only 57% of the 56 science papers with obtainable code (out of 180 papers in total) do so (Stodden et al., 2018).

When it is clear how an experiment is run, it is possible (even ideal!) that no code needs to be read at all. Within that context, it could be argued that no (further) documentation is needed. However, all code should in general be documented ‘adequately’ (Peng et al., 2006), ideally by writing code in such a way that it becomes self-explanatory (Wilson et al., 2014) and by documenting, for the remaining code, the reasons behind it, its design, and its purpose (Wilson et al., 2014).

For purposes of extending a study and its code, documentation becomes even more important, as the code will be read and modified. The extent of investing time in documenting code is recommended to be proportional to the intended reuse (Pianosi et al., 2020), and there exists a clear relationship between the reuse of code and its documentation effort (Cosentino et al., 2017; Hata et al., 2015).

6.2 Code and Documentation Must Align

For code to be ideal paradata, its creation and documentation need to align, not least because they shape each other (Huvila, 2022). The notion that code and its documentation shape each other resonates strongly with the idea of literate programming (Knuth, 1984), in which documentation and code are developed hand in hand. Literate programming is the practice of writing code and documentation in the same file. Contemporary examples of this idea are, among others, vignettes (Wickham, 2015) for the R programming language (R Core Team, 2021) and Jupyter notebooks (Wang et al., 2020) for the Julia (Bezanson et al., 2017), Python (Van Rossum & Drake Jr., 1995), and R (R Core Team, 2021) programming languages. rOpenSci, a community that, among others, reviews programming code (Ram, 2013; Ram et al., 2018), is one example where extensive documentation is mandatory and all code must have examples (that are actually run) as part of the documentation (rOpenSci et al., 2021). In general, when developing software, it is recommended to write documentation while writing software, as well as to include many examples (Lee, 2018), as this leads to both better code and better documentation (Reenskaug & Skaar, 1989).
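In R, for example, the roxygen2 comment conventions used by rOpenSci keep a function and its documentation (including a runnable example) in the same file, so the two can evolve together. A minimal sketch, around a hypothetical function:

```r
#' Calculate expected genotype frequencies from an allele frequency
#'
#' Assumes the two nucleotides are inherited independently.
#' @param freq_a frequency of the first allele, a value between 0 and 1
#' @return a named vector with the frequencies of the AA, AB, and BB genotypes
#' @examples
#' genotype_frequencies(0.67)
genotype_frequencies <- function(freq_a) {
  freq_b <- 1.0 - freq_a
  c(AA = freq_a^2, AB = 2 * freq_a * freq_b, BB = freq_b^2)
}
```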

6.3 Code Must Be Extensive

For code to be ideal paradata, it must be extensive. As code has many properties, there are many recommendations on this aspect.

Code should be distributed in standard ways (Peng et al., 2006), as is done by using a code hosting website (see Sect. 5.1). Additionally, code must be more extensive when it is (intended to be) used on different data, as then ‘code must act as a teacher for future developers’ (Sadowski et al., 2018). Error handling is one of the mechanisms to do so. In genetic epidemiology, it is common to have incomplete or missing data, so analyses should take this into account with clear error messages.
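A minimal sketch of such an informative check in R (the function name and the wording of the message are illustrative):

```r
# Refuse to run on incomplete phenotype data, with a message that explains what to do,
# instead of silently producing a misleading result
check_phenotypes <- function(concentrations) {
  if (anyNA(concentrations)) {
    stop(
      "The phenotype data contain ", sum(is.na(concentrations)),
      " missing values. Remove or impute these before running the analysis."
    )
  }
  invisible(concentrations)
}
```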

Coding errors are extremely common (Baggerly & Coombes, 2009; Vable et al., 2021) and contribute to the reproducibility crisis in science (Vable et al., 2021). Testing, in general, is an important mechanism to ensure the correctness of code. One clear example is Rahman and Farhana (2020), showing bugs in scientific software related to the COVID-19 pandemic.

Testing is so important that it is at the heart of a software development methodology called ‘Test-Driven Development’ (‘TDD’), in which tests are written before the ‘real’ code. TDD improves code quality (Alkaoud & Walcott, 2018; Janzen & Saiedian, 2006), and it is easy to integrate the writing of documentation as part of the TDD cycle (Shmerlin et al., 2015).
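A minimal example of such a test, written with the testthat R package against the hypothetical genotype_frequencies() function shown earlier; in TDD, a test like this is written before the function itself:

```r
library(testthat)

test_that("genotype frequencies sum to one and match the worked example", {
  freqs <- genotype_frequencies(0.67)
  expect_equal(sum(freqs), 1.0)
  expect_equal(freqs[["AA"]], 0.67^2)
  expect_error(genotype_frequencies("not a number"))
})
```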

The percentage of (lines of) code tested is called the code coverage. Code coverage correlates with code quality (Horgan et al., 1994; Del Frate et al., 1995), and, due to this, having a code coverage of (around) 100% is mandatory to pass a code peer review by rOpenSci (Ram et al., 2018). When CI is activated, the code coverage of a project can be shown on the repository’s website.
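In R, for example, the covr package can measure this percentage locally, assuming the code is organized as a package:

```r
library(covr)

# Run the package's tests and record which lines of code they execute
coverage <- package_coverage()
percent_coverage(coverage) # the percentage of lines covered by the tests
```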

It is considered good practice to add a software license (Jiménez et al., 2017), so that it is clear that the software can be reused. Although this may seem trivial, only two-thirds of 56 computational experiments supply a software license (Stodden et al., 2018).

Code reviews are recommended by software development best practices (Wilson et al., 2014). However, more than half of 315 scientists have their code rarely or never reviewed (Vable et al., 2021), although code reviews are known to accelerate the learning of developers, improve the quality of the code, and result in an experiment that is likelier to be reproducible (Vable et al., 2021).

6.4 Code Must Be Computer-Friendly

The most reproducible way of submitting the code of an experiment is by providing the code with all its (software) dependencies in a container. Containers allow a computational experiment to be highly reproducible: Given the same data, an experiment put into a container will give the same results on different platforms, at least in theory. In practice, differences may be observed when peripheral factors differ, such as the random numbers generated by an operating system, or data that are downloaded from online (and hence, probably changing) sources.

For paradata to be useful, it has to be computer-friendly, yet “the best paradata does not necessarily look like ‘data’ at all for its human users” (Huvila, 2022). There are features of code that humans find useful, without directly being able to measure these. In the end, code is just ‘another kind of data’ and should be designed as such, for example by using tools to work on it (Wilson, 2022).

A first example is to use a tool to enforce a coding style [e.g. the Tidyverse style guide (Wickham, 2019) for R, or PEP 8 (Van Rossum et al., 2001) for Python], as following a consistent coding style improves software quality (Fang, 2001). A second example is to use a tool to enforce a low cyclomatic complexity. The cyclomatic complexity is approximately defined as the number of independent paths through which the code can be executed. The cyclomatic complexity correlates with code complexity, and more complex code is likelier to contain or give rise to bugs (Abd Jader & Mahmood, 2018; Chen, 2019; Zimmermann et al., 2008).
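In R, for instance, the styler package can enforce the Tidyverse style and the lintr package can flag, among others, functions whose cyclomatic complexity exceeds a chosen limit. A minimal sketch, assuming the analysis lives in a script called analysis.R:

```r
# Reformat the script to follow the Tidyverse style guide
styler::style_file("analysis.R")

# Flag style issues, including functions with a cyclomatic complexity above 10
lintr::lint(
  "analysis.R",
  linters = lintr::linters_with_defaults(lintr::cyclocomp_linter(complexity_limit = 10))
)
```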

7 Sensitive Data

Next to the code, it is the data used in an experiment that must be made available for an experiment to be called ‘reproducible’ (Peng et al., 2006). In some fields, such as genetic epidemiology, the data are sensitive and hence cannot be released, so one cannot reproduce an experiment. To solve this problem in the future, some interesting methods are being developed to run code on sensitive data with assured privacy (Zhang et al., 2016; Azencott, 2018).

To alleviate the problem today, a developer should supply a simulated [also called ‘analytical’ (Peng et al., 2006)] dataset together with the code. These simulated data are needed to run tests, as part of the TDD methodology. In the case of genetic epidemiology, this would mean simulated genotypes and associated phenotypes, as can be done with the plinkr R package (Bilderbeek, 2022). One extra benefit of simulated data is that these can be used as a benchmark, as slightly different analyses should give similar conclusions.
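The sketch below is not the plinkr interface; it only illustrates, in base R, the idea of simulating genotypes and an associated phenotype that can be shared and tested against without privacy concerns:

```r
set.seed(314)
n_individuals <- 100

# Simulated genotypes: the number of G nucleotides (0, 1, or 2) per individual,
# drawn using the allele frequency from earlier in this chapter
genotypes <- rbinom(n_individuals, size = 2, prob = 0.33)

# Simulated associated phenotype: a protein concentration that increases
# with the number of G nucleotides, plus random noise
concentrations <- 2.0 + 0.8 * genotypes + rnorm(n_individuals)

simulated_data <- data.frame(genotype = genotypes, concentration = concentrations)
head(simulated_data)
```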

8 Discussion

In a perfect world, all code has the characteristics of ideal paradata and is written following software development best practices. This section discusses the problems that arise in doing so.

To know these best practices, one needs to be trained. Articles that suggest these best practices (such as this one) claim that this initial investment pays off. Code reviews are a good way to accelerate the learning of team members (Vable et al., 2021).

Code needs maintenance, as code that will stand the test of time perfectly is deemed ‘impossible’ (Benureau & Rougier, 2018). CI can help a maintainer to be notified when the code breaks, whereas the use of containers may slow down this decay, as an entire computational environment is preserved.

Uploading code, preferably to a code hosting website, may feel like a risk, as all code can be seen and scrutinized. However, not publishing code may also put one in the focus of attention and, after much effort by others to reproduce an incorrect result, at the cost of a scientific career (Baggerly & Coombes, 2009).

When the author of code can be contacted, there will be users asking for technical support. One solution for the author is to ignore such emails, as is done in a third of 357 cases (Teunis et al., 2015): It can be argued that no energy should be spent on already published code and that one should work on something new instead. However, see Barnes (2010) for a better way to deal with this problem.

When the author of code can be contacted, users will send in bug reports. If a bug is severe enough, the question arises whether all the research that uses that code still results in the same conclusions. One such bug is described in Eklund et al. (2016), with 40,000 studies using that incorrect code. These reports, too, could be ignored in order to work on something new.

Containers do have problems. First, they themselves require software to run, and that software is subject to the same decay. Second, one needs to install that software and have the computer access rights (i.e. admin rights under Windows, or root rights under Linux) to do so. Third, one needs to learn how to build and use containers. Lastly, containers can be files of several gigabytes, which makes their distribution harder. Ideally, containers are stored online and distributed in standardized ways. Although progress is being made, there is no way to do so for all container types. Additionally, probably due to their novelty, container hosting sites lack metadata.

9 Conclusions

This chapter started with some suggestions to make paradata useful for data re(use): Paradata should be extensive, comprehensively documented, with the creation of documentation and code going hand-in-hand, as well as friendly to computers (Huvila, 2022).

Before applying these features, the first step is to publish the code. When applying these general recommendations to code, this list can be phrased more precisely:

  1. Code should be comprehensive in supplying automatically generated metadata (such as commit history and code coverage).

  2. The documentation should be as extensive as recommended by the software development literature.

  3. The documentation should have co-evolved with the code, following the best practices in literate programming.

  4. Code should be made machine-readable by, at least, being uploaded to a code hosting website. Ideally, the code is checked by CI and put in a container.

For the preservation of code, these recommendations are made:

  1. Uploading code to a code hosting website is better than not publishing code at all.

  2. Adding CI to code allows one to detect the day when that code no longer runs.

  3. Putting the code in a container is the best way to preserve code.

When research truly needs to be reproducible, putting the code of an experiment into a container is today’s best solution, as containers keep code running for the longest amount of time. Creating such a container, however, requires more skill and is, as of today, not rewarded, although an experiment put into a container can be considered the pinnacle of reproducible research.

The simplest and most impactful step to make code more useful paradata is, however, to publish it on a code hosting website along with a publication. From then on, the next steps can be taken gradually as the skills of the author(s) progress. To quote Barnes (2010): Publish your code; it is good enough.

The world of science would be more open, humble, trustworthy, truthful, and helpful if the code that accompanies a scientific paper were given the attention it deserves and treated like a first-class citizen. Doing so, however, is not yet rewarded, and both of the two scientists at the start of this chapter can still provide a good rationale for their behaviour. This will change when incentives are put into place that reward making paradata useful. For code specifically, in any computational field, the rewards are even higher, as reproducibility should again become a cornerstone of science.

10 Data Accessibility

This chapter and its metadata can be found at https://github.com/richelbilderbeek/chapter_paradata.