1 Introduction

Currently, software development relies on integrating existing features by writing client code that interfaces with Application Programming Interfaces (APIs) (Nybom et al., 2018). To do so, developers increasingly rely on proper, complete documentation (Broy, 2022), which aids them during evolution and maintenance activities and when reusing libraries and external components (Aghajani et al., 2019).

Software documentation often has insufficient and inadequate content, obsolete or ambiguous information, and unexplained examples (Aghajani et al., 2019; Broy, 2022). This is concerning since poor documentation affects software maintenance over time, leading to technical issues, limited reusability (Groher and Weinreich, 2017), and inconsistent and incomplete documentation (Aghajani et al., 2020). Nevertheless, the production and maintenance of up-to-date software documentation continue to be neglected (Treude et al., 2020).

Documentation research has focused mainly on Object-Oriented Programming (OOP) languages producing commercial/traditional software (Monperrus et al., 2012; Tan, 2015; Blasi et al., 2017; Stulova et al., 2020). The difference between ‘traditional’ and ‘scientific’ software is not caused by the age or name of the programming language but by the purpose of the software itself, including who works on the project (fewer people and more junior developers (Milewicz et al., 2019)), the money invested in its maintenance (Ahalt et al., 2014), and the development lifecycle (Pinto et al., 2018).

The lack of research in scientific software documentation contributes to the ‘gap’ expressed by Storer (2017): “the ‘chasm’ between Software Engineering (SE) and scientific programming is a serious risk to the production of reliable scientific results.” This risk arises because scientific software is generally coded in package-based environments requiring the interfacing of multiple components (Howison et al., 2015). Likewise, research-intensive software packages are generally highly specialised and targeted to specific research niches, effectively requiring more in-depth documentation to be appropriately selected and used (Königstorfer and Thalmann, 2021).

The above generates two issues. First, if only the core maintainers understand the package due to poor documentation (Groher and Weinreich, 2017), it will likely be discontinued rather than subsequently updated. Second, producing quality documentation requires understanding its patterns of knowledge, namely what knowledge it contains and how it is organised (Maalej and Robillard, 2013).

Currently, recommendations of patterns of knowledge for scientific software documentation are limited to incomplete, high-level descriptions that provide no grounded guidance on what or how to document software. Moreover, prior studies demonstrated that poor R package documentation might lead to incorrect usage, affecting all the code depending on it and threatening the validity of experiments and analyses that eventually use these packages (Codabux et al., 2021).

In particular, R originated as a special-purpose language (Ihaka, 2017) with extensive features for statistical analysis, whose developers rely more heavily on its functional features (Hornik, 2012; Vidoni, 2021a). R is more susceptible to poor documentation than other languages because most R package contributors are less likely to be software engineers by trade (German et al., 2013), and only a few apply sound software engineering practices during development (Morandat et al., 2012; Pinto et al., 2018; Vidoni, 2021b).

Therefore, we chose to work only with R, both to keep the scope of this work manageable and because of three motivators: 1) R is dynamically typed (meaning that a variable can take multiple types at different moments, without reserved words for types (Hornik, 2012; Korkmaz et al., 2018), forcing developers to ‘guess’ what types to pass), 2) R blends OOP and functional programming (thus rendering current OOP-exclusive taxonomies inapplicable), and 3) R documentation has been acknowledged as an ongoing issue (Codabux et al., 2021; Wickham, 2019).
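For illustration, consider the following minimal snippet (constructed for this paper; summarise_scores is a hypothetical function), which shows both the absence of type declarations and the named-argument invocation:

    # A variable may hold different types at different moments; R has no
    # reserved words for types.
    x <- 42L                 # integer
    x <- "forty-two"         # character
    x <- c(TRUE, FALSE)      # logical vector

    # Signatures disclose no types, so callers must infer acceptable types
    # from the documentation; arguments can be passed by name in any order.
    summarise_scores <- function(scores, na.rm = TRUE) {
      mean(scores, na.rm = na.rm)
    }
    summarise_scores(na.rm = FALSE, scores = c(1, 2, NA))  # returns NA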

For this purpose, we mined 379 R packages from GitHub, including popular and newer packages maintained since 2019. The documentation was extracted from source code files, providing 8,670 comments with over 860,000 lines of package documentation. We performed card sorting on three critical elements of the documentation (parameters, returns, and descriptions) and produced an initial taxonomy of what information should be included in the documentation of R packages regarding these elements.

As a result, this paper contributes to the need for better documentation standards in R programming (Monperrus et al., 2012) by extending prior work in library documentation to the R domain. Our other contributions are:

  • The first study conducted to explore and understand R package documentation practices.

  • A taxonomy of Roxygen directives for parameters, returns, and description elements. The taxonomy is structured and includes examples (taken from the mined GitHub repositories), good practices, and anti-patterns. It is more detailed and complete than Roxygen’s own package documentation.

  • An analysis of the documentation directives, discussing frequencies, anti-patterns, and comparisons to existing taxonomies. ‘Documentation directives’ are natural-language statements explaining constraints and guidelines about correctly using a piece of code (Monperrus et al., 2012).

  • We make available an extensive, well-documented replication package. See Data Availability.

Paper structure. Section 2 presents the related work. Section 3 describes the study’s setup, repository selection, dataset generation, and the protocol for the taxonomy generation. Section 4 summarises the generated taxonomy and the relationships between directive kinds. Section 5 discusses the implications, and Section 6 addresses the threats to the validity of this study. Section 7 concludes this study and outlines future work.

2 Related Work

R’s Software Ecosystem. R’s structure motivated investigations regarding its ecosystem, including the correlation between downloads and package citations to determine impact (Korkmaz et al., 2018) and its effects on publication activities (Zanella and Liu, 2020). Others explored the differences in the growth and expansion of several package-based communities (Blanthorn et al., 2019), the influence of outdated dependency versioning (Ooms, 2013), metrics to quantify the R ecosystem regarding package activity and lifecycle (Plakidas et al., 2017), and the maintainability of CRAN packages (Claes et al., 2015).

However, R’s documentation practices have yet to be approached with the same interest. Souza and Oliveira (2017) assessed markdown-generated documentation but were more concerned with R Markdown and other literate-programming formats. Treude et al. (2020) focused on documentation quality in R programming but only explored the R language manual, README files, tutorials, articles, and threads on StackOverflow. Zagalsky et al. (2018) investigated how the R community creates and curates knowledge in StackOverflow and a mailing list, determining that participation tends to be individual in the former, while in the latter it builds on other responses. The types of responses also differ: the mailing list offers suggestions and alternatives, while StackOverflow offers tutorial-like responses.

Vidoni (2021b) evaluated self-admitted technical debt in R programming through source code comments, purposefully excluding package documentation. Finally, Codabux et al. (2021) analysed technical debt in the peer review of R packages from rOpenSci, determining that package reviewers give more importance to documentation and are more inclined than developers to manage documentation debt.

API Documentation. Several works have studied API documentation regarding both taxonomies and quality (Tan, 2015). Tan et al. (2012) performed large-scale manual explorations of API documentation in Java and .NET to generate a taxonomy and patterns of knowledge, semantically parsing Javadoc tags. Maalej and Robillard (2013) assessed patterns of knowledge in Java and .NET API documentation, analysing the patterns’ frequency and co-occurrences while comparing both languages. Other works automatically analysed code assertions with semantic knowledge (Blasi et al., 2017) and developed proof-of-concept tools to detect outdated documentation (Stulova et al., 2020).

Robillard (2009) surveyed Microsoft developers and identified five obstacles to learning APIs, highlighting that documentation quality is an issue when learning an API. A follow-up study with Microsoft developers uncovered five factors to consider when designing API documentation (Robillard and DeLine, 2011). Later, Uddin and Robillard (2015) reported API documentation issues categorised as content or presentation, prioritising the former over the latter.

Existing Taxonomies. Dekel and Herbsleb (2009) defined documentation directives as natural-language statements notifying other developers about how to use a software library; however, most are exclusive to Object-Oriented (OO) programming (e.g., related to sub-classes). Monperrus et al. (2012) analysed Java documentation to determine and classify directives, extending prior work and providing a systematic, organised taxonomy for Java projects.

Roxygen’s official documentation is not a taxonomy but a high-level description of each tag’s intended use, written by the developers behind the package roxygen2. Therefore, it is not based on systematically gathered evidence and was considered by Roxygen’s own developers as the worst package documentation available (Wickham, 2019).

3 Methodology

The mining was completed by following a systematic methodology (Vidoni, 2022a) and taking into account the perils of mining GitHub (Kalliamvakou et al., 2014).

3.1 Source Selection

Although CRAN distributes packages, GitHub has risen as a distribution platform for R packages. For this study, we analysed GitHub packages since the perils and problems of mining GitHub are better known than CRAN’s, and there are clear strategies to mitigate them (Kalliamvakou et al., 2014). Moreover, GitHub is “increasingly used as a distribution platform for R packages” (Decan et al., 2016), given that CRAN reserves the right to remove packages without warning. Previous studies demonstrated that about 20% of the most downloaded CRAN packages are on GitHub and that GitHub has a more diverse sample of R packages (Decan et al., 2015). The sections below discuss how the perils of GitHub were mitigated during the process.

We also considered other sources, such as StackOverflow and GitHub issues, but disregarded them because 1) GitHub issues are exclusive to the repository in which they are opened and do not generally discuss documentation unless it caused a development problem, 2) roxygen2’s own issues report problems with the package but not with documentation, and 3) the population of StackOverflow posts discussing documentation in R programming was too small to be meaningful.

3.2 Repository Selection

The mining process followed the recommendations outlined by Vidoni (2022a), as described below.

Step 1. We defined inclusion and exclusion criteria to determine which packages to consider. The included R packages had to be public, open-source repositories written in R and with English as their main language, with a basic structure (as defined by Wickham and Grolemund (2017)); packages that provided minimal R code but wrapped other languages were also allowed.

Several exclusion criteria were defined and are presented in comparison to Kalliamvakou et al.’s (2014) perils: 1) forked packages (to avoid duplicated samples), 2) packages created before 2010 or with no commits after 2019 (to avoid inactive and low-activity projects), 3) personal, deprecated, archived, or unmaintained packages (to avoid personal projects), and 4) books, data packages, or collections of other packages (to avoid non-software projects).

Finally, the peril ‘GitHub is continuously evolving’ could be related to our use of GitHub’s ‘best match’ algorithm. However, given that we provide a comprehensive replication package, the names of the packages mined and assessed are publicly available, mitigating this threat and enabling reproducibility.

Step 2. We used GitHub’s advanced search to filter the exclusion criteria through the provided form. We applied the following search string: package NOT personal NOT archived NOT superseded archived:false created:>2010-01-01 pushed:>2019-01-01 language:R, searched using GitHub’s ‘best match’ sorting, which “combines multiple factors to boost the most relevant item to the top of the result list.” However, its algorithm is not publicly available, and its results (just like any other GitHub search) change as repositories evolve. Nevertheless, as Kalliamvakou et al. (2014) pointed out, all searches in GitHub are prone to change (this is Peril XIII). Since this affects the search replicability, the associated threats to validity are discussed in Section 6. The reproducibility package includes all package names and data used in this study to mitigate the natural variability of the search results.

This search returned 22,308 results, i.e., repositories (as of November 2020). However, regardless of how specific the query terms may be, it is possible that some packages were not properly filtered, producing false positives (i.e., packages that should have been excluded but were shown in the search) or false negatives (i.e., packages that should be included but were left out).

Step 3. Because we used manual hybrid card-sorting, it was not feasible to work with the total number of packages returned from the search (namely, \(>22k\)) and achieve results within a reasonable timeframe. Therefore, we applied a sample size calculation with 95% confidence and 5% error to determine how many packages should be effectively mined; for 22,308 repositories, the sample size was 379.
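As an illustration, this calculation corresponds to Cochran’s formula with a finite-population correction (a standard approach; implementations and rounding conventions can differ by a unit):

    # Sample size for a finite population of N repositories,
    # 95% confidence (z = 1.96), 5% margin of error, p = 0.5.
    sample_size <- function(N, z = 1.96, p = 0.5, e = 0.05) {
      n0 <- z^2 * p * (1 - p) / e^2        # infinite-population estimate
      ceiling(n0 / (1 + (n0 - 1) / N))     # finite-population correction
    }
    sample_size(22308)  # ~378; rounding conventions yield the 379 used here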

The search result list produced in Step 2 was manually inspected by both authors in the order of the results (starting from the first result of the first page) until acquiring the 379 packages that fit the inclusion/exclusion criteria. While doing this, we recorded how many packages were excluded per exclusion criterion (Table 1). We inspected 432 packages, discarded 53, and kept 379 packages. Only the 379 packages were effectively mined; this means that the ‘false positive’ packages were not mined and, therefore, not considered in this study.

Table 1 Number of repositories excluded by criteria

The difference between the 432 (scouted) packages and the final 379 (selected) packages is due to false positives, namely packages still listed as a search result that did not fit the inclusion criteria. The decision to exclude a package was taken after reading the README file and perusing the package’s code structure to assess it against the inclusion/exclusion criteria; a typical example of why these false positives were listed as search results is that many superseded/personal projects do not use GitHub’s tags and list their status only in the README file. Therefore, if a package was excluded, we had to inspect an additional one. As mentioned before, this process continued until we reached the sample size.

Every searched package was reviewed by both authors individually, and disagreements were discussed until a consensus was reached to mitigate researcher bias. We calculated the inter-rater reliability using Cohen’s Kappa coefficient, a test measuring the raters’ agreement in studies with two or more raters responsible for labelling a categorical variable (McHugh, 2012). Cohen’s Kappa results in a number \(\kappa \) in \([-1, +1]\), where the extremes indicate complete disagreement and complete agreement, respectively; nevertheless, the threshold for deciding on a high agreement varies across fields (McHugh, 2012). We considered \(\kappa \ge 0.79\) a high agreement rate, as used in software engineering studies (Liu et al., 2020). We obtained \(\kappa =+0.92\), indicating a high agreement rate and reliability for our coding.
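For reference, a minimal sketch of the computation from two raters’ labels is shown below (packages such as irr offer ready-made implementations):

    # Cohen's Kappa for two raters labelling the same items.
    cohens_kappa <- function(r1, r2) {
      lv  <- union(unique(r1), unique(r2))
      tab <- table(factor(r1, levels = lv), factor(r2, levels = lv))
      n   <- sum(tab)
      po  <- sum(diag(tab)) / n                      # observed agreement
      pe  <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
      (po - pe) / (1 - pe)
    }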

Note that this approach is considered standard and systematic, matching current methodologies for MSR selection (Vidoni, 2022a), and does not run into any peril from GitHub (Kalliamvakou et al., 2014). Additionally, the GitHub URLs of the selected R packages were kept in a CSV file alongside those filtered by criteria. It is available in the replication package presented in Section 1.

We conducted another analysis to determine the overlap of our selected packages with CRAN, BioConductor, and rOpenSci. Both BioConductor and rOpenSci enforce extensive, well-regarded peer-review processes. In particular, BioConductor is a sub-framework inside the R environment with its own automated installation using BioC (Amezquita et al., 2020). Table 2 presents the number of selected packages in each package directory. Note that rOpenSci thoroughly peer-reviews its packages but does not require them to be uploaded to CRAN (Codabux et al., 2021); likewise, it is standard for BioConductor packages not to be available on CRAN due to their intrinsic dependencies on other BioConductor packages. As can be seen, \(72.5\%\) of our selected packages were available on CRAN, effectively mitigating validity threats regarding not mining directly from CRAN.

Table 2 Overlap of GitHub packages to other R-related environments

3.3 Data Extraction

Roxygen is based on Javadoc and has a similar structure and functionality (Wickham and Grolemund, 2017). As with Javadoc (or similar systems), Roxygen allows detailing specific elements for every function, class, object, or data type existing within a package. To do that, it provides ‘tags’ (equivalent to Javadoc’s @tag) to indicate that a ‘segment’ of the documentation refers to a given attribute of a specific function. However, some can be implicit (i.e., recognised by position and not by tag) and are automatically detected when parsing a file to create the online version. It is possible to leave a segment ‘blank’ (i.e., empty), provide no information, or remove a segment/tag.

The tag structure was used to extract and organise the Roxygen documentation of the mined packages. Using the GitHub repositories’ URLs obtained from the repository selection, we used an R script to download the source code and extract the lines corresponding to Roxygen. This process consisted of the following steps, completed for each package:

  1. Read all R files located in the /R folder (equivalent to Java’s src folder). This allowed us to work only with the source code files and ignore unit-testing files, since they are located in a different folder, named /tests (at the same level as the source code folder) (Wickham and Grolemund, 2017).

  2. For each file, the script extracted all the lines of the Roxygen documentation using a regular expression: it trimmed starting white spaces and searched for lines beginning with #’, the Roxygen comment symbol; a simplified sketch of this step is shown after the list. The approach was deliberately simple, mitigating possible threats caused by the unnecessary complexity of a different tool. This generated a dataset including package name, file name, comment ID number, start and end line, current line number, Roxygen text, and function signature. It produced one large Roxygen block per function, class, data type, or object; henceforth, these four types are called ‘elements.’

  3. Each extracted comment was divided into segments (i.e., consecutive lines that belong to a tag). An auto-incremental ‘segment id’ field was added to the dataset (in spreadsheet format). Multi-line comments were kept together by reading lines until (a) a blank line or (b) a new tag began. Every row of this dataset had a key composed of the package’s name, auto-incremental comment, and segment number.

  4. Like in Javadoc, Roxygen tags are identified with an @ symbol. These were extracted to a new column using regular expressions, purposefully checking against non-Roxygen tags or words (e.g., email addresses); in particular, this check was done by comparing directly against Roxygen’s official tag list. Synonym or alias tags were kept together under the main tag only. Figure 1 showcases a Roxygen comment, highlighting the segments it contains. This simplified example is illustrative and does not include all the possible tags.
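A simplified sketch of the extraction step is shown below (the actual script also recorded comment IDs, line ranges, and function signatures):

    # Extract Roxygen lines (those starting with #') from one R file.
    extract_roxygen <- function(file) {
      lines   <- readLines(file, warn = FALSE)
      trimmed <- sub("^\\s+", "", lines)               # trim leading whitespace
      roxy    <- trimmed[startsWith(trimmed, "#'")]    # keep Roxygen lines only
      data.frame(file = rep(file, length(roxy)), text = roxy,
                 stringsAsFactors = FALSE)
    }

    # Applied to every source file in a package's /R folder:
    # files <- list.files("R", pattern = "\\.[Rr]$", full.names = TRUE)
    # roxy  <- do.call(rbind, lapply(files, extract_roxygen))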

Fig. 1 Example documentation and its segments
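For readers without access to the figure, a comparable constructed example (hypothetical function; not taken from the dataset) is shown below, where each tag delimits one segment:

    #' Compute per-group summary scores
    #'
    #' @description Computes the mean score of each group.
    #' @param scores A numeric vector; NA values are allowed.
    #' @param groups A factor with one level per observation.
    #' @return A dataframe with one row per group, returned invisibly.
    #' @export
    summarise_groups <- function(scores, groups) {
      invisible(aggregate(scores, by = list(group = groups),
                          FUN = mean, na.rm = TRUE))
    }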

Of the 379 selected packages, only 342 had Roxygen documentation. The existence of Roxygen in the package was not part of the exclusion criteria, to allow an accurate representation of poorly documented packages. Some ‘undocumented’ packages had a man folder for documentation, but their functions had no comments. Others used regular comments (written as #) instead of Roxygen’s (#’) to write minimal documentation; these comments were unstructured, limited, and sometimes written under a method’s signature. Some used regular comments to clarify code ownership or contribution without information about the function. Only four packages (of the 37 without Roxygen) exclusively used the original LaTeX-inspired documentation style.

Since Roxygen was the primary documentation type, the study centred on its analysis; this is also supported by the literature (Wickham and Grolemund, 2017; Wickham, 2015). The complete dataset has 8,670 Roxygen comments, totalling 861,601 lines. The packages had a mean of 38 Roxygen documents, with about 4.3 comments per file describing a function. The dataset was extracted only from the latest commit of the ‘master’ branch of each repository since it is often used as the main ‘release’ branch, as per standard R programming books (Wickham and Grolemund, 2017; Bryan, 2021; Bryan, 2018). This is also addressed as a threat to validity in Section 6.

3.4 Taxonomy Generation

The following subsections present the methodology for generating the taxonomy.

3.4.1 Tag Selection & Study Scope

Roxygen documentation is provided by the R package roxygen2, produced by RStudio, the most used Integrated Development Environment (IDE) in the R community. Like Javadoc, it has many tags to separate specific parts of the documentation. The tags are classified as namespace (to export elements of a package by setting visibility or to import dependencies) and documentation (to provide explanations about the elements being documented).

The group of tags considered the minimal ‘skeleton’ for a Roxygen comment (Wickham and Grolemund, 2017) comprises title and description (which can be implicit, i.e., detected by position rather than tag), parameters, return, visibility, and examples. The description section can have multiple paragraphs organised into individual sections. Likewise, the examples can reference a code page (embedded in the parsed documentation) or be written directly in the Roxygen comment. Given the dataset’s size, the large variability most tags present, and the time-consuming manual card-sorting process, we analysed a sample.

As a result, we focused only on three key elements of the documentation tags, discussed below; see Table 3 for count statistics:

  • @param paramName describes an argument of a function. It can be multi-line but requires the tag and the parameter name. Because R is dynamically typed, arguments can receive data of multiple types depending on the value of another parameter and can have a default value to use if omitted during the invocation. Argument type(s) are not enforced on the signature (the language has no reserved words for types either), are not visible, and may not be internally checked by a function. It is also possible to invoke a function with arguments written in a different order than specified in the signature by simply writing paramName = value. Finally, parameters are essential in functional programming as they allow functions’ abstraction and reuse (Hinsen, 2009).

  • @return describes a function’s return and its conditions. Because R is dynamically typed, functions’ signatures do not disclose any type, as there are no reserved words for a return and no equivalent of Java/C’s void. A developer can a) use return(...) to stop the execution and rebound the value passed there, b) use invisible(...) to rebound values that can be assigned but do not print when unassigned, or c) return implicitly, by letting the function finish and return whatever was the last in-scope variable assigned (a short example of these styles follows the list). This, plus the fact that returns are essential to functional programming (Hinsen, 2009), led us to study this segment type.

  • @description is an optional tag for the main explanation of a function. If omitted, the second paragraph of the documentation is considered the description, while the remainder creates the ‘details’ sections (Wickham and Grolemund, 2017). Therefore, we narrowed the scope to a manageable size by considering only the first paragraph of descriptions explicitly tagged (namely, those with the corresponding tag). Additionally, descriptions can be explicitly formatted using @section, items, and markdown notations. Thus, a description section may be lengthy and is not limited to a particular format; this variability adds unnecessary complexity to a manual study. An analysis of how descriptions are documented and formatted is outside the scope of this work.
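To make the three return styles concrete, the following constructed functions (hypothetical names) illustrate them:

    f_explicit <- function(x) {
      if (x < 0) return(NA)  # a) explicit: stops execution, rebounds NA
      sqrt(x)                # c) implicit: last in-scope value is returned
    }
    f_invisible <- function(x) {
      invisible(x * 2)       # b) assignable, but does not print unassigned
    }
    y <- f_invisible(3)      # y == 6
    f_invisible(3)           # prints nothing on the console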

Table 3 Total number of identified segments per tag, and its mean and standard deviation (SD) per package

The remaining tags were excluded from the analysis. Titles were disregarded as they are recommended to be a single short sentence (Wickham and Grolemund, 2017). Visibility, aliases, links, families, and related tags were left as future work since they either require extensive triangulation of documents (e.g., links and related) or do not have explanations and only perform actions (e.g., visibility or import tags). Additionally, @examples segments were not considered: given the relevance the R community puts on such elements (Wickham and Grolemund, 2017; Wickham, 2015; Chambers, 2008) and how long they can be, they are suitable for a study of their own.

Additionally, a companion study by Vidoni (2022) focused on analysing common issues in documentation, crossing the findings with a developers’ survey to determine cases of outdated, incorrect, and incomplete documentation. It reviewed comment distribution and the elements generally documented, alongside systematic Roxygen tags that do not depend on natural language (i.e., visibility, dependencies management, disclosure of references, author ownership, examples availability, aliases, and keywords). In terms of parameters, it compared the matching between names in the documentation and the code. The current paper performed a complementary analysis using a newly mined dataset. Calculating a statistical sample size coincidentally yielded the same number (379 repositories), but the datasets are different. The main reason to mine new R packages was that Vidoni’s (2022) dataset was restricted by an ethical protocol that limited the data available to protect the survey respondents’ identities. The sampling process of the current manuscript was detailed in Section 3.2.

Table 4 Extracted tagged segments of the original dataset, sample calculation details, and final sample size

Note that this taxonomy does not attempt to be exhaustive; rather, it tackles the most critical natural-language Roxygen tags.

3.4.2 Sampling & Card Sorting

Once we decided which segments to study, the segmented dataset (see Section 3.3) was divided into three parts: \(D_r\) containing all the @return segments, \(D_p\) for the @param segments, and \(D_d\) for the @description segments (only those that fit the criteria). However, due to the large datasets, we analysed a representative sample for each segment, using the total number of segments and not the total number of lines. E.g., when comparing two return segments, one can be three lines long while the other has 15, but they account for two segments and not 18 lines.

Note that topic-modelling algorithms such as LDA are intended to extract topics from a collection of documents (Wang et al., 2019); however, their purpose is not to extract or develop taxonomies, namely a systematic classification and categorisation of content. Likewise, advanced classification techniques (i.e., machine and deep learning, pre-trained models) are not intended to develop taxonomies but to assist in classifying data through supervised learning, thus requiring gold datasets of previously manually labelled data (Shyam and Singh, 2021; Fucci et al., 2019). As a result, the evaluation of automated classification techniques and their use to expand the taxonomy was left as future work.

Table 4 summarises the original size (i.e., how many segments per type were mined for the 379 packages), the conditions for the sample calculation, and the resulting sample size (i.e., segments to be manually analysed). The @description segments have a larger error margin, as per the recommendation of a statistician, since most descriptions were at least twice as long as the other segments.

For the manual analysis, we used the method call directives proposed by Monperrus et al. (2012) as ‘starting directives.’ A directive is a natural-language statement that makes developers aware of constraints and guidelines related to the correct and optimal use of an R package. A directive kind is a set of directives that share the same sort of constraints or guidelines (Monperrus et al., 2012).

Though some of the ‘starting directives’ were generic and could be applied to functional programming, many were exclusively OO (e.g., related to object inheritance). Given that R’s OO functionalities are limited and not fully embraced by most practitioners (Wickham and Grolemund, 2017), these OO-exclusive ‘starting directives’ were excluded. Then, before commencing the sorting process, we organised the remaining ‘starting directives’ into parameters, returns, errors, and others (i.e., those used in more than one group). This was done in a brainstorming session by both authors, in which each ‘starting directive’ was discussed and added to one of the groups to be used. Note that a directive may fit into multiple groups.

We applied hybrid card-sorting (Whitworth et al., 2006), which is commonly used to derive taxonomies from data and has been previously used in the R domain (Codabux et al., 2021). Hybrid card-sorting combines working from existing categories (closed card-sorting, in which data is associated with the existing categories) with defining new categories as they emerge during the categorisation (open card-sorting, where the categories are extracted from the data while organising it) (Whitworth et al., 2006). Hybrid card-sorting has been applied in many software engineering studies for taxonomy generation, including a taxonomy of functionality deletion in mobile apps (Nayebi et al., 2018), a bug-characterisation taxonomy in cyber-physical systems (Zampetti et al., 2022), and a taxonomy of mechanics and gamification parameters (Villegas et al., 2019).

Therefore, we started with two lists of predefined concepts:

(A):

The selected ‘starting directives’ (Monperrus et al., 2012). From “Method Call Directives”: Not Null and Null Allowed, Return Value, String Format, Number Range, Method Parameter Type, Method Parameter Correlation, Exception Raising, Parameter History (later renamed and expanded), and Lifecycle (extended to be on the method and not only the parameters). From “State Directives”: Method Call Sequence.

(B):

A list of data types derived from R programming books (Wickham and Grolemund, 2017; Wickham, 2015): Return Style (none, fixed, variable, normal, invisible), Return Type (primitive, non-primitive, collection, dataframe, object, entry details), Parameter Type (same as Return Type), Extended Restrictions (to consider NAs, defaults), and Format Restrictions (extending Point A’s to include Number Format, Date Format, Size/Length). These groups were discussed with two R developers (with 10+ years of experience) during an open brainstorming session, and they suggested including Return Correlation (which extended Point A’s Parameters Correlation) and References (as Roxygen allows adding links to local elements or URLs).

Starting with these sets, the samples of each dataset were explored iteratively to obtain the complete taxonomy. The inter-rater reliability was calculated using Cohen’s Kappa coefficient (McHugh, 2012), which measures the raters’ agreement in studies with two or more raters responsible for measuring a categorical scale variable. The steps were conducted as follows.

Phase 1. Both authors performed independent manual classifications. Each sampled dataset was perused, and the comments were classified into the directives stated in Point A. The disagreements were discussed during a peer-review session. We calculated the Cohen’s Kappa inter-rater reliability for each dataset, averaging \(+0.83\), ensuring a solid classification compared to similar studies (Liu et al., 2020; Huang et al., 2018).

Phase 2. Both authors worked on the agreed partial classification of Phase 1. Each sampled dataset was reread without awareness of the other author’s classification and categorised into the directives stated in Point B. The disagreements were discussed during a peer-review session; the resulting Cohen’s Kappa averaged \(+0.82\), ensuring a sound classification.

Phase 3. A new discussion arose between the authors as several ‘shared’ directives emerged. For example, given that R is dynamically typed, both returns and parameters define the data type (i.e., primitive, non-primitive). The only difference was that parameters often added format restrictions (e.g., number ranges). Another case was ‘references’ (i.e., links pointing to websites or other documented elements), since the same directive appeared in all three datasets. Thus, given that the directives were shared, the authors established unified names and renamed these classifications in all three datasets. This process did not imply reclassification, but simply renaming and extracting labels into new columns of the dataset. Thus, there was no need to calculate Cohen’s Kappa.

Phase 4. After the renaming and file restructuring, we noticed that non-predefined directives were repeated in the dataset. Examples were parameters explicitly stating a ‘deprecated’ status and others mentioning they were mandatory or ‘required.’ Both authors reviewed each dataset again (individually and independently), marking the segments believed to have an ‘emergent directive’ (highlighting the sentence or word that prompted the marking). No label was produced at this point. Then, we reviewed each case and: a) removed duplicates per dataset, b) grouped them by similarity (e.g., dataset, words used), and c) read each segment per group to create a new directive.

Phase 5. We extracted a random sample of 20 example segments for each category and discussed them with the R developers who advised on Point B. The R developers were aware of the classification, as it was not possible to keep the labels hidden while aiming to validate the classification. Since no new suggestions were made, we concluded the taxonomy generation.

To define the good practices and anti-patterns, we followed the definition provided by Monperrus et al. (2012), where good practices are an “explanation of good practices to achieve clarity and completeness when describing corresponding directives” and anti-patterns are ineffective trends that “should be avoided when describing directives in the documentation.” To obtain these, we performed three additional phases:

Phase 6. A script searched for empty tags (e.g., writing just @param p1 without adding any explanation), unlinked links (i.e., mentioning a webpage or section of the documentation without providing a working, clickable link), and incomplete citations (i.e., without DOI or publication details). This automatically determined specific anti-patterns; a sketch of the empty-tag check follows.
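A sketch of the empty-tag check is shown below (single-line case; the script also handled multi-line segments):

    # Flags @param tags that name a parameter but give no explanation.
    is_empty_param <- function(segment) {
      grepl("^#'\\s*@param\\s+\\S+\\s*$", segment)
    }
    is_empty_param("#' @param p1")            # TRUE: empty tag
    is_empty_param("#' @param p1 A number.")  # FALSE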

Phase 7. A peer-review process was conducted between both authors. Here, we agreed to convert some of the directives already labelled into anti-patterns, such as Type>Undefined and Restrictions>Ignored. In other situations, the lack of a directive was considered an anti-pattern, such as having a Type>Non-Primitive without explaining the entries (or referencing a document).

Phase 8. The remaining anti-patterns and good practices were manually explored per directive. Each author individually and independently selected 2-3 example cases that lacked information (for the anti-patterns) and cases considered complete for all the labels involved. This step implied rereading almost all 7,800 segments. These were later discussed during a) another peer-review session between the authors and b) a second peer review with the R developers consulted before. Neither of these two sub-steps produced any changes to the selections. Following the practices of Monperrus et al. (2012), we omitted the calculation of inter-rater reliability. Some anti-pattern or good-practice occurrences may be subjective; however, the risk of missing an anti-pattern or good practice and producing unreliable results is negligible.

We chose to work with samples, focusing only on specific documentation elements, because the process was extremely long and time-consuming. For instance, Phase 1 to Phase 8 required over a year (14 months) to complete. We held multiple meetings; Phase 5 alone required about two months to complete, with another two for Phase 8. Moreover, coordinating with external developers required additional time. Furthermore, it was not possible to perform automatic classification through machine learning or deep learning, given that no gold-standard data was available for supervised approaches (namely, pre-labelled data to use as a training dataset). Given the text complexity, unsupervised approaches were not recommended (Stulova et al., 2020). Therefore, this first approach was completed manually, and the labelled datasets were publicly shared in the replication package to enable future automation.

3.4.3 Anti-Patterns Extraction

Automated techniques have been used to extract anti-patterns from source code, such as when dealing with code and design smells through static code analysis (Barbez et al., 2020; Brabra et al., 2019). Prior works also investigated the automated detection of anti-patterns in API functions and variable names (Palma et al., 2015). However, those are often called ‘linguistic anti-patterns.’ They do not refer to free text, but to the wording chosen for class, function, and variable names and its effects on readability (Arnaoudova et al., 2013).

Since we were working with free text instead of code sections or elements’ names, detecting anti-patterns required combining (i) knowledge of how R works and what it allows, (ii) understanding of the documentation’s domain, (iii) comparison with the positive patterns already found, and (iv) crossing those details with known anti-patterns in traditional OOP documentation. As a result, designing an automated approach to detect anti-patterns was out of the scope of this work.

4 Taxonomy

The taxonomy derived following the methodology explained in Section 3.4 is presented in Fig. 2, which is colour-coded. In the figure, green shapes represent the segments studied in this manuscript, while the blue ones are the groups of directives (those shared between multiple segment types are indicated with a blue share icon). Grey shapes represent possible directives (exclusive directives are indicated with a |, and a red lock marks those restricted to a segment).

Fig. 2 Taxonomy of Roxygen directives

For example, if a developer needs to document a parameter, they should first find the green tag Parameters in the taxonomy of Fig. 2. They can ‘navigate’ to the required directives using the dashed arrows. If the example parameter is a date, they will need to ‘navigate’ to Format>Primitive>Date Format; if it can be NA, the directive to include would be Restrictions>NA Allowed. Finally, having identified these directives, the developer can use the fully documented taxonomy (in Appendix 1) to learn about each selected directive. We refer the reader to Section 4.2 for a more in-depth explanation.
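For instance, a parameter documented following those two directives could read as follows (constructed example; start_date is hypothetical):

    #' @param start_date A Date, or a string in "YYYY-MM-DD" format
    #'   (Format>Primitive>Date Format), marking the start of the analysis
    #'   window. NA is allowed (Restrictions>NA Allowed) and defaults the
    #'   window to the earliest observation.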

4.1 Directives Summary

The following subsections summarise the taxonomy’s directives. Given the taxonomy’s size, we outline directive kinds, mentioning each grouped directive, and discuss relationships and frequencies. The complete documentation can be found in Appendix 1. Comparisons and discussions will be presented in Section 5.

4.1.1 Shared Directives

These directives were found in multiple segments. In Fig. 2, they have a blue share symbol. For example, given that R is dynamically typed, both returns and parameters must explain their type (e.g., integer, string, list) in the documentation; others are generic and appear across all segments (namely, references and error). Thus, the shared directives are:

Style. These directives are related to the dynamically-typed nature of R programming, which does not provide reserved words for data types, thus allowing variables to hold different types at different times during their lifecycle (Wickham, 2015). As a result, the directives Fixed and Variable distinguish between parameters or returns that always accept/provide a single type (with the same internal structure in the case of non-primitives) and those that, under different conditions, accept/provide a different type. Note that No Return refers to functions that provide no return value (i.e., what would be void in statically-typed languages such as Java); it has a ‘lock’ symbol because it was only detected in the return segment.

Type. This refers to the type of data being passed; it can be either primitive, non-primitive, or undefined. Primitives are generally characters, logicals, or numbers (in all their variations), but no subdirectives were created for each type. Meanwhile, Non-Primitives are a Collection (factors, lists, vectors, arrays), a Dataframe (matrix, dataframe, tibble, table), or an Object (defined as an R object). Non-primitives can be accompanied by Entry, which details the individual values of that non-primitive. Note that the Undefined type is an anti-pattern in itself, as it was used in cases without meaningful information to infer the type being passed (e.g., an ellipsis argument without description or a vague description that highlights no type).

References. These refer to External sources (e.g., websites) not generated by the current documentation that clarify constraints on an element. Otherwise, they can be Internal, generated in other parts of the documentation, for example when mentioning another object, a shared page, or a section of the same document. In both cases, a non-working link is considered an anti-pattern.

Error. The segment describes cases in which errors are thrown because an exception is not handled. It is also valid when describing errors printed on the console or logged in a file.

Correlation. In R, arguments are not enforced and can be omitted when invoking a function. As a result, parameters are often used to change the type of a Return. Likewise, they can be used to alter other Parameters by using, enforcing, or ignoring them (related to Restrictions) or by changing the type of value they accept (related to Style and Type). Therefore, these were detected only in parameters and returns; note that the Return correlation has a ‘lock’ icon because it was only found in the parameters. A constructed example follows.
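A constructed example of a parameter-return correlation (hypothetical function) is a flag that changes the return’s type:

    #' @param simplify Logical; defaults to TRUE. If TRUE, the result is a
    #'   named numeric vector; if FALSE, a list (Correlation>Return).
    #' @return A named numeric vector, or a list when simplify = FALSE
    #'   (Style>Variable).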

4.1.2 Return-Exclusive Directives

These are found exclusively in the @return segments. They express constraints and guidelines when documenting a function’s return; the term ‘function’ is preferred as this tag can be used for regular functions or R’s OO methods. They can be:

Condition. These express how a return is rebounded, as discussed in Section 3.4. They can be Normal (either when a developer uses return(...) to stop the execution and rebound the value passed there, or when the function finishes and returns the last in-scope assigned variable) or Invisible (when using invisible(...) to rebound values that can be assigned but do not print when not assigned). Both can happen on the same return if the return is Style>Variable.

Showcase. These are possible given how R works with its console. They may refer to Writing (partial output saved as a file at a specific path), or Plot or Print (a part of the return is printed or logged on the console, or plotted into the inspector).

4.1.3 Description-Exclusive Directives

These were found exclusively on the @description segments. Although the tag is optional, this part of the taxonomy only covered the segments adequately tagged. Therefore, these directives only cover part of what can be discussed in the Roxygen function description.

State. The first subdirective is Sequence, derived from the work of Monperrus et al. (2012), which specifies the order of method calls (e.g., other functions that should be called before or after the current one). Meanwhile, Versioning indicates the lifecycle of a function, such as being experimental, stable, or other custom-made labels. These are not exclusive, and both can appear in a description.

Others. Packages often implement (or apply) algorithms previously developed in scientific papers. The directive Algorithm Citation specifies an algorithm implemented in the function; it can mention the name (for a well-known and established algorithm) or provide a citation. Meanwhile, the Individual Definition clarifies the individual behaviour of every function in a family or group and only applies to shared or grouped documents. These two are not exclusive, and both can appear in a description.

4.1.4 Parameter-Exclusive Directives

These directives appear in the @param [name] segments. They express constraints and guidelines when documenting a specific argument for a function; the term ‘function’ is preferred as this tag can be used for regular functions or R’s OO methods.

Format. Related to specific details regarding formatting requirements of a parameter. They can be String Format (as derived from the work of Monperrus et al. (2012)) regarding correct string structures, or Date Format when they refer to dates; the latter includes dates passed as strings. Regarding numbers, the taxonomy includes Number Range (when either or both minimal and maximum values are stated) and Number Format (when there is a clarification of format, such as integer or floating-point, meaning, or calculation). Primitives include Size to refer to the length (e.g., a string of no more than ten characters, a number with no more than five decimals). Non-Primitives can declare Entry or Size (e.g., a matrix’s dimensions). As a result, these directives also share a level (Primitive/Non-Primitive) with Type.

Restrictions. These refer to multiple restrictions enforced on a parameter, either by documentation only or by implementing a particular logic inside a function. Derived from the work of Monperrus et al. (2012), there is Null Allowed or Not Allowed, with the equivalent R-exclusive NA Allowed or Not Allowed, which restricts the usage of null and NA in a particular argument. As explained in the Correlation section, some arguments can be Optional (in which case they may offer a Default value that is used if nothing is received) or Required if they must be present. A parameter with a correlation can be both simultaneously (i.e., optional under some conditions, mandatory under others). Lastly, some parameters can be Deprecated (no longer used) or Ignored (not implemented yet or irrelevant), which are anti-patterns in themselves; however, some deprecated parameters are kept for backward-compatibility purposes. A constructed example combining several of these directives follows.
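For instance, a parameter combining Format and Restrictions directives could be documented as follows (constructed example; max_iter is hypothetical):

    #' @param max_iter An integer between 1 and 1000 (Format>Number Range)
    #'   setting the maximum number of iterations. Optional; defaults to 100
    #'   (Restrictions>Optional with Default). NULL is not allowed
    #'   (Restrictions>Null Not Allowed).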

4.2 Directives Relationships

In some cases, we detected that a specific directive ‘limited’ the usage of entire directive kinds or dependent directives. These special cases are summarised below. Detailed plots of the relationships are available in the Replication Package (see Section 1).

  • In the @return, if a segment had Style>No Return (meaning that it rebounded no value, like a void function), then it did not have a Condition or Type. This is reasonable, given that nothing is returned. However, it may have Showcase, Error, Reference, or Correlation. As a result, it was possible for return segments not to have a condition.

  • In the @return, there were cases in which a segment explicitly mentioned that the function always returned null. Although this could be considered a case of Style>No Return, we labelled it as Type>Primitive>Null; we considered it a special case of the former, which behaves similarly.

  • As seen in Fig. 2, only Type>Non-Primitive could have a particular type, such as Collection, Object, Dataframe, or Entry. Thus, Type>Primitive did not include this division. Likewise, a similar hierarchy happened in Format (with the primitive-exclusive formats) and in Restrictions (with the allowance of NA and null values).

There were also two cases of mutually exclusive directive kinds, namely those whose individual directives cannot overlap (i.e., it is either one or the other). One is Style. Although Restrictions>NA>... and Restrictions>Null are not found together, the existence of a Correlation>Parameters may allow them to coexist in the same segment (e.g., given a correlation, a parameter cannot be null, while in other cases it is allowed or default). The other is Restrictions>Deprecated or Ignored, which were exclusive, as many developers used the word ‘ignored’ but meant ‘deprecated’ (inferred from the remaining words).

Other directive kinds were conditionally shareable; namely, under specific conditions, they can appear together (e.g., a parameter being both optional and required depending on the values of another). These were: Restrictions (generally because of a correlation), Format (e.g., a string could not have a number format, but a number could have format and range), Type (generally a variable type, sometimes caused by a correlation), and Condition (because of a style, a correlation, or on its own due to a function’s behaviour).

Moreover, other directive kinds did not require a correlation to be shared, thus being fully shareable; these are Correlation, Others, State, Error, References, and Showcase.

4.3 Directives Frequency

Based on the relationships between directive kinds mentioned in Section 4.2, we drew insights on what is being documented and how functions work, which are summarised in Table 5. Percentages are always calculated with regard to the sample size of the corresponding dataset. A directive not appearing in the dataset does not mean it is infrequently used, and not having a directive may be an anti-pattern; however, given that the code was not inspected when reading the segments, it was not possible to determine whether this was the case.

Table 5 Frequencies of directives per segment. Percentages are always calculated with regard to the sample size of the corresponding dataset, and some may be co-occurring (e.g., a variable parameter can be both primitive and non-primitive), so not every value accounts for 100%; likewise, empty cases are not counted

Regardless of R’s flexibility, about 89.5% of segments returned a fixed type (i.e., always the same type), and only 0.37% provided no return (or always null). This was similar for the return condition, with 90% being normal and only 0.8% using invisible returns (either alone or combined). The showcase was minimal, as barely 2.4% of the total returns included one.

For Correlations, only 152 returns (about 5.4%) made an explicit mention of a Correlation>Parameters, while about 355 parameters did so (about 9%). Parameters also had about 374 records (close to 9.5%) of Correlation>Returns. Thus, stating a correlation seems more common in parameters than in returns. We did not perform a matching study to see how many parameters and returns belonged to the same function.

Regarding Type, almost 92.4% of the returns were Type>Non-Primitive, and the order of popularity for types was objects, dataframes, and collections (with 43%, 33%, and 22.8% of the total, respectively). About 49.5% of the total disclosed the Entry of their non-primitives. This trend was considerably different in the parameters sample: about 71.5% were Type>Primitive, and only 4.2% were Type>Undefined; for the non-primitives in parameters, the order was collections, objects, and finally dataframes. Through this, we can confirm that parameters are often used as configurations and entries for the algorithms or analyses performed in a function, hence the different types of returns.

Regarding Format, the most common was Format>String Format, appearing in about 10% of the parameters’ segments. Restrictions>Default and Restrictions>Required were the most common constraints (14.7% and 14.2% of parameter segments, respectively). On a positive note, only 0.65% of parameters were stated to be Restrictions>Ignored, and only 0.33% were stated as Restrictions>Deprecated. The latter indicates positive practices regarding updated parameters, albeit it is possible for the documentation to be outdated and not disclose such situations; further explorations regarding version-control changes are needed.

Overall, 32.9% of returns had a Reference>Internal (932 records), with only four records registering an explicit External Reference. Parameters had about 13.4% of Reference>Internal (given the larger sample, this meant 526 records) and 2.6% (102 records) of Reference>External. Descriptions had the fewest references, with 19.6% and 7.6%, respectively; however, of the total, only 6.77% of descriptions mentioned an academic citation. Returns had the most internal references, but parameters had the most external ones; we hypothesise this may be because parameters are often used as configuration, hence linked to the papers that created the method implemented in a function.

Finally, the usage of Error>Exception Raising is not common, as it appeared in barely 0.96% of the parameters (38 records), 3.15% of descriptions (exactly 34 records), and 0.7% of returns (20 records).

5 Discussion

5.1 Taxonomy vs. Roxygen Documentation

Roxygen’s current documentation is a high-level description of some tags, substantiated with roxygen2’s developers’ perception (“guidelines”) of how Roxygen should be used. For example, in the official Roxygen documentation, only the following information is provided regarding the tag @param:

@param name description describes the inputs to the function. The description should provide a succinct summary of the parameter type (e.g. a string, a numeric vector) and what the parameter does if it is not evident from the name. The description should start with a capital letter and end with a full stop. It can span multiple lines (or even paragraphs) if necessary. All parameters must be documented. You can document multiple arguments in one place by separating the names with commas (no spaces). For example, to document both x and y, you can say @param x,y Numeric vectors.

@param tags that appear before the class definition are automatically inherited by all methods if needed.

As can be seen, regarding parameters there are no suggestions on how to document Correlation (between parameters or with a return), Format or Type (neither for Primitives nor Non-Primitives), or Restrictions. Cross-link documentation (for References>Internal) is mentioned in isolation through the @seealso tag, which is not necessarily the only option for this (Vidoni, 2022).

Similarly, only the following information regarding the tag @return is available in Roxygen’s official documentation:

@return description describes the output from the function. This is not always necessary, but is a good idea if you return different types of outputs depending on the input, or you’re returning an S3, S4 or RC object.

One of the examples in the documentation hints at a correlation between parameters and returns but does not describe how to document any of the return- or parameter-related directive kinds presented in this work. Moreover, regardless of R being dynamically typed, Roxygen’s official documentation does not mention how to document variable parameters or returns (i.e., the ones whose type changes given a condition), which our taxonomy explains in detail through the Style directives.

Therefore, our taxonomy provides more detailed and extensive guidance for package documentation regarding the analysed tags. Given that this work is a companion to a previously published study on semi-automated Roxygen tags (Vidoni , 2022), it is reasonable to assume that the contribution of this paper is relevant for R developers and Roxygen documentation.

Given the above, we provide some suggestions grounded in the taxonomy to extend Roxygen’s current documentation regarding parameters and returns:

  • Clarify how the Type of each parameter or return must be documented and why this is required. Previous studies indicate that R developers do not consider themselves developers, e.g., (Pinto et al. , 2018), but this perception is no reason to overly simplify the information provided to them.

  • Use the taxonomy’s dashed arrows to establish the “suggested order" in which to document an element (see the sketch after this list). For example, a parameter’s documentation could follow this order: Type, Format, Style, Restrictions, Correlations, References; meanwhile, a return could use: Style, Condition, Type, Showcase, Correlation, and References. The suggested order could be altered by the community; however, its primary purpose is to act as a mnemonic to 1) help developers remember what to document and 2) establish common ground across R programming.

  • Once the order above is presented, Roxygen documentation could include specific examples showing different combinations of directives, because only some parameters or returns would need all the directives (e.g., a fixed return may only convey this implicitly).

  • Although Roxygen provides tags for references, their use is not regulated, and citations of external works (especially academic works) appear in various formats and citation styles. Roxygen documentation could strongly suggest a particular format to establish common ground between developers.

  • Similar to the above, Parameters>Restrictions should be further elaborated in Roxygen’s documentation, providing specific guidelines on how to document them, why they are needed, suggested practices for deprecation, and how to handle ignored parameters/values kept in the function signature for backward compatibility.
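As a purely illustrative sketch of the first two suggestions, the following hypothetical function documents a parameter and a return following the suggested orders, with parenthetical labels mapping each fragment to its directive:

    #' @param dpi Numeric scalar (Type) in dots per inch (Format); ignored
    #'   when `format = "pdf"` (Correlation>Parameters); must be positive
    #'   and defaults to 72 (Restrictions); see \code{\link{save_plot}}
    #'   (References; a hypothetical function).
    #' @return Invisibly (Style), and only on success (Condition), a string
    #'   (Type) with the path of the written file.
    export_plot <- function(file, format = "png", dpi = 72) {
      stopifnot(is.numeric(dpi), dpi > 0)
      # plot-writing logic omitted in this sketch
      invisible(file)
    }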

Beyond these suggested steps, many of the directives uncovered through this work could be used to generate additional tags for Roxygen documentation; for example, @notNull or @notNA (to automatically document the corresponding Restrictions), or type-hint styles (akin to Python’s PEP-0484Footnote 14).
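To be explicit, @notNull and @notNA are not real roxygen2 tags; the following sketch only illustrates what such an extension could look like (the commented tags are inert, so the snippet still parses as valid R):

    #' @param weights Numeric vector of observation weights.
    #' @notNull weights
    #' @notNA weights
    weighted_total <- function(weights) sum(weights)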

5.2 Taxonomy vs. Other Taxonomies

The taxonomy generated in this study was derived from the ‘Method Call Directives’ (MCD) taxonomy established by Monperrus et al. (2012), as explained in Section 3.4.2 (referred to as ‘starting directives’); moreover, some of our directives map to multiple ‘patterns of knowledge,’ according to the definition provided by Maalej and Robillard (2013). This is summarised in Table 6.

Table 6 Comparison of the developed Roxygen Taxonomy (central column) to Monperrus et al. (2012) (used as ‘starting directives,’ as per Section 3.4.2), and the patterns of knowledge by Maalej and Robillard (2013)

In particular, as explained in Section 3.4.2, we only used the MCD from Monperrus et al. (2012), given that we were not assessing R’s OO features or documentation. Nevertheless, our Roxygen Taxonomy is more detailed in the granularity of the information provided (e.g., we included Number Format and Number Range as separate directives), adds groups to ‘cluster’ directives (e.g., Correlation or Format), and links each directive to the elements allowed to use it. This represents a two-fold improvement over, and key differences to, Monperrus et al. (2012)’s work.

  1. Our Roxygen taxonomy’s structure reduces duplication and allows developers to observe the commonalities between different elements in specific package documentation (e.g., Returns and Parameters); an example of duplication in Monperrus et al. (2012)’s directives is the pair ‘Return Value’ and ‘Parameter Type.’

  2. Our Roxygen taxonomy extends some directives to multiple elements. Although this is primarily due to R’s dynamically-typed nature, which requires further clarification of types, multiple directives benefited from it. For example, while Monperrus et al. (2012) only associated ‘Correlations’ and ‘String Formats’ with parameters, our exploration demonstrated that, in R packages, they also relate to returns. This helped us create a more extensive taxonomy.

We also compared the content found on the Roxygen taxonomy to the ‘patterns of knowledge’ (Maalej and Robillard , 2013). Although two ‘patterns of knowledge’ were out of scope for this study (namely, ‘Code Examples’ and ‘Patterns’), we found the rest among the Roxygen directives. In particular, some of our directives intersect multiple knowledge patterns. For example:

  • Correlations were considered ‘directives’ (specifying what users can/cannot do), ‘control-flow’ (describing how the package triggers events/behaviours based on the correlation), and ‘functionality’ (describing the package’s function). For example, @param ’add’ boolean that determines if all items should be added to the travis yaml file or printed on screen is a Correlation>Return; meanwhile, the following is a Correlation>Parameters (a correlation between parameters): @param dpi Input the dpi. If the imageFormat is "pdf," users need not define the dpi. For "png" images, the default dpi is 72. It is suggested that for high-resolution images, select a dpi of 300.

  • Types (namely, primitives or non-primitives) were classified both as ‘directives’ (because they are a clear contract on what type to pass, regardless of R’s dynamic nature) and ‘functionality’ (because they clarify how the function uses the parameter or generates the return).

  • Return>Style was considered a ‘directive’ (because it explains what the user can expect to receive/do with a return), but also ‘quality attributes and internal aspects,’ because they often explain the conditions for those returns. For example, for a variable return: @return igraph_options returns a list with the old values of the updated parameters, invisibly. Without any arguments, it returns the values of all options. For igraph_opt, the current value is set for option x, or NULL if the option is unset.

  • Although Table 6 labels References and others as patterns of ‘references’ and ‘concepts,’ we still consider them directives because, in many cases, the work being referred to (e.g., a citation) will often explain what values to pass/expect, effectively constraining how the developer interacts with the package’s functions.

5.3 Implications

For Researchers. This study provides exploratory insights into package documentation in R programming. It paves the way for future studies in Documentation Debt, such as the evolution of directives over time (i.e., outdated documentation), the impact of anti-patterns, and the extension of the taxonomy into other segments, among others. This can be done by using the provided directives for closed-coding segments of mined documentation to train automatic classifiers, enabling large-scale predictions and assessments across time (namely, throughout a project’s git history). Likewise, this taxonomy enables future human-centric studies, such as those related to challenges and usages (Meng et al. , 2019).

Additionally, our study presents evidence of how R documentation differs from documentation made for statically-typed OO languages, supporting further research in this domain. A clear example is the need for R developers to make the type of a variable or return explicit (by stating it in the text), given that R, as a dynamically-typed language, does not provide typed variables, and type constraints cannot be added to function signatures. Our taxonomy’s directives were created for R but could be extrapolated to other languages to enable comparative studies across programming languages or even ‘idioms’ within R.

Although some intrinsic characteristics presented in this work may have been ‘intuitively’ known, our results provide systematic evidence of their existence and impact, supporting future investigations by detecting specific patterns–for example, investigating how parameters are correlated and which correlations are most common.

For R Developers & Data Scientists. The taxonomy generated in this manuscript (and available in Appendix 1) can be used as a guideline for R developers to decide what to include in their documentation, how to avoid anti-patterns, and which practices to uphold. In particular, the “discussion" section of each directive often mentions specific functions or configurations (e.g., invisible() for Return>Condition>Invisible) to assist developers in determining what to document.
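For instance, a minimal sketch (hypothetical function) of the invisible() pattern behind Return>Condition>Invisible, where the @return text discloses the invisibility:

    #' @return The input `x`, returned invisibly so the function can be
    #'   chained without printing (Return>Condition>Invisible).
    log_and_pass <- function(x) {
      message("processing ", length(x), " elements")
      invisible(x)
    }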

As discussed in Section 5.1, our taxonomy is more detailed and extensive than the current guidelines provided by Roxygen. Its level of specificity will provide more guidance to R developers–especially those who have never crafted software documentation on their own, or never considered whether their writing was readable and understandable (Vidoni , 2022).

Our focus on the ‘good practices’ and ‘anti-patterns’ of each directive can improve real-world practices, eventually leading to educational trends and practical advice that may contribute to reducing documentation debt in R packages (Codabux et al. , 2021; Vidoni , 2022) by making it more explicit–for example, through tools that detect segments and provide hints and suggestions on what to document. Such future works are possible because our taxonomy establishes relationships between directives, simplifying the decision of what to include for each element.

Developers often do not understand how to use a poorly-documented package, or how to “debug" code that depends on it, and thus resort to building their own. With proper documentation, such issues can become less common, the usability of R packages beyond the original developers will increase, and contributing to a repository (instead of creating new software) will be more straightforward.

Additionally, given that organisations for peer-reviewing R packages focus extensively on documentation (Codabux et al. , 2021), our results can assist them in establishing a standard of how to document a package and what to look for when reviewing packages prior to acceptance/publication.

For Educators. R is often taught as part of mathematics and data science courses without a solid perspective on code quality or on matters essential to traditional software developers (Thieme , 2018; Kross et al. , 2020; Datta and Nagabandi , 2017; Auker and Barthelmess , 2020). The taxonomy proposed in this article can be used as a teaching and learning resource to establish a baseline of quality documentation. It is presented in the structured format proposed by Monperrus et al. (2012), with detailed explanations and accompanying examples. It can also assist educators in directing their curricula, planning classroom activities, and teaching the relevance of proper documentation to avoid accumulating Documentation Debt in R packages.

5.4 Future Works

Several avenues can be derived from this work, some of which were already mentioned in prior sections.

Taxonomy Extension. Extending the taxonomy to cover other segments (e.g., aliases, titles, sections, visibility, function families) will be a priority, which can be achieved by reusing our dataset. A cross-source comparison between different documentation sources (such as pkgdown websites, developers’ tutorials, and blog posts) remains a future work. Likewise, future works on Roxygen documentation could use an iterative, stratified sampling approach based on our emerging taxonomy.

The current proposal has not been validated; however, it was systematically extracted from real-world data, thus generating knowledge grounded in current, actual R developers’ practices. Although this mitigates some threats, a more thorough assessment will be required before using the taxonomy, to ensure its rigour for future work. Once the taxonomy is extended, another avenue will be conducting a developers’ survey to validate our results from a human-centric perspective; nevertheless, given the extensive work completed in this article, such a survey was out of scope.

Human-Centred Analyses. Our taxonomy will enable future mining software repository studies, such as evaluating social aspects (e.g., who completes the documentation and when) and understanding how the directives identified in this work evolve over time (e.g., inspecting commits to uncover which types of changes are made, who makes them, and when). Finally, it will allow further comparison with documentation practices in other programming languages, such as Python and Julia, especially given that R shares general similarities with them (e.g., scientific focus, dynamic typing).

Automated Classification. Using both our directives and open dataset, it will be possible to train machine and/or deep learning models, as done by Shyam and Singh (2021) and Fucci et al. (2019), and to develop tool support for comment assistance, either automatically generating the comments or advising developers of anti-patterns. Although some studies have been conducted in Jupyter Notebooks (which, just like Python, are also dynamically-typed) (Miceli Barone and Sennrich , 2017), they require an understanding of common patterns and anti-patterns (Liu et al. , 2021) and manually-labelled datasets which, before our study, were not available for R programming. Machine learning for knowledge identification in APIs has been used previously (Fucci et al. , 2019); however, such approaches also require gold-standard datasets for the supervised training of the different algorithms, and our taxonomy will enable similar works for R. For example, a future avenue of research would be to investigate how directives and anti-patterns are used from a practitioner’s perspective to determine their impact and perception.
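As a minimal sketch of the kind of heuristic pre-labelling our labelled dataset enables (the regular expressions are illustrative and not a trained model, and the sample lines are hypothetical), candidate directives can be flagged in extracted Roxygen lines before training a classifier:

    # Toy keyword-based flagging of directive candidates in Roxygen lines.
    roxy <- c(
      "#' @param path Required. Path to the input file; must not be NULL.",
      "#' @param sep Optional separator; defaults to \",\".",
      "#' @return Invisibly, the path of the written file."
    )
    patterns <- list(
      "Restrictions>Required" = "\\bRequired\\b|must not be NULL",
      "Restrictions>Default"  = "[Dd]efaults? to",
      "Condition>Invisible"   = "[Ii]nvisibl[ye]"
    )
    # One logical vector per directive: which lines match each pattern.
    hits <- lapply(patterns, function(p) grepl(p, roxy, perl = TRUE))
    str(hits)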

6 Threats to Validity

Internal Validity. This aspect examines whether the data treatment affected the outcome (Zhou et al. , 2016; Ampatzoglou et al. , 2019). The manual nature of the study limited how many comments could be explored within a reasonable time frame; thus, we worked with representative sample sizes. To minimise researcher bias, the entire classification was performed by both authors independently and discussed at different stages (Section 3.4). Both authors have extensive experience in technical debt, years of programming experience, and are versed in R. Moreover, we discussed and validated our findings with expert R developers. A validation performed with more raters remains a future work.

Though using GitHub’s ‘best match’ sorting approach is a common standard, it has no clearly defined algorithm. As with any other GitHub search, its use threatens the validity, reproducibility, and generalisability of the data collection and sampling approach, given that the order of packages obtained for this paper may not be precisely reproducible in the future. However, the search results represented the current state-of-the-art when the data was mined, which evolves as software continues to be maintained. To mitigate the presence of false negatives or false positives provided by the ‘best match’ sorting, we manually analysed each package provided by the search (in the order of the results) to double-check the inclusion/exclusion criteria. The entire process is explained in Section 3.2.

The data was only mined from the latest commit of the ‘master’ branch of each repository; this decision was made because the ‘master’ (alternatively known as ‘main’) branch is suggested as the core ‘release’ branch in standard primer R programming books (Wickham and Grolemund , 2017; Bryan , 2021; Bryan , 2018). This may have led us to analyse incomplete documentation or documentation currently being written. However, we did not perform a completeness/correctness analysis, given that its impact on the quality of the taxonomy is negligible.

External Validity. These threats refer to the generalisability of results. Mining packages from GitHub instead of CRAN enabled future works and ensured that the packages under study were varied enough to better depict what the community offers. It is well-known that not all packages go to CRAN and that significant overlap exists between the two (Decan et al. , 2015; Decan et al. , 2016). Several strategies were used to ensure generalisability: the inclusion and exclusion criteria were defined following accepted standards (Kalliamvakou et al. , 2014), and a best-match approach was first used to obtain the packages that aligned with the criteria. A representative number of suitable packages was selected from GitHub, and packages were inspected to ensure they fit the criteria. At different points, and to keep the work manageable, random samples of the data were used to generate the taxonomy while maintaining the generalisability of the results (Section 3).

Regarding generalisability to other languages, some directive kinds are applicable to other dynamically-typed languages (e.g., Style, Type), while others derive from R’s own capabilities (e.g., Condition, Showcase, Restrictions). While the former may be applied to other dynamically-typed languages, further evaluation is needed to assess the latter in other contexts. Nevertheless, generalising the results of this work to other languages was out of the scope of this study, and it is considered future work (Section 5.4).

Likewise, given the extent of the work required for this taxonomy, further validation steps were left as future works; for example, manually checking categories against a new set of mined packages, and/or performing surveys and real-life assessments of the taxonomy. These are discussed in Section 5.4.

To ensure the taxonomy was complete, we worked with representative samples of a large, diverse dataset that included multiple projects of diverse sizes and characteristics. Moreover, we are making public the names of the packages and the labelled dataset to enable further studies in this area (Section 1). Our results are based on real-world data because we mined repositories of existing packages that continue to be worked on. Additionally, the continuous consultation with experienced R developers during the methodology (Section 3.4) contributed to its coherence with real-world practices.

7 Conclusion

This study conducted an MSR study of 379 repositories of R packages from GitHub, systematically parsed to extract their Roxygen documentation. We used a hybrid card-sorting approach to explore generalisable samples of three key segments (parameters, returns, and functions’ descriptions) to determine which directives (i.e., types of natural language statements) are used in documentation. The paper introduces a taxonomy of directives for R functions (systematically documented in the Replication Package), alongside the coded dataset, which is publicly available. We also analysed the relationships between directives and their frequencies.

Although the proposed taxonomy can be extended, the data provided is a valid and helpful construct: it can support R programmers in identifying critical elements to include in their documentation and direct researchers to new opportunities for investigating Documentation Debt in scientific software. This study aims to serve as an empirical foundation for future works.