1 Introduction

Reporting guidelines are an important concern for software engineering experiments. Arguably, using reporting guidelines makes it easier for the reader to understand the experimental design and the validity of the conclusions (Jedlitschka et al. 2008). These benefits have motivated the community to design and refine guidelines that support systematic and consistent reporting (Singer 1999; Wohlin et al. 2000; Juristo and Moreno 2001; Kitchenham et al. 2002; Shaw 2003; Jedlitschka et al. 2008).

Despite these efforts to establish standards, it has been observed that reporting in practice is often heterogeneous and important information is missing (Jedlitschka and Ciolkowski 2004; Sjøberg et al. 2005; Jedlitschka et al. 2008). Indeed, research on reporting guidelines has remained largely prescriptive. We know little about the extent to which reporting guidelines are used and how their uptake has changed over time. This might be because the number of controlled experiments has drastically increased since 2000 and because quantitative analysis of how these are reported is difficult. Still, gaining insights into actual reporting practice is important to further improve guidelines and reporting practices.

In this paper, we investigate actual reporting practices for controlled experiments with human subjects that have been published in major software engineering journals between the years 2000 and 2020. To this end, we use a mixed-method approach combining coding techniques from qualitative research with a formal analysis of event sequences from process mining. Our analysis reveals the actual reporting path of experiment papers and the degree of conformance for different journals over time. We find that conformance oscillates between 55% and 75% for all covered journals without a clear upward trend. Our subsequent citation analysis shows that roughly one-third of the papers do not refer to any of the experiment reporting guidelines, while replication studies hardly ever refer to the guideline by Carver (2010) for replications. Based on the observed results, we highlight several implications for improving both reporting practices and refining guidelines, among others by covering theory more explicitly.

The remainder of this paper is structured as follows. Section 2 discusses the role of experiments in software engineering with a focus on reporting guidelines. We present hypotheses on the presumable impact of these guidelines. Section 3 describes our mixed-method research design, including paper selection, coding procedures, and analysis techniques. Section 4 presents our analysis results, focusing on the conformance between the reporting sequence of papers and guidelines. Section 5 discusses the implications of our research and threats to validity. Section 6 concludes with a summary and an outlook on future research.

2 Background

In this section, we first discuss the role of experiments in software engineering. Then, we revisit reporting guidelines for experiments in software engineering. Finally, we hypothesize how reporting practice could be expected to develop over time.

2.1 Experiments in Software Engineering

Experiments are an important means for generating new scientific insights. Gauch (2003) highlights the strengths of experiments, including control and understanding of causal factors. For these reasons, experiments are also increasingly used in software engineering. Research by Basili (1993) and Basili (1996), Kitchenham et al. (2004), and Wohlin et al. (2000) and Wohlin et al. (2012) laid the foundations for developing the field of empirical software engineering. Basili (1993) and Basili (1996) emphasize the benefits for software engineering of developing an experimental line of research similar to other scientific disciplines. Specifically, he proposes an experimental approach inspired by the quality improvement paradigm as often adapted in industrial development and engineering. Kitchenham et al. (2004) highlight the potential of adapting principles from evidence-based medicine to software engineering and discuss both benefits and difficulties of its adoption. Finally, Wohlin et al. (2012) provide an extensive summary of methodological principles for conducting software engineering experiments. They emphasize the importance of experiments, given that the practice of software engineering builds on the interactions between software artifacts and the behavior of software developers and other stakeholders.

Various reviews have investigated how and to which extent experiments are used in software engineering. Shull et al. (2004) emphasize the importance of replication for establishing reliable insights into software engineering. They present guidelines that are meant to support a better transfer of knowledge about experimental designs and execution.

Sjøberg et al. (2005) review controlled experiments in software engineering published between 1993 and 2002, focusing on how relevant information on subjects and tasks is reported. Their findings suggest that reporting is often incomplete and unsystematic, with partially inconsistent terminology. They state that the software engineering community needs guidelines that help researchers better tackle the methodological and practical complexity of conducting controlled experiments. They provide a list of experimental details that they consider necessary to be reported. The review by Kampenes et al. (2007) drills down into the aspect of effect sizes. They observe that only 29% of the reviewed experiments reported effect sizes, even though this information is considered essential for interpreting experimental results. Additionally, Dybå et al. (2006) review the quantitative assessment of statistical power in software engineering experiments, which they find to be below established norms. They stress the importance of reporting confidence intervals and effect sizes. Hannay and Sjøberg (2007) review to which extent software engineering experiments build upon a theoretical justification of hypotheses. Their results reveal that out of 103 articles, only 23 build on theories, drawing on 40 theories in total. These theories were mainly used for two reasons: to justify research questions and hypotheses as part of the experimental design and to provide additional explanations of the results. The benefits of building theories and of building on theories for software engineering experiments are stressed by Hannay and Sjøberg (2007). As an aid, they provide an extensive overview of the theories used in the papers that they reviewed.

It is important to note that the mentioned reviews stem from the years 2004 to 2007. The weaknesses they uncovered led to a refinement of guidelines for reporting software engineering experiments.

2.2 Experimental Reporting Guidelines in Software Engineering

Reporting has been a concern of research on software engineering experiments since the paper by Singer (1999), and several subsequent papers describe reporting guidelines. We provide a short description of these guideline papers and a comparison between them.

The first guideline paper by Singer (1999) introduced the APA style guidelines to the field of empirical software engineering. According to Singer (1999), the abstract should summarize the paper including information about hypotheses, population, and results. The paper itself should first present the study’s problem and a brief explanation of the research strategy; describe in detail the experimental method, participants, and materials; outline the experimental procedure; and then present the statistical analysis of results with a discussion of the overall findings.

The second guideline of interest is proposed in the book by Wohlin et al. (2000). The authors emphasize the need for describing the scope of the experiment, its goals, and foundational definitions. A summary of the experimental planning should cover the context selection, for its importance to validity and generalization, as well as the hypothesis formulation, including justifications for the selection of variables and subjects. The instrumentation is also meant to be described. Among others, Wohlin et al. (2000) discuss which threats to validity have to be considered and how they can be addressed. The book also provides guidelines on analyzing data and interpreting results, together with suggestions for writing a summary and conclusion. It is worth mentioning that the second edition (Wohlin et al. 2012) refers to the reporting guideline by Jedlitschka et al. (2008).

Guidelines are also presented in the book by Juristo and Moreno (2001), later reprinted as Juristo and Moreno (2010). These guidelines are motivated by the idea that good experimental documentation needs to describe all phases, including goal definition, experiment design, execution, and analysis, at a level of detail that allows a replication study to be conducted. To this end, the motivation should justify the relevance of the experiment, provide pointers to prior experiments, and describe experimental goals and hypotheses. The experimental design should report the experimental factors, response variables, parameters, blocks, as well as experimental units, subjects, and the experimental process. Also, information should be provided about internal replication, the randomization procedure if applicable, as well as subject knowledge, the experimental schedule, and various factors that may have influenced the experimental result, such as potential learning and boredom effects. Regarding experimental execution, details are warranted about experimenters, instructions to participants, the available time for completing the study, as well as experimental deviations and data collection. Finally, the analysis best includes information on data normality, analysis methods, results, and result interpretation.

The guideline by Kitchenham et al. (2002) presents a hands-on approach for reporting the experiment. It suggests describing the context of the experiment in extensive detail. Then, the experimental design is described regarding the planning of subjects, sampling techniques, and the process of allocating and administering treatments. Next, the procedures of experiment execution and data collection are summarized. This leads to the data analysis, presentation, and interpretation of the results, before the report concludes. We have to emphasize that this guideline presents a more general instruction on how to run an experiment, but lacks precise instructions on how to report it.

The most extensive work on reporting is arguably the guidelines by Jedlitschka et al. (2008), which we will use as a reference in this paper for their level of detail. Note that these guidelines are fairly consistent with other guidelines, but more fine-granular. These guidelines suggest starting with the title and authorship section, which should include the term “controlled experiment” in the title. The structure of the abstract is inspired by abstracts in the medical sciences. The actual paper starts with the introduction section, including the problem statement, research objectives, and context. The related work discussion should summarize prior contributions as well as the technology under investigation and alternative technologies. Next, the section on the planning and experimental design covers various aspects. These include research objectives; information on the sampling strategy, population, and sample size; experimental material, objects, tasks, and subjects; subsections on hypotheses, experimental design, and the experimental procedure with details on data gathering; as well as a description of the analysis procedure. In turn, the section on the actual experiment execution is followed by the analysis section. Readers should be provided with descriptive statistics, data set preparation procedures, and hypothesis testing results. The discussion and interpretation section should cover results and implications, as well as threats to validity. The conclusion includes a summary and future work propositions.

Table 1 summarizes the guideline structure by Jedlitschka et al. (2008) together with its sections, subsections, and a short description. The four columns on the right-hand side of this table describe which of its subsections are also considered by previous guidelines, namely G2: Singer (1999); G3: Wohlin et al. (2000); G4: Juristo and Moreno (2001); and G5: Kitchenham et al. (2002). Table 1 also highlights that the guidelines by Jedlitschka et al. (2008) are the most fine-granular ones, and that they show substantial overlap with the activities of other guidelines. Table 1 marks those activities with a check if they are explicitly covered in the reporting structure. Furthermore, it is important to mention that there are guidelines that we did not include in this comparison. Jedlitschka et al. (2014) is an extension of Jedlitschka et al. (2008) suggesting the inclusion of additional information for practitioners, such as costs, quality, schedule, and productivity in the context of software development. We did not consider the introduction, methods, results, and discussion (IMRAD) guideline (Sollaci and Pereira 2004), although it is widely used in various research fields, as it is not specifically designed for software engineering experiments.

Table 1 Description of Jedlitschka et al. (2008) guideline (G1) and comparison with previous guidelines: G2: Singer (1999); G3: Wohlin et al. (2000); G4: Juristo and Moreno (2001); G5: Kitchenham et al. (2002)

2.3 Propositions on the Evolution of Reporting Practices

In this section, we formulate a set of propositions that help us to define clear analysis objectives. Such an approach is consistent with general guidelines on conducting systematic literature reviews (Kitchenham and Charters 2007), which emphasize the need to formulate research questions and objectives. Investigating reporting practices for software engineering experiments is warranted, because the last larger review covering this aspect dates back to Sjøberg et al. (2005) and various proposals for reporting guidelines have been made since then.

Proposition 1 (Patterns)

We expect that patterns of reporting can be observed. Two arguments support this assumption. First, reporting an experiment is a human activity of an author team that becomes routinized by means of social entrainment (Gersick and Hackman 1990). This means that the same author team will likely organize the reporting of a new experiment in the same way as they did for the previous one. Such persistence of behavior has been studied, among others, by Kelly and McGrath (1985). Second, experimental reporting is subject to the social norms of the scientific process in a particular field. Social norms contribute to the development of recurring patterns of behavior (DiMaggio and Powell 1983). These are further stabilized by mimetic behavior (Gersick and Hackman 1990), i.e. imitating the reporting of published experiments in case authors have not yet established their own way of reporting. All these aspects contribute to the emergence of reporting patterns.

Proposition 2 (Conformance)

We expect that compliance with reporting guidelines can be observed. DiMaggio and Powell (1983) emphasize that normative pressure is a key factor that explains why organizations have been observed to be rather similar. These pressures are stronger in fields in which formal education and professional networks establish standards. Arguably, these attributes can be associated with empirical software engineering and reporting experiments in this field as well, contributing to compliance with reporting guidelines.

Proposition 3 (Evolution)

We expect that the reporting patterns have evolved over time and that this evolution is associated with institutional forces. In particular, reporting practices have presumably become more similar over time. Similar observations have been made by Levitt and Nass (1989), who compared the topic sequence in leading textbooks in physics and sociology over time. Levitt and Nass (1989) explain their results with institutional forces, including coercive, mimetic, and normative pressures (DiMaggio and Powell 1983). Such forces are arguably also relevant for reporting software engineering experiments.

Proposition 4 (Contingency)

We expect that subgroups of papers with similar reporting can be observed. Burnes (1996) emphasizes that there is often “no one best way” of applying methods because contingent factors require an adaptation to circumstances. Similar observations have influenced situational method engineering (Brinkkemper 1996). This does not mean that reporting is arbitrary, but that differences are systematic and associated with contextual factors. As a consequence, we would be able to observe that certain types of papers form clusters. Often, when there are hardly any patterns overall (Proposition 1), it is still possible to identify patterns for subgroups, which is what we investigate for this proposition.

Proposition 5 (Factors)

We finally expect several factors to be associated with reporting practices. First, we expect that the awareness that papers exhibit concerning the discourse on reporting guidelines is associated with reporting practice. The weakest indication of such awareness is arguably the citation of a reporting guideline. Second, the specific setting of an experiment might have an impact on reporting. Presumably, replication studies define a context in which specific reporting needs have to be considered (Carver 2010).

Next, we describe how we constructed our dataset with which we aim to investigate these propositions.

3 Method

In this section, we present the research design for investigating Propositions 1–5. To this end, we use a mixed-method approach that combines qualitative and quantitative research methods. More precisely, we apply a sequential mixed-method design (Venkatesh et al. 2013). We first conduct qualitative coding of experiment papers inspired by systematic mapping studies (Kitchenham and Charters 2007) and qualitative coding procedures (Saldaña 2015), which yields structured data that we analyze using computational methods (Berente et al. 2019), namely process mining (van der Aalst 2016) and method mining (Malinova et al. 2019).

We proceed as follows. Section 3.1 defines preliminary concepts that we make use of. Section 3.2 explains our paper selection procedure, and Section 3.3 how we coded the selected papers as event sequences. Section 3.4 describes the analysis techniques we applied, and Section 3.5 provides a summary of which technique is applied to investigate which proposition.

3.1 Preliminaries

Our research method builds on the overall idea that a paper describing a software engineering experiment can be represented as a sequence of sections, and that this sequence can be compared with reporting guidelines by the help of process mining techniques. To this end, we have to map a paper to a structured format representing this sequence of sections. We define this paper structure as follows.

The formal structure of a paper \(\mathbb{P} = \left\langle s_{1}, \ldots, s_{n} \right\rangle\) is a sequence of sections and subsections \(s_{i}\). For all pairs of indexes \(i, j \in \mathbb{N}\) with \(i < j\), we say that \(s_{i}\) appears before section \(s_{j}\) in the sequence of the paper structure. Each section \(s_{i}\) includes content \(k_{i}\). A requirement for our analysis is to progress from the formal structure of a paper with its section contents to a logical sequence that is aligned with reporting guidelines. Our dataset (\(\mathbb{D}\)) is composed of such logical sequences, each corresponding to one paper.

For our analysis, we build on analysis techniques from process mining. Therefore, we recall the classical notions of process mining: event, event sequence, and event log. An event is a tuple \(e = (c, a, t)\) where \(c\) is the case id, \(a\) is the activity (event type), and \(t\) is the timestamp imposing a temporal order over the events. An event sequence is a sequence \(\sigma = \left\langle e_{1}, \ldots, e_{\vert\sigma\vert} \right\rangle\) of events such that \(\forall i, j \in \{1, \ldots, \vert\sigma\vert\}: e_{i}.c = e_{j}.c\). An event log \(\mathbb{L}\) is a multi-set \(\left\{\sigma_{1}, \ldots, \sigma_{\vert\mathbb{L}\vert}\right\}\) of sequences. In our dataset (\(\mathbb{D}\)), events represent content blocks that match an item of the reporting guidelines, activities define to which reporting activity a content block maps, and timestamps capture the order in which content blocks appear in the text of the paper. We define the alphabet \(A\) as the set of all activity types defined by the reporting guidelines of Jedlitschka et al. (2008). The content further describes an activity of the guidelines. In particular, we characterize each activity using a set of keywords. The keywords represent plausible terms to be used in the heading of a section. Therefore, an activity is described as a 5-tuple

$$a=(group, label, keywords, description, required)$$

where group is the name of the group of related activities to which a belongs, label is the name of the activity, keywords is a set of terms that define the activity, description is a short text that describes its purpose, and required indicates whether the activity is mandatory or not.
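To make this data model concrete, the following Python sketch shows one possible representation of activities and events; the class and field names mirror the definitions above, but the example values are hypothetical and not part of our coded dataset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Activity:
    """Reporting activity of the guideline (the 5-tuple defined above)."""
    group: str            # name of the group of related activities
    label: str            # name of the activity
    keywords: frozenset   # terms expected in section headings
    description: str      # short text describing the purpose
    required: bool        # whether the activity is mandatory

@dataclass(frozen=True)
class Event:
    """Event e = (c, a, t): a content block of one paper mapped to an activity."""
    case_id: str          # paper identifier
    activity: str         # label of the matched reporting activity
    timestamp: int        # position of the content block within the paper

# Hypothetical example entries (illustrative only)
design_experiment = Activity(
    group="Experiment planning",
    label="Design Experiment",
    keywords=frozenset({"design", "experimental design"}),
    description="Describes the experimental design of the study.",
    required=True,
)
first_event = Event(case_id="P042", activity="Define Abstract", timestamp=1)
```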

The logical sequence of activities defines the logical structure of reporting an experiment. Figure 1 shows this logical structure as a BPMN process model, serving as an example that describes the steps of reporting an experiment. Circles define the start and the end. Rectangles represent the activities and arcs the sequence of the activities. Diamond shapes represent gateways indicating that consecutive activities can be performed in any order or are mutually exclusive. In our example, the background of the experiment and the plan of the experiment can be reported in any order, while the activity describing the execution will only be included if there is a deviation in the experiment.

Fig. 1: Example of a process model for the process of reporting an experiment

3.2 Paper Selection

We selected papers according to guidelines for systematic literature reviews (Kitchenham and Charters 2007). We focused on papers reporting controlled software engineering experiments with human participants. We selected papers from the four major software engineering journals with the highest impact factor: Information and Software Technology (IST), IEEE Transactions on Software Engineering (TSE), Journal of Systems and Software (JSS), and Empirical Software Engineering (ESE). We conducted a search directly on the publishers' repositories for papers with the term “experiment” appearing either in the title or in the abstract and which were published between 2000 and 2020. We chose the 20-year period starting in 2000 because the first guideline for reporting controlled experiments was published by Singer (1999) in the year before. Therefore, only papers published after 1999 would have had the chance to report their experiments following a guideline.

We used only the term “experiment” in our query to guarantee high recall. We deliberately accepted the risk of including papers with this query that report on any type of experiment. We addressed the challenge of low precision by manually inspecting and removing papers that (i) do not present controlled experiments with human participants and (ii) do not use an experimental design as a research method.

Table 2 shows the number of papers retrieved using our search query (#Retrieved) and the number of papers remaining for analysis after the selection criteria were applied (#Analyzed). The last column shows the percentage of papers kept for analysis. It is worth mentioning that the journal Empirical Software Engineering has the highest percentage of kept papers. Our dataset of papers (\(\mathbb{D}\)) contains 168 papers. A list of all papers is included in the Appendix.

Table 2 Number of papers retrieved for each of the journals and the number of those considered for analysis

3.3 Coding Procedure

Next, we describe our coding procedure. The input to this procedure is a paper and its output is a structured representation of that paper in terms of a sequence σ. To this end, we sequentially process all sections and each respective subsection of the input paper in the order in which they appear in the text.

Each section or subsection s of the paper is matched with reporting activities of the guideline of Jedlitschka et al. (2008). This matching is done by members of the author team using two rules. First, the label terms of s are compared with the keywords of the different reporting activities. If an activity clearly matches, it is chosen. Second, if the label is unclear or ambiguous, the content of the section is read (Holsti 1969). As a result, a set of activities is identified or the section is ignored due to a lack of fit with the guideline.
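The first matching rule, comparing section labels with activity keywords, amounts to a simple lookup, as the following minimal Python sketch illustrates. The keyword table shown is a hypothetical excerpt, not our full coding scheme (see Table 3), and the second rule of reading the section content remains a manual judgment.

```python
# Hypothetical excerpt of the keyword table (see Table 3 for the full set)
KEYWORDS = {
    "Design Experiment": {"design", "experimental design"},
    "Assess Threats to Validity": {"threats", "validity"},
    "Explore Data": {"descriptive statistics", "data exploration"},
}

def match_section(heading):
    """Return candidate activities whose keywords occur in the section heading."""
    heading = heading.lower()
    return [activity for activity, keywords in KEYWORDS.items()
            if any(keyword in heading for keyword in keywords)]

print(match_section("4.2 Experimental Design"))  # ['Design Experiment']
print(match_section("Threats to Validity"))      # ['Assess Threats to Validity']
```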

The coding of each section or subsection based on their content and the meaning associated with each reporting activity is critical for the validity of our study. Therefore, we adopted the procedure of Recker et al. (2019) and divided the coding into two phases. The first one addressed the consistency of the coding scheme, i.e. the definition of the activities, and the second one the coding itself.

In the first phase, a random sample of 40 papers (approximately 24% of the papers) was selected for defining the coding scheme using keywords and a description for each activity. The refinement of this initial coding scheme was done in four rounds. In each round, two authors coded ten papers, discussed the inconsistencies and improved the coding scheme. After the fourth round, no further refinements of the coding scheme were identified, which completed the first phase. In the second phase, the remaining set of papers was coded by one author.

Following Recker et al. (2019), at each refinement round we calculated the agreement between the two authors using Cohen's Kappa (Cohen 1960) as a measure of intercoder reliability. Figure 2 depicts the Kappa analysis over the rounds. Figure 2(a) visualizes the improvement of Kappa over the rounds, with the final round achieving a Kappa of 0.91, indicating almost perfect agreement (Neuendorf 2002, p.145). Figure 2(b) also shows for each round the value of Kappa and the number of codings done by each author (column Rows). Given that the number of sections varies from paper to paper, the number of codings also differs for each round. In the discussions between the authors, we kept track of the number of codes they agreed on. We calculated the number of correct codes for each author, i.e. the number of times their code was eventually chosen. The author with the better coding record was chosen to do the rest of the coding of the dataset in the second phase.
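For illustration, the agreement for a single refinement round can be computed as follows; we show scikit-learn's implementation of Cohen's Kappa, and the two coding lists are hypothetical placeholders rather than our actual coding data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical codings of the same sections by the two authors (one code per section)
coder_a = ["Define Problem", "Define Objectives", "Design Experiment", "Explore Data"]
coder_b = ["Define Problem", "Define Context",    "Design Experiment", "Explore Data"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")
```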

Fig. 2: The graph in (a) shows the Kappa value evolution over the rounds. The table in (b) provides the details of the Kappa analyses: the value over the rounds and the total number of codings per round

Table 3 depicts the final set of activities. There is a one-to-one relation with the content of the Jedlitschka et al. (2008) guideline shown in Table 1. The label of each activity is summarized together with its corresponding keywords. Whether an activity is mandatory or optional is determined by following the definition of required content presented in the Jedlitschka et al. (2008) guideline. Sections considered required in the guideline are mandatory activities, while sections not required are optional. The required contents Abstract and Keywords of the Jedlitschka et al. (2008) guideline are considered as mandatory activities in our approach with the labels Define Abstract and Define Keywords. They were both omitted from Tables 1 and 3 because of space restrictions. Thus, 29 distinct activities are considered, of which 19 are required and 10 are optional.

Table 3 The list of activities considered with the final set of keywords (Define Abstract and Define Keywords omitted for space restrictions)

As a final step of data preparation, we merged consecutive activities of the same type into one. For instance, a sequence such as <Define Goals, Design Experiment, Design Experiment, Interpret Results, Interpret Results, Interpret Results, Summarize Findings> is compressed to <Define Goals, Design Experiment, Interpret Results, Summarize Findings>. This reduced the total number of activities in the event log by 16%.
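This compression step is straightforward to automate; the following sketch shows one way to express it in Python, using the example sequence from the text.

```python
from itertools import groupby

def compress(sequence):
    """Merge consecutive activities of the same type into one occurrence."""
    return [activity for activity, _ in groupby(sequence)]

example = ["Define Goals", "Design Experiment", "Design Experiment",
           "Interpret Results", "Interpret Results", "Interpret Results",
           "Summarize Findings"]
assert compress(example) == ["Define Goals", "Design Experiment",
                             "Interpret Results", "Summarize Findings"]
```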

3.4 Applied Process Mining Techniques

We followed the described procedures and obtained a dataset \(\mathbb{D}\) that codes a set of papers using the data structure of an event log as used in process mining. For this reason, various analysis techniques from process mining can be readily applied. Such analyses can be used to investigate to which extent reporting guidelines are considered in each paper, which patterns of reporting exist, and which changes of patterns over time can be observed.

Next, we describe the analysis techniques that we include in our analysis. Sections 3.4.1 and 3.4.2 describe automatic process discovery and conformance checking, respectively. Section 3.4.3 explains how we use clustering techniques.

3.4.1 Process discovery

Process discovery takes an event log as an input and automatically generates a process model representing the sequences of that event log. Figure 3 describes how process discovery works using a simple artificial example. The starting point is the event log shown in Fig. 3(a). It contains three different sequences of activities. Process discovery algorithms construct a process model from such an event log based on behavioral relations between the activities. All sequences have the same two initial activities (Define Goals, Design Experiment). This pattern is reflected in the output model by including a sequence of these two activities as a mandatory flow. The subsequent behavior is different for the three sequences. Discovery algorithms spot that the first and the second sequence execute the same two activities (Explore Data, Test Hypothesis), but in a different order, and that the third sequence includes a third activity instead (Interpret Results). These observations are reflected in the model by parallel and exclusive gateways, respectively, creating different flow options. The suffix is the same for all three sequences and therefore the final activity Summarize Findings is included as mandatory. Figure 3(b) shows the discovered model.

Fig. 3: Illustration of an event log (a) and the discovered process model (b)

Event logs from practice are far more complex than this illustrative example. This implies the challenge of representing the behavior compactly and, specifically, a trade-off between (i) fitness: the discovered model should allow for the behavior observed in the event log; (ii) precision (avoid underfitting): the discovered model should not allow behavior that was not observed in the event log; (iii) generalization (avoid overfitting): the discovered model should generalize the observed behavior seen in the event log; and (iv) simplicity: the discovered model should be as simple as possible (van der Aalst 2016).

If the sequences are similar in terms of their behavior, the derived process model will have high fitness and a clear and simple structure. However, if the sequences are substantially different in terms of order and activities, the discovered model is often chaotic; it is also referred to as a spaghetti model (van der Aalst 2016). These spaghetti models are hard to analyze and reveal a lack of patterns in execution.

In this study, we use process discovery techniques to investigate Proposition 1, i.e. to which extent common reporting patterns exist in the considered papers on software engineering experiments. More specifically, we use the tool Disco, a widely used commercial process mining tool, to discover a process model from the event log of our dataset (Section 3.3).
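Disco is a commercial tool, so we do not show its internals here. As a tool-agnostic illustration of the kind of information such a discovery step relies on, the following sketch computes the directly-follows relation, i.e. the backbone of the process maps that Disco renders, from coded sequences. The two sequences shown are shortened, hypothetical examples.

```python
from collections import Counter

def directly_follows(log):
    """Count how often activity a is directly followed by activity b across all sequences."""
    dfg = Counter()
    for sequence in log:
        for a, b in zip(sequence, sequence[1:]):
            dfg[(a, b)] += 1
    return dfg

# Hypothetical, shortened reporting sequences
log = [
    ["Define Abstract", "Define Keywords", "Define Problem", "Design Experiment"],
    ["Define Abstract", "Define Keywords", "Define Problem", "Define Objectives"],
]
for (a, b), count in directly_follows(log).most_common(3):
    print(f"{a} -> {b}: {count}")
```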

3.4.2 Conformance Checking

Conformance checking techniques provide insights into the extent of consistency between a process model and the sequences of an event log. They take an event log and a model as input and identify the extent to which the event sequences deviate from the model. Several techniques exist, e.g. replaying each sequence against the process model or calculating an alignment (van der Aalst et al. 2012). Given a sequence and a process model, an alignment maps the event sequence to the process model's best fitting run.

Consider, for instance, the process model in Fig. 3(b). If the event log includes the sequence <Define Goals, Design Experiment, Summarize Findings>, a conformance checking algorithm will report that the activity Interpret Results or the activities Explore Data and Test Hypothesis are missing from the sequence. In the same way, if a sequence like <Define Goals, Design Experiment, Interpret Results, Define Lessons Learned, Summarize Findings> is observed, conformance checking reports that an unexpected activity (Define Lessons Learned) was executed.

In our study, we use conformance checking techniques to investigate Proposition 2, i.e. to analyze to which extent software engineering experiments follow the guidelines proposed by Jedlitschka et al. (2008). To that end, we manually created a process model based on these guidelines. Since a paper is written in a sequential way, the list of activities in Table 3 defines a sequence. Optional activities are modeled within XOR-gateway blocks. Figure 4 shows the corresponding process model. We use this model for checking the conformance between the reporting sequences of individual papers and the guidelines by Jedlitschka et al. (2008). We calculate conformance with the plugin Replay a Log On Petri Net For Conformance Analysis (van der Aalst et al. 2012) of ProM.

Fig. 4: Process model capturing the list of activities depicted in Table 3
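We computed conformance with the ProM plugin mentioned above. As an alternative for readers who prefer a scripted workflow, a comparable fitness value can be obtained with the open-source pm4py library, sketched below under the assumption that the coded event log is exported as XES and the guideline model of Fig. 4 as PNML; the file names are placeholders, and the function names refer to pm4py's simplified interface, which may differ between versions.

```python
import pm4py

# Placeholder file names: the coded reporting sequences as XES,
# the guideline process model of Fig. 4 as a Petri net in PNML.
log = pm4py.read_xes("reporting_sequences.xes")
net, initial_marking, final_marking = pm4py.read_pnml("jedlitschka_guideline.pnml")

# Token-based replay fitness, comparable in spirit to the ProM replay plugin;
# the returned dictionary contains log-level and average trace fitness values.
fitness = pm4py.fitness_token_based_replay(log, net, initial_marking, final_marking)
print(fitness)
```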

Also, we use conformance checking and its evolution over time to evaluate Proposition 3 and, in part, Proposition 5, in combination with potential factors associated with reporting practice, such as citation of guidelines and replication.

3.4.3 Cluster Analysis

Cluster analysis allows the identification of groups of sequences in an event log. Two sequences are put into the same cluster if they are similar in terms of a suitable distance function. Various techniques for calculating distances between sequences have been proposed for social sequence analysis (Abbott 1995; Gabadinho et al. 2011) and process mining (Song et al. 2008; De Koninck et al. 2017).

In the context of our work, we use cluster analysis to address Proposition 4, i.e. to investigate whether there are different recurrent patterns for reporting experiments. It is not clear how many clusters can be expected. If all papers followed the reporting guideline structure of Jedlitschka et al. (2008), we would obtain one cluster containing very similar reporting sequences. If papers reported experiments arbitrarily, we might obtain a high number of rather dissimilar sequence clusters. It is more plausible to expect only a few clusters. In that case, it is interesting to investigate which sequences are representative for each cluster and how far they differ.

We use the TraMineR tool (Gabadinho et al. 2011), an R package for exploring sequence data, for our cluster analysis. For calculating the sequence distance, we used the optimal matching algorithm (Abbott and Tsay 2000).
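The clustering itself was done in R with TraMineR. For readers more familiar with Python, the sketch below illustrates the underlying idea with a plain edit distance between activity sequences, a simple stand-in for optimal matching with unit costs, fed into hierarchical clustering; the sequences and the number of clusters are purely illustrative.

```python
from scipy.cluster.hierarchy import linkage, fcluster

def edit_distance(s, t):
    """Levenshtein distance over activity labels (unit insert/delete/substitute costs)."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(t) + 1)] for i in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    return d[len(s)][len(t)]

# Hypothetical coded sequences
sequences = [
    ["Define Abstract", "Define Problem", "Design Experiment", "Test Hypothesis"],
    ["Define Abstract", "Define Problem", "Design Experiment", "Explore Data"],
    ["Define Abstract", "Define Objectives", "Explore Data"],
]
n = len(sequences)
# Condensed pairwise distance matrix, then average-linkage clustering into two groups
condensed = [edit_distance(sequences[i], sequences[j])
             for i in range(n) for j in range(i + 1, n)]
clusters = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")
print(clusters)
```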

3.5 Propositions and Corresponding Techniques

Table 4 summarizes Propositions 1-5, the corresponding analysis techniques that we apply for investigating them, and the corresponding tools used. In Fig. 5, we indicate the input and output for each of the analysis techniques.

Table 4 Analysis techniques used to investigate Propositions 1-5
Fig. 5: Research strategies and applied methods

4 Results

This section describes the results of our study of reporting practices of software engineering experiments. Section 4.1 provides descriptive statistics of our dataset. Section 4.2 presents the results of analyzing the data using automatic process discovery. Section 4.3 discusses the conformance checking results, which provide insights into how well aligned the articles are with reporting guidelines. Section 4.4 describes the results of clustering articles according to their reporting sequences. Section 4.5 evaluates to which extent citing guidelines is connected with guideline conformance. Finally, Section 4.6 presents observations on how replication studies use reporting guidelines.

4.1 Descriptive statistics

Our event log contains 168 cases (each describing a paper and the sequence of its reporting steps) from the year 2000 until 2020. Figure 6 shows the temporal distribution of these papers for each of the four journals. For every year, there are three or more papers in our analysis.

Fig. 6: Distribution of the papers over time

Table 5 shows log statistics about the activities before (\(\#Activities\)) and after compressing consecutive activities (\(\#Activities_{c}\)). The table provides the number of activities per paper (maximum (Max), minimum (Min), and average (Avg)) and the number of distinct activities in the whole log (maximum (Max), minimum (Min), and average (Avg)). It is interesting to note that we did not encounter any sequence that occurred more than once, which means that every paper's reporting sequence was unique. TSE was the only journal in which not all 29 reporting activities were observed. Three optional activities were missing: Define Experiment Deviation, Define Experiment Preparation, and Define Relevance to Practice. Another interesting observation is that the average number of distinct activities per paper is less than the number of required activities (19) for all journals.

Table 5 Log statistics

Furthermore, several activities are repeated within an average paper, i.e. the number of distinct activities per paper (\(\#Distinct_{Avg}\)) differs from the number of activities per paper (\(\#PerPaper_{{Avg}_{c}}\)). Table 6 shows each activity's frequency in the event log (Log Frequency) and in how many papers each activity appears (Paper Frequency). The corresponding percentages are also presented.

Table 6 Frequency of each activity in the entire event log and per paper

4.2 Process discovery

To check if the papers follow recurring patterns of reporting, we applied automatic process discovery for the complete event log using the tool Disco. Figure 7 shows the process model discovered from the set of all papers. In this model, all possible paths are shown. We observe that the complexity of this model is overwhelming, and that it is difficult to spot patterns of recurring behavior.

Fig. 7: Process model discovered from the complete event log. All possible paths are represented

The two activities that all papers include and that have a clear position in this spaghetti model are the Define Abstract and Define Keywords activities at the top of the model (not readable in the figure). This is probably not due to the guidelines, but due to the fact that paper submission formally enforces the inclusion of an abstract and keywords. Therefore, it is not surprising that they are observed in all papers and in this order.

Once we apply filtering techniques provided by Disco to only show the minimum number of paths for connecting all 29 activities, we obtain the process model shown in Fig. 8. Compared with the spaghetti model in Fig. 7, this model is easier to understand and interpret. The darker the color of an activity, the more often this activity occurred in the event log. The activity Explore Data is the most frequently observed activity, occurring 350 times (due to repetitions in various papers). The thicker the transition arrows, the more often the corresponding path is observed in the log. The most frequent sequence of activities is from activity Explore Data to activity Test Hypothesis with 204 occurrences (due to repetitions in various papers). In this filtered process model, the frequency associated with an activity is greater than or equal to the sum of the frequencies of its outgoing transition arrows, because not all possible arrows with their corresponding frequencies are shown. This process model shows that papers usually start with the definition of the problem (99% of the 168 papers) followed by the definition of the experiment's objectives (97%). We also notice that many activities and many transitions are only observed for a smaller fraction of papers.

Fig. 8: Process model discovered from the complete event log considering the minimum path for connecting all activities

4.3 Conformance Checking

Conformance checking is a group of techniques that facilitates the comparison between the sequences represented in a process model (such as reporting guidelines) and the sequences of papers observed in our event log. We conducted such a conformance check for each paper based on the process model shown above in Fig. 4, which captures the reporting guidelines by Jedlitschka et al. (2008). We used the classical notion of fitness as a measure of conformance (van der Aalst 2016). A sequence fully conforming with the process model has a fitness of 1, while a sequence that does not conform at all has a fitness of 0. We summarize the results for each journal separately and in total. Figure 9 shows the Box plot of this conformance analysis.

Fig. 9: Conformance analysis of each journal independently and all journals together

The bulk of papers range between 0.6 and 0.7 in terms of conformance. Given that the data is normally distributed, we performed a one-way ANOVA test with no assumption of equal variances. The difference between the mean values of the four journals is statistically significant at the 95% confidence level (F = 6.1574, num df = 3, denom df = 74.535, p-value = 0.0008497). The journal with the highest average conformance is ESE. This is not surprising given that it is the journal with the highest affinity with controlled experiments. It also has to be noted that we do not observe drastic differences in conformance between the journals. The Box plot of Fig. 9 also highlights some outliers with either outstanding or very low conformance. The two papers with the highest conformance are from JSS and the third-highest from TSE. All three papers explicitly cited guidelines, with two of them citing the Jedlitschka et al. (2008) guideline. JSS and TSE are also the journals of the two papers with the lowest conformance of below 0.4. These papers did not cite any guidelines.

We also analyzed the evolution of the conformance over the 20 years in which the papers were published. Figure 10(a) shows a Box plot of the conformance of all papers for each year. We observe a slight increase in the average until the year 2008, when Jedlitschka et al. (2008) was published. Figure 10(b) shows the evolution of the average conformance over the years for each journal and also for the event log with all the papers. All journals show a similar evolution without any clear upward or downward trend over the years. More specifically, we do not observe any noticeable change after the year 2008 when Jedlitschka et al. (2008) was published. The peak of the curve for JSS in 2009 stems from the fact that only one experiment paper was published in that year in this journal and that this paper is an outlier with the highest guideline conformance of the whole set of papers. In summary, Fig. 10(a) and (b) show the same range of average conformance between 0.6 and 0.7 that we already observed in Fig. 9. Also, the KPSS test (Kwiatkowski et al. 1992) showed that the conformance time series is stationary at the 0.05 significance level (KPSS Level = 0.16735, Truncation lag parameter = 4, p-value = 0.1), without any clear upward or downward trend.

Fig. 10: Conformance analysis of all papers over the years
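The stationarity check can be reproduced with the statsmodels implementation of the KPSS test, as sketched below; the yearly conformance values shown are hypothetical placeholders rather than our measured averages.

```python
from statsmodels.tsa.stattools import kpss

# Hypothetical yearly average conformance values for 2000-2020 (21 values)
yearly_conformance = [0.63, 0.66, 0.61, 0.68, 0.64, 0.67, 0.65, 0.62, 0.69, 0.66,
                      0.64, 0.67, 0.63, 0.65, 0.68, 0.66, 0.64, 0.67, 0.65, 0.66, 0.64]

# Level-stationarity test (regression="c"); a p-value above 0.05 means the
# null hypothesis of stationarity is not rejected.
statistic, p_value, lags, critical_values = kpss(yearly_conformance, regression="c")
print(f"KPSS level statistic: {statistic:.3f}, p-value: {p_value:.2f}")
```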

4.4 Cluster Analysis

For the cluster analysis, we use the TraMineR tool (Gabadinho et al. 2011). This tool supports clustering based on classical sequence alignment. This means, in essence, that sequences are clustered based on a notion of sequence edit distance. Different numbers of clusters were evaluated and the best result yielded four clusters. Figure 11 shows the four clusters of sequences. The X-axis represents the position (A) in a sequence. It is scaled to 41, which is the maximum number of activities, stemming from the paper with the longest sequence (compare Table 5 above). The Y-axis represents the relative frequency of color-coded activities. Note that the area covered by a color bears the relevant information, not necessarily the position of a color dot. Rows in the diagram should not be interpreted as individual papers. The red color corresponds to missing values, which are appended to fill up shorter sequences to a length of 41. Each cluster is also annotated with the number of sequences it contains: Cluster 1 has 56 sequences, Cluster 2 has 47, Cluster 3 has 45, and Cluster 4 has 20.

Fig. 11: Clusters identified by using TraMineR

All clusters have a low Average Silhouette Width (0.2, 0.09, 0.12, and 0.03, respectively), indicating that there is no clear cluster structure (Kaufman and Rousseeuw 1990). Also, there are 4, 11, 6, and 5 representative sequences, respectively, considering a coverage threshold of 25%. The absence of a unique representative sequence for the clusters underlines that the clusters are unstructured. However, it is possible to spot some commonalities. Given that the average length of a sequence is 22, which is almost half of the longest sequence, all clusters have a large red area indicating missing values. We also observe that all four clusters have a similar start pattern that corresponds to the sequence Define Abstract, Define Keywords followed by Define Problem and Define Objectives.

There are also some noticeable differences. Cluster 1 contains the shortest sequences on average. It is also the cluster with the smallest number of explicit Test Hypothesis activities. The activity Design Experiment is also less frequent than in the other clusters. The average fitness of this cluster with the reporting guidelines is 0.69. In contrast to Cluster 1, Cluster 3 has longer sequences. Most of them range between 20 and 25 activities, while those of Cluster 1 range between 15 and 18. Cluster 3 also has a larger relative frequency of Test Hypothesis and Assess Threats to Validity. Still, the overall proportion of these activities is low. All this suggests that Cluster 3 includes papers with a more structured reporting style. The average fitness is highest for this cluster with 0.73.

Cluster 2 appears to be a rather homogeneous group of papers with a structured reporting style. There is ample attention to Define Task, Design Experiment, and Explore Data together with a clear methodological focus on Test Hypothesis and Assess Threats to Validity. In contrast to the heterogeneous Cluster 4, the papers in Cluster 2 seem to have a clear focus, which results in a detailed, yet compact reporting style. The average fitness of Cluster 2 is 0.70, and for Cluster 4 it is 0.62.

4.5 Guideline Citation Analysis

This section follows up on the question of whether an explicit consideration of guidelines leads to a more structured reporting style. To this end, we analyze to which extent papers cite established guidelines. First, Section 4.5.1 investigates how many papers cite reporting guidelines, including Jedlitschka et al. (2008), and whether this is connected with conformance. Second, Section 4.5.2 analyzes which papers have been citing Jedlitschka et al. (2008).

4.5.1 Guideline Citation of Software Engineering Experiment Papers

One potential explanation for higher conformance with reporting guidelines is whether reporting guideline papers were considered at all. We observed above that the conformance of papers with the guidelines by Jedlitschka et al. (2008) ranges between 0.6 and 0.7. The focus of our analysis here is on three questions: (i) is the Jedlitschka et al. (2008) guideline (G1a) or its previous version (Jedlitschka and Pfahl 2005) (G1b) cited? (ii) is the book by Wohlin et al. (2012) (G1c), which points to Jedlitschka et al. (2008), cited? (iii) does a paper consider any other guideline, i.e. Singer (1999) (G2), Wohlin et al. (2000) (G3), Juristo and Moreno (2001) (G4), or Kitchenham et al. (2002) (G5)?

We analyzed for all 168 papers which guidelines they cite. As a result, we found that 57 papers (34%) do not cite any guideline at all. The remaining 111 papers cite at least one guideline. Table 7 shows the number of citations each guideline received. The percentage is calculated over the total number of papers that could potentially have cited the respective guideline. This means that a paper published in or after the publication year of a guideline could potentially have cited it.

Table 7 Number of citations over the total number of possible citations for each guideline

The most cited guideline is the book by Wohlin et al. (2000), with almost 43% of the potential papers referring to it. Jedlitschka et al. (2008) is cited by 17.5% of the potential papers; however, the book by Wohlin et al. (2012), which recommends Jedlitschka et al. (2008) as a guideline, received almost double the citations. Figure 12 presents a comparison of the number of citations that Wohlin et al. (2000), Jedlitschka et al. (2008), and Wohlin et al. (2012) received over time. We observe that from 2014, when papers started citing Wohlin et al. (2012), the citations of Jedlitschka et al. (2008) and Wohlin et al. (2012) evolved roughly in parallel to each other, while citations of Wohlin et al. (2000) decreased.

Fig. 12: Citation comparison between the Jedlitschka et al. (2008) (G1a), Wohlin et al. (2012) (G1c), and Wohlin et al. (2000) (G3) guidelines

Figure 13 shows a Box plot comparison of the fitness between those papers that cited at least one of the guidelines (YES) and those that did not cite any guideline at all (NO). We can observe a slightly higher fitness for those papers that cited guidelines. We performed a Welch t-test (Ruxton 2006). The difference between the mean values of the two groups is statistically significant at the 95% confidence level (t = − 4.1471, df = 132.02, p-value = 5.987e − 05), showing an association between conformance and guideline citation.

Fig. 13: Conformance checking comparison between all the papers that cited any of the guidelines and the papers that did not cite any of the guidelines
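The comparison can be reproduced with a standard Welch t-test, e.g. using SciPy as sketched below; the fitness values shown are hypothetical placeholders, not our measured data.

```python
from scipy.stats import ttest_ind

# Hypothetical fitness values for papers citing a guideline vs. papers citing none
fitness_citing = [0.71, 0.68, 0.73, 0.66, 0.70, 0.69]
fitness_not_citing = [0.62, 0.64, 0.60, 0.65, 0.63, 0.61]

# equal_var=False selects Welch's t-test (no equal-variance assumption)
t_stat, p_value = ttest_ind(fitness_citing, fitness_not_citing, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```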

Also, 30% of the papers cite only one guideline. Among them, only 7 papers cite either the Juristo and Moreno (2001) or the Kitchenham et al. (2002) guidelines. The remaining papers cite versions of the Jedlitschka et al. (2008) guidelines (G1a, G1b, or G1c) or the Wohlin et al. (2000) guidelines. A pairwise correlation analysis between the Jedlitschka et al. (2008) guideline and the other guidelines (G2, G3, G4, and G5) shows a correlation with G2 and no correlation with the others (Pearson's Chi-squared test at the 0.05 significance level). A correlation between the four guidelines (G2, G3, G4, and G5) was not found either. Thus, it was not possible to identify a pattern in the citation of guidelines.

4.5.2 Citation from Various Types of Papers

It is surprising that Jedlitschka et al. (2008) is highly cited overall, but the share of software engineering experiment papers citing it is rather low at 17.5%. Therefore, we investigate which types of papers cite Jedlitschka et al. (2008).

A search on Google Scholar yielded a list of 300 citations. These citations are distributed as follows: 100 conference papers, 103 journal papers, and 97 other types of publication, such as books, dissertations, and theses. We looked more deeply into the journal and conference papers and classified them as Method (M), Experiment (E), or Survey (S) papers. Table 8(a) shows the statistics of papers per category. We notice that the majority of the papers are from the category Experiment (E). Considering the papers published in one of the four journals, we observed that 35 are experiments. Of these, we analyzed the 21 papers that cite Jedlitschka et al. (2008); the remaining 14 papers fall outside our study's scope (experiments with human participants in which the experiment is the main contribution of the paper). This means that a large fraction of the citations of Jedlitschka et al. (2008) stem from conference papers and articles that are not published in IST, TSE, JSS, or ESE.

Table 8 Number of journal and conference papers per category (a)

4.6 Replication Analysis

Replication is one of the principles of the scientific method (Brown et al. 2016). Replication studies are an important means to further strengthen the trust in the validity of results obtained by the original experiment. It has been emphasized that a replication study has to describe how the original study was performed, report the implementation differences of the replication, and provide a comparison of the results (Carver 2010). It is also important to describe the type of replication (Bettis et al. 2016) and to use a shared terminology (Baldassarre et al. 2014). Due to these specific aspects, replication experiments require a separate structure of reporting. Table 9 shows the number of replication papers that are part of our analysis.

Table 9 Number of replications for each of the journals and their percentage

IST and ESE are the journals with the most replication papers. This is consistent with the results of a recent systematic mapping study on replication (Cruz et al. 2020). In their mapping study, Cruz et al. (2020) emphasize the lack of guidelines for reporting replications. They state that the guideline proposal by Carver (2010) is the only high-level proposal for reporting replications so far. This lack of guidelines might also explain our finding that nine of the replications were internal replications published together with the original study. Twenty of the twenty-five replication papers were published after the year 2010 and therefore could have followed Carver's guidelines (Carver 2010). However, only nine did (45%).

Figure 14 shows to which extent the replication papers conform with Jedlitschka et al. (2008). The fitness of the replication papers is comparable to that of the other papers. The difference between the mean values of the two groups is not statistically significant at the 95% confidence level based on a Welch t-test (Ruxton 2006) (t = − 0.22898, df = 29.114, p-value = 0.8205), showing no association between conformance and replication.

Fig. 14: Conformance checking comparison between papers that report a replication (YES) and papers that report an original study (NO)

4.7 Summary of Results

We summarize the results of our mixed-method study as follows. The descriptive analysis revealed that a consistent reporting style is only followed to a limited extent. The variance is high, with some papers reporting only a fraction of the required reporting activities, others repeating activities, and only a few showing high fitness with the reporting guidelines. The application of process mining techniques confirmed the impression of a rather inconsistent reporting practice. Conformance checking revealed that the reporting sequences of most papers only partially follow the guidelines by Jedlitschka et al. (2008). The clustering analysis suggested that only Cluster 2 appears to consider most of the important activities. The conformance analysis over time showed that there is no consistent improvement of reporting practice over time. These findings are consistent with our citation analysis, which revealed that the highly cited guidelines by Jedlitschka et al. (2008) are often not considered in experiments published in IST, TSE, JSS, and ESE. Furthermore, although replication studies could be expected to show lower fitness due to their special reporting requirements, we observe that they exhibit conformance similar to papers reporting original studies.

5 Discussion

In this section, we first discuss our findings in the light of the formulated propositions. Then, we describe implications for research on software engineering experiments. Finally, we reflect upon threats to validity of our study.

5.1 Propositions and Findings

We formulated several propositions that motivated our study. We discuss the findings for each of them in turn.

Proposition 1 (Patterns)

described the expectation that clear reporting patterns can be identified. We actually found that the process model that we automatically discovered from our event log of all papers is overwhelmingly complex and that no two papers exhibit the same sequence of reporting activities (Section 4.2). We argued in support of Proposition 1 that routinization within author teams (Gersick and Hackman 1990) and social norms enforcing reporting practices (DiMaggio and Powell 1983) would lead to recurring reporting patterns. As patterns apparently exist only to a limited extent, there is doubt whether routinization and social norms are effectively in place.

Proposition 2 (Conformance)

argued that compliance of reporting with established reporting guidelines could be expected. We found that conformance-checking measures fall in a middle range between 0.6 and 0.7 (Section 4.3). Apparently, institutional pressures are not strong enough, with the effect that it is possible to publish experiments in the best software engineering journals without strictly following established guidelines.

Proposition 3 (Evolution)

formulated the expectation that reporting practices would have evolved over time. Our findings suggest that an increasing uptake of guidelines has not happened (Section 4.3). There is at least no evidence of an increase in fitness with established guidelines. This observation holds for the overall period from 2000 to 2020, and the publication of the guideline by Jedlitschka et al. (2008) does not seem to have made a difference either. Following the arguments by Levitt and Nass (1989) on the similarity of textbooks in scientific fields, this could be interpreted as a sign that the discipline has not yet matured and converged in terms of reporting.

Proposition 4 (Contingency)

argued that similar reporting might be observable for subgroups of papers. What we found is that the clusters we identified are largely distinguished by their level of guideline compliance rather than by shared characteristics of the papers themselves (Section 4.4). This suggests that the differences are unsystematic.

Proposition 5 (Factors)

related differences in reporting to various factors. We found that papers citing reporting guidelines conform more closely (Section 4.5) and that whether an experiment is a replication makes no difference in fitness (Section 4.6). Also, using the reporting of effect sizes (Kampenes et al. 2007) as a proxy for quality, the results showed no correlation between quality and conformance with a guideline. These observations suggest that differences in reporting are largely unsystematic; the only explanation for differences that we found is whether established reporting guidelines were explicitly considered or not.

5.2 Implications for Reporting Software Engineering Experiments

Our findings have various implications for further developing reporting practices of software engineering experiments.

First, we were surprised by the observation that the variation in reporting sequences is large and that 57 of the 168 analyzed papers do not refer to any guideline at all. This finding can be interpreted as a sign of low institutional pressure (DiMaggio and Powell 1983) to adhere to experiment reporting guidelines in software engineering journals. This raises the question of why that might be the case. One may speculate that it is connected with the limited coverage of empirical methods in software engineering curricula. The IEEE Software Engineering Body of Knowledge (Version 3) covers experiments and empirical methods on eight pages, albeit only in the recently added 15th and final chapter (Bourque et al. 2014). This is already an improvement over the previous version (Abran et al. 2004), which did not mention empirical methods at all. One direction for improving the situation could be to involve method experts more extensively in reviewing experiment papers, alongside subject-matter experts. Recognized guidelines, such as the ones by Jedlitschka et al. (2008) later endorsed by Wohlin et al. (2012), are available and can also be helpful during the review process. It is surprising that many papers are incomplete even in those parts where the guidelines are easy to comply with, such as a structured introduction and conclusion and an explicit discussion of threats to validity.

Second, our findings also raise the question of whether some potential weaknesses of the guidelines could explain at least a share of the non-conformance. We observe that the description of hypotheses and variables can be found in different sections, although Jedlitschka et al. (2008) suggest always presenting them as part of the experimental planning section. Also, few papers used the execution section to describe deviations, and the ones that did often misinterpreted its purpose, describing the experimental procedure without providing any details about deviations. Such misinterpretations could potentially be addressed by sharpening the section’s title or its description. In defense of Jedlitschka et al. (2008), however, it is also the authors’ responsibility to carefully read and reflect upon the purpose of each reporting activity.

Third, guidelines like the ones by Jedlitschka et al. (2008) should be further evaluated regarding their completeness. Indeed, as our study design focuses largely on the conformance between Jedlitschka et al. (2008) and individual papers, we cannot directly observe to what extent authors felt the need to report experiment-related content that is not covered by these guidelines. We recognize the increasing importance of theorizing for empirical research in software engineering (Hannay and Sjøberg 2007; Wohlin et al. 2015; Ralph 2018; Ralph et al. 2021). In the neighboring field of information systems research, theorizing is given precedence over research design (Recker 2012), and experimental guidelines emphasize the importance of theory for justifying the overall research design (Burton-Jones et al. 2009). The reporting promoted by Jedlitschka et al. (2008) might be too restricted to provide appropriate space and attention for theorizing. A theoretical discussion would be a useful addition to the background section, which in Jedlitschka et al. (2008) only includes research technology, alternative technology, related studies, and relevance to practice. Future research, for instance an interview-based study, can help to develop a deeper understanding of guideline usage and its challenges.

5.3 Threats to Validity

As for any empirical study, threats to validity have to be considered for our study. We focus on conclusion, construct, and external validity. As our design does not single out causality, the usual threats to the internal validity of correlational studies apply. Beyond the reflection that follows, we made a replication package availableFootnote 7, which is meant to help other researchers criticize and build upon our arguments.

Conclusion validity:

is defined as the degree to which the conclusions we reach about the relationships in the study’s data are reasonable (Trochim and Donnelly 2001). Frequent threats to conclusion validity emerge from the erroneous inference of relationships that are not backed by the data, measurements, and analysis results (Trochim and Donnelly 2001). Our study established a correlational argument about the connection between the publication of reporting guidelines for software engineering experiments and the practice of reporting. The various weaknesses of correlational studies in terms of conclusion validity, such as the ones previously mentioned, apply. However, correlational studies such as surveys are appropriate for exploring, describing, and explaining a phenomenon (Recker 2012) such as experimental reporting in software engineering, as in our study. To address potential threats to conclusion validity, we iterated our coding protocol until we achieved high inter-rater reliability in terms of Kappa (0.91). Furthermore, we conducted several in-depth analyses on our sample, which gives us confidence that the conclusions drawn from the data are credible and reasonable. A limitation of our design is the lack of insight into the rationale why authors and reviewers do not adopt or enforce guidelines more strongly. A qualitative research agenda is needed to understand this phenomenon better.
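For reference, inter-rater reliability of the kind reported above can be computed as in the following sketch; it assumes Cohen's kappa as implemented in scikit-learn (the exact kappa variant is an assumption), and the two raters' labels are invented placeholders rather than our coding data.

    # Illustrative inter-rater agreement check (assumed: Cohen's kappa via scikit-learn).
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical codes assigned by two raters to the same six paper fragments.
    rater_1 = ["Goals", "Participants", "Analysis", "Threats", "Conclusions", "Goals"]
    rater_2 = ["Goals", "Participants", "Analysis", "Analysis", "Conclusions", "Goals"]

    print(f"kappa = {cohen_kappa_score(rater_1, rater_2):.2f}")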

Construct Validity:

is relevant for how we constructed our dataset (Wohlin et al. 2012). Although we adopted established procedures for coding research papers in literature reviews, as applied by Recker et al. (2019), we can identify three construct validity threats related to the measurement of conformance between papers and guidelines. The first one is whether the sections of a paper actually describe what their respective titles promise. This threat could be mitigated by reading and coding the full section contents. However, we restricted our coding to the explicit formal structure of sections, which means that relevant content is not visible to our coding procedure if it is not emphasized in a section title. The second threat is that papers might have cited guidelines other than the ones considered in our study, as we focused on guidelines from software engineering. If authors find guidelines from other fields useful, these could serve as a basis for revising software engineering guidelines.

Finally, the third threat relates to the situation in which a reporting guideline is cited but not followed. This issue can be addressed by performing conformance checking on the subset of papers that cite at least one guideline.

External Validity:

is concerned with the generalization of the findings (Trochim and Donnelly 2001). A threat to external validity thus points to limitations in generalizing the results to a larger context. Our study sampled papers reporting experiments from the four most prestigious software engineering journals published between 2000 and 2020. A possible limitation is that our study does not include experiment papers published in other software engineering journals or presented at conferences during the observed period. As our citation analysis of the Jedlitschka et al. (2008) guideline in Section 4.5.2 has shown, this guideline is also cited in experiment papers from other journals and conferences that could potentially fall within the scope of this study. This restricts our findings and conclusions to the selected journals and the chosen time frame. However, we can still argue that our results are relevant for the whole field of software engineering experiments, because highly ranked journals can be regarded as best-in-class examples for other journals and conferences. Another threat to external validity is that we manually selected the papers included in our study. To mitigate this, we followed the guidelines by Kitchenham and Charters (2007) and defined formal criteria to make the manual selection process transparent.

6 Conclusion

This paper has presented the findings of a mixed-method study into the reporting practice of controlled experiments published in major software engineering journals over the period from 2000 to 2020. To this end, we combined coding techniques from qualitative research with process mining techniques. We find that (1) many experimental papers in top software engineering journals do not report all of the information that published guidelines suggest is required, (2) there is no de facto standard sequence for reporting these items, (3) many experimental papers do not cite any reporting guidelines, and (4) these findings, on average, hold for early as well as for recent papers.

Our findings have important implications both for the practice of reporting software engineering experiments and for guideline design. First, we discussed the potential of using guidelines more explicitly during the review process. Second, we suggest partially refining the guidelines, among other things by giving theory a more explicit place. Some of the excellent experiment papers that consider reporting guidelines diligently can serve as role models for future experiment papers in software engineering.

There are several important questions for which our research design did not offer an answer. First, a central point of discussion during the review process was the connection between adherence to guidelines and a paper’s research quality. By focusing only on the four top journals, we assumed that quality had been assured in the review process. The variation in guideline conformance leaves room for doubt and calls for a research strategy that builds on expert judgment of research quality. Second, our analysis does not offer insights into the motives of authors to either adopt or ignore reporting guidelines. A qualitative research strategy is required to investigate this point further. We hope that our appendix and the replication package corresponding to this paper will help to address these points in future research.