1 Introduction

Reporting guidelines are an important concern for software engineering experiments. Arguably, using reporting guidelines makes it easier for the reader to understand the experimental design and the validity of the conclusions (Jedlitschka et al. 2008). These benefits have motivated the community to design and refine guidelines that support systematic and consistent reporting (Singer 1999; Wohlin et al. 2000; Juristo and Moreno 2001; Kitchenham et al. 2002; Shaw 2003; Jedlitschka et al. 2008).

Despite these efforts to establish standards, it has been observed that reporting in practice is often heterogeneous and important information is missing (Jedlitschka and Ciolkowski 2004; Sjøberg et al. 2005; Jedlitschka et al. 2008). Indeed, research on reporting guidelines has remained largely prescriptive. We know little about the extent to which reporting guidelines are used and how their uptake has changed over time. This might be because the number of controlled experiments has drastically increased since 2000 and because quantitative analysis of how these are reported is difficult. Still, gaining insights into actual reporting practice is important to further improve guidelines and reporting practices.

In this paper, we investigate actual reporting practices for controlled experiments with human subjects that have been published in major software engineering journals between the years 2000 and 2020. To this end, we use a mixed-method approach combining coding techniques from qualitative research with a formal analysis of event sequences from process mining. Our analysis reveals the actual reporting path of experiment papers and the degree of conformance for different journals over time. We find that conformance oscillates between 55% and 75% for all covered journals without a clear upward trend. Our subsequent citation analysis shows that roughly one-third of the papers do not refer to any of the experiment reporting guidelines, while replication studies hardly ever refer to the guideline by Carver (2010) for replications. Based on the observed results, we highlight several implications for improving both reporting practices and refining guidelines, among others by covering theory more explicitly.

The remainder of this paper is structured as follows. Section 2 discusses the role of experiments in software engineering with a focus on reporting guidelines. We present hypotheses on the presumable impact of these guidelines. Section 3 describes our mixed-method research design, including paper selection, coding procedures, and analysis techniques. Section 4 presents our analysis results, focusing on the conformance between the reporting sequence of papers and guidelines. Section 5 discusses the implications of our research and threats to validity. Section 6 concludes with a summary and an outlook on future research.

2 Background

In this section, we first discuss the role of experiments in software engineering. Then, we revisit reporting guidelines for experiments in software engineering. Finally, we hypothesize how reporting practice could be expected to develop over time.

2.1 Experiments in Software Engineering

Experiments are an important means for generating new scientific insights. Gauch (2003) highlights the strengths of experiments, including control and understanding of causal factors. For these reasons, experiments are also increasingly used in software engineering. Research by Basili (1993) and Basili (1996), Kitchenham et al. (2004), and Wohlin et al. (2000) and Wohlin et al. (2012) laid the foundations for developing the field of empirical software engineering. Basili (1993) and Basili (1996) emphasize the benefits for software engineering of developing an experimental line of research similar to other scientific disciplines. Specifically, he proposes an experimental approach inspired by the quality improvement paradigm as often adapted in industrial development and engineering. Kitchenham et al. (2004) highlight the potential of adapting principles from evidence-based medicine to software engineering and discuss both benefits and difficulties of its adoption. Finally, Wohlin et al. (2012) provide an extensive summary of methodological principles for conducting software engineering experiments. They emphasize the importance of experiments, given that the practice of software engineering builds on the interactions between software artifacts and the behavior of software developers and other stakeholders.

Various reviews have investigated how and to which extent experiments are used in software engineering. Shull et al. (2004) emphasize the importance of replication for establishing reliable insights into software engineering. They present guidelines that are meant to support a better transfer of knowledge about experimental designs and execution.

Sjøberg et al. (2005) review controlled experiments in software engineering published between 1993 and 2002, focusing on how relevant information on subjects and tasks is reported. Their findings suggest that reporting is often incomplete and unsystematic, with partially inconsistent terminology. They state that the software engineering community needs guidelines that help researchers better tackle the methodological and practical complexity of conducting controlled experiments. They provide a list of experimental details that they consider necessary to be reported. The review by Kampenes et al. (2007) drills down into the aspect of effect sizes. They observe that only 29% of the reviewed experiments reported effect sizes, even though this information is considered essential for interpreting experimental results. Additionally, Dybå et al. (2006) review the quantitative assessment of statistical power in software engineering experiments, which they find to be below established norms. They stress the importance of reporting confidence intervals and effect sizes. Hannay and Sjøberg (2007) review to which extent software engineering experiments build upon a theoretical justification of hypotheses. Their results reveal that out of 103 articles, only 23 build on theories, drawing on 40 theories in total. These theories were mainly used for two reasons: to justify research questions and hypotheses as part of the experimental design and to provide additional explanations of the results. The benefits of building theories and of building on theories for software engineering experiments are stressed by Hannay and Sjøberg (2007). As an aid, they provide an extensive overview of the theories used in the papers that they reviewed.

It is important to note that the mentioned reviews stem from the years 2004 to 2007. The weaknesses they uncovered led to a refinement of guidelines for reporting software engineering experiments.

2.2 Experimental Reporting Guidelines in Software Engineering

Reporting has been a concern of research on software engineering experiments since the paper by Singer (1999), and several subsequent papers describe reporting guidelines. We provide a short description of these guideline papers and a comparison between them.

The first guideline paper by Singer (1999) introduced the APA style guidelines to the field of empirical software engineering. According to Singer (1999), the abstract should summarize the paper including information about hypotheses, population, and results. The paper itself should first present the study’s problem and a brief explanation of the research strategy; describe in detail the experimental method, participants, and materials; outline the experimental procedure; and then present the statistical analysis of results with a discussion of the overall findings.

The second guideline of interest is proposed in the book by Wohlin et al. (2000). The authors emphasize the need for describing the scope of the experiment, its goals, and foundational definitions. A summary of the experimental planning should cover the context selection, for its importance to validity and generalization, as well as the hypothesis formulation, including justifications for the selection of variables and subjects. The instrumentation is also meant to be described. Among others, Wohlin et al. (2000) discuss which threats to validity have to be considered and how they can be addressed. The book also provides guidelines on analyzing data and interpreting results, together with suggestions for writing a summary and conclusion. It is worth mentioning that the second edition (Wohlin et al. 2012) refers to the reporting guideline by Jedlitschka et al. (2008).

Guidelines are also presented in the book by Juristo and Moreno (2001), later reprinted as Juristo and Moreno (2010). These guidelines are motivated by the idea that good experimental documentation needs to describe all phases, including goal definition, experiment design, execution, and analysis, at a level of detail that allows a replication study to be conducted. To this end, the motivation should justify the relevance of the experiment, provide pointers to prior experiments, and describe experimental goals and hypotheses. The experimental design should report the experimental factors, response variables, parameters, blocks, as well as experimental units, subjects, and the experimental process. Also, information should be provided about internal replication, the randomization procedure if applicable, as well as subject knowledge, the experimental schedule, and various factors that may have influenced the experimental result, such as potential learning and boredom effects. Regarding experimental execution, details are warranted about experimenters, instructions to participants, the available time for completing the study, as well as experimental deviations and data collection. Finally, the analysis best includes information on data normality, analysis methods, results, and result interpretation.

The guideline by Kitchenham et al. (2002) presents a hands-on approach for reporting the experiment. It suggests describing the context of the experiment in extensive detail. Then, the experimental design is described regarding the planning of subjects, sampling techniques, and the process of allocating and administering treatments. Next, the procedures of experiment execution and data collection are summarized. This leads to the data analysis, presentation, and interpretation of the results, before the report concludes. We have to emphasize that this guideline presents a more general instruction on how to run an experiment, but lacks precise instructions on how to report it.

The most extensive work on reporting is arguably the guidelines by Jedlitschka et al. (2008), which we will use as a reference in this paper for their level of detail. Note that these guidelines are fairly consistent with other guidelines, but more fine-granular. These guidelines suggest starting with the title and authorship section, which should include the term “controlled experiment” in the title. The structure of the abstract is inspired by abstracts in the medical sciences. The actual paper starts with the introduction section, including the problem statement, research objectives, and context. The related work discussion should summarize prior contributions as well as the technology under investigation and alternative technologies. Next, the section on the planning and experimental design covers various aspects. These include research objectives; information on the sampling strategy, population, and sample size; experimental material, objects, tasks, and subjects; subsections on hypotheses, experimental design, and the experimental procedure with details on data gathering; as well as a description of the analysis procedure. In turn, the section on the actual experiment execution is followed by the analysis section. Readers should be provided with descriptive statistics, data set preparation procedures, and hypothesis testing results. The discussion and interpretation section should cover results and implications, as well as threats to validity. The conclusion includes a summary and future work propositions.

Table 1 summarizes the guideline structure by Jedlitschka et al. (2008) together with its sections, subsections, and a short description. The four columns on the right-hand side of this table describe which of its subsections are also considered by previous guidelines, namely G2: Singer (1999); G3: Wohlin et al. (2000); G4: Juristo and Moreno (2001); and G5: Kitchenham et al. (2002). Table 1 also highlights that the guidelines by Jedlitschka et al. (2008) are the most fine-granular ones, and that they show substantial overlap with the activities of other guidelines. Table 1 marks those activities with a check if they are explicitly covered in the reporting structure. Furthermore, it is important to mention that there are guidelines that we did not include in this comparison. Jedlitschka et al. (2014) is an extension of Jedlitschka et al. (2008) suggesting the inclusion of additional information for practitioners, such as costs, quality, schedule, and productivity in the context of software development. We did not consider the introduction, methods, results, and discussion (IMRAD) guideline (Sollaci and Pereira 2004), although it is widely used in various research fields, as it is not specifically designed for software engineering experiments.

Table 1 Description of Jedlitschka et al. (2008) guideline (G1) and comparison with previous guidelines: G2: Singer (1999); G3: Wohlin et al. (2000); G4: Juristo and Moreno (2001); G5: Kitchenham et al. (2002)

2.3 Propositions on the Evolution of Reporting Practices

In this section, we formulate a set of propositions that help us to define clear analysis objectives. Such an approach is consistent with general guidelines on conducting systematic literature reviews (Kitchenham and Charters 2007), which emphasize the need to formulate research questions and objectives. Investigating reporting practices for software engineering experiments is warranted, because the last larger review covering this aspect dates back to Sjøberg et al. (2005) and various proposals for reporting guidelines have been made since then.

Proposition 1 (Patterns)

We expect that patterns of reporting can be observed. Two arguments support this assumption. First, reporting an experiment is a human activity of an author team that becomes routinized by means of social entrainment (Gersick and Hackman 1990). This means that the same author team will likely organize the reporting of a new experiment in the same way as they did for the previous one. Such persistence of behavior has been studied, among others, by Kelly and McGrath (1985). Second, experimental reporting is subject to the social norms of the scientific process in a particular field. Social norms contribute to the development of recurring patterns of behavior (DiMaggio and Powell 1983). These are further stabilized by mimetic behavior (Gersick and Hackman 1990), i.e. imitating the reporting of published experiments in case authors have not yet established their own way of reporting. All these aspects contribute to the emergence of reporting patterns.

Proposition 2 (Conformance)

We expect that compliance with reporting guidelines can be observed. DiMaggio and Powell (1983) emphasize that normative pressure is a key factor that explains why organizations have been observed to be rather similar. These pressures are stronger in fields in which formal education and professional networks establish standards. Arguably, these attributes can be associated with empirical software engineering and reporting experiments in this field as well, contributing to compliance with reporting guidelines.

Proposition 3 (Evolution)

We expect that the reporting patterns have evolved over time and that this evolution is associated with institutional forces. In particular, reporting practices have presumably become more similar over time. Similar observations have been made by Levitt and Nass (1989), who compared the topic sequence in leading textbooks in physics and sociology over time. Levitt and Nass (1989) explain their results with institutional forces, including coercive, mimetic, and normative pressures (DiMaggio and Powell 1983). Such forces are arguably also relevant for reporting software engineering experiments.

Proposition 4 (Contingency)

We expect that subgroups of papers with similar reporting can be observed. Burnes (1996) emphasizes that there is often “no one best way” of applying methods because contingent factors require an adaptation to circumstances. Similar observations have influenced situational method engineering (Brinkkemper 1996). This does not mean that reporting is arbitrary, but that differences are systematic and associated with contextual factors. As a consequence, we would be able to observe that certain types of papers form clusters. Often, when there are hardly any patterns overall (Proposition 1), it is still possible to identify patterns for subgroups, which is what we investigate for this proposition.

Proposition 5 (Factors)

We finally expect several factors to be associated with reporting practices. First, we expect that the awareness that papers exhibit concerning the discourse on reporting guidelines is associated with reporting practice. The weakest indication of such awareness is arguably the citation of a reporting guideline. Second, the specific setting of an experiment might have an impact on reporting. Presumably, replication studies define a context in which specific reporting needs have to be considered (Carver 2010).

Next, we describe how we constructed our dataset with which we aim to investigate these propositions.

3 Method

In this section, we present the research design for investigating Propositions 1–5. To this end, we use a mixed-method approach that combines qualitative and quantitative research methods. More precisely, we apply a sequential mixed-method design (Venkatesh et al. 2013). We first conduct qualitative coding of experiment papers inspired by systematic mapping studies (Kitchenham and Charters 2007) and qualitative coding procedures (Saldaña 2015), which yields structured data that we analyze using computational methods (Berente et al. 2019), namely process mining (van der Aalst 2016) and method mining (Malinova et al. 2019).

We proceed as follows. Section 3.1 defines preliminary concepts that we make use of. Section 3.2 explains our paper selection procedure, and Section 3.3 how we coded the selected papers as event sequences. Section 3.4 describes the analysis techniques we applied, and Section 3.5 provides a summary of which technique is applied to investigate which proposition.

3.1 Preliminaries

Our research method builds on the overall idea that a paper describing a software engineering experiment can be represented as a sequence of sections, and that this sequence can be compared with reporting guidelines by the help of process mining techniques. To this end, we have to map a paper to a structured format representing this sequence of sections. We define this paper structure as follows.

The formal structure of a paper \(\mathbb{P} = \left\langle s_{1}, \ldots, s_{n} \right\rangle\) is a sequence of sections and subsections \(s_{i}\). For all pairs of indexes \(i, j \in \mathbb{N}\) with \(i < j\), we say that \(s_{i}\) appears before section \(s_{j}\) in the sequence of the paper structure. Each section \(s_{i}\) includes content \(k_{i}\). A requirement for our analysis is to progress from the formal structure of a paper with its section contents to a logical sequence that is aligned with reporting guidelines. Our dataset (\(\mathbb{D}\)) is composed of such logical sequences, each corresponding to one paper.

For our analysis, we build on analysis techniques from process mining. Therefore, we recall the classical notions of process mining: event, event sequence, and event log. An event is a tuple \(e = (c, a, t)\) where \(c\) is the case id, \(a\) is the activity (event type), and \(t\) is the timestamp imposing a temporal order over the events. An event sequence is a sequence \(\sigma = \left\langle e_{1}, \ldots, e_{\vert\sigma\vert} \right\rangle\) of events such that \(\forall i, j \in \{1, \ldots, \vert\sigma\vert\}: e_{i}.c = e_{j}.c\). An event log \(\mathbb{L}\) is a multi-set \(\left\{\sigma_{1}, \ldots, \sigma_{\vert\mathbb{L}\vert}\right\}\) of sequences. In our dataset (\(\mathbb{D}\)), events represent content blocks that match an item of the reporting guidelines, activities define to which reporting activity a content block maps, and timestamps capture the order in which content blocks appear in the text of the paper. We define the alphabet \(A\) as the set of all activity types defined by the reporting guidelines of Jedlitschka et al. (2008). The content further describes an activity of the guidelines. In particular, we characterize each activity using a set of keywords. The keywords represent plausible terms to be used in the heading of a section. Therefore, an activity is described as a 5-tuple

$$a=(group, label, keywords, description, required)$$

where group is the name of the group of related activities to which a belongs, label is the name of the activity, keywords is a set of terms that define the activity, description is a short text that describes its purpose, and required indicates whether the activity is mandatory or not.
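To make this data model concrete, the following Python sketch shows one possible representation of activities and events; the class and field names mirror the definitions above, but the example values are hypothetical and not part of our coded dataset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Activity:
    """Reporting activity of the guideline (the 5-tuple defined above)."""
    group: str            # name of the group of related activities
    label: str            # name of the activity
    keywords: frozenset   # terms expected in section headings
    description: str      # short text describing the purpose
    required: bool        # whether the activity is mandatory

@dataclass(frozen=True)
class Event:
    """Event e = (c, a, t): a content block of one paper mapped to an activity."""
    case_id: str          # paper identifier
    activity: str         # label of the matched reporting activity
    timestamp: int        # position of the content block within the paper

# Hypothetical example entries (illustrative only)
design_experiment = Activity(
    group="Experiment planning",
    label="Design Experiment",
    keywords=frozenset({"design", "experimental design"}),
    description="Describes the experimental design of the study.",
    required=True,
)
first_event = Event(case_id="P042", activity="Define Abstract", timestamp=1)
```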

The logical sequence of activities defines the logical structure of reporting an experiment. Figure 1 shows this logical structure as a BPMN process model, serving as an example that describes the steps of reporting an experiment. Circles define the start and the end. Rectangles represent the activities and arcs the sequence of the activities. Diamond shapes represent gateways indicating that consecutive activities can be performed in any order or are mutually exclusive. In our example, the background of the experiment and the plan of the experiment can be reported in any order, while the activity describing the execution will only be included if there is a deviation in the experiment.

Fig. 1: Example of a process model for the process of reporting an experiment

3.2 Paper Selection

We selected papers according to guidelines for systematic literature reviews (Kitchenham and Charters 2007). We focused on papers reporting controlled software engineering experiments with human participants. We selected papers from the four major software engineering journals with the highest impact factor: Information and Software Technology (IST), IEEE Transactions on Software Engineering (TSE), Journal of Systems and Software (JSS), and Empirical Software Engineering (ESE). We conducted a search directly on the publishers' repositories for papers with the term “experiment” appearing either in the title or in the abstract and which were published between 2000 and 2020. We chose the 20-year period starting in 2000 because the first guideline for reporting controlled experiments was published by Singer (1999) in the year before. Therefore, only papers published after 1999 would have had the chance to report their experiments following a guideline.

We used only the term “experiment” in our query to guarantee high recall. We deliberately accepted the risk of including papers with this query that report on any type of experiment. We addressed the challenge of low precision by manually inspecting and removing papers that (i) do not present controlled experiments with human participants and (ii) do not use an experimental design as a research method.

Table 2 shows the number of papers retrieved using our search query (#Retrieved) and the number of papers remaining for analysis after the selection criteria were applied (#Analyzed). The last column shows the percentage of papers kept for analysis. It is worth mentioning that the journal Empirical Software Engineering has the highest percentage of kept papers. Our dataset of papers (\(\mathbb{D}\)) contains 168 papers. A list of all papers is included in the Appendix.

Table 2 Number of papers retrieved for each of the journals and the number of those considered for analysis

3.3 Coding Procedure

Next, we describe our coding procedure. The input to this procedure is a paper and its output is a structured representation of that paper in terms of a sequence σ. To this end, we sequentially process all sections and each respective subsection of the input paper in the order in which they appear in the text.

Each section or subsection s of the paper is matched with reporting activities of the guideline of Jedlitschka et al. (2008). This matching is done by members of the author team using two rules. First, the label terms of s are compared with the keywords of the different reporting activities. If an activity clearly matches, it is chosen. Second, if the label is unclear or ambiguous, the content of the section is read (Holsti 1969). As a result, a set of activities is identified or the section is ignored due to a lack of fit with the guideline.
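The first matching rule, comparing section labels with activity keywords, amounts to a simple lookup, as the following minimal Python sketch illustrates. The keyword table shown is a hypothetical excerpt, not our full coding scheme (see Table 3), and the second rule of reading the section content remains a manual judgment.

```python
# Hypothetical excerpt of the keyword table (see Table 3 for the full set)
KEYWORDS = {
    "Design Experiment": {"design", "experimental design"},
    "Assess Threats to Validity": {"threats", "validity"},
    "Explore Data": {"descriptive statistics", "data exploration"},
}

def match_section(heading):
    """Return candidate activities whose keywords occur in the section heading."""
    heading = heading.lower()
    return [activity for activity, keywords in KEYWORDS.items()
            if any(keyword in heading for keyword in keywords)]

print(match_section("4.2 Experimental Design"))  # ['Design Experiment']
print(match_section("Threats to Validity"))      # ['Assess Threats to Validity']
```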

The coding of each section or subsection based on their content and the meaning associated with each reporting activity is critical for the validity of our study. Therefore, we adopted the procedure of Recker et al. (2019) and divided the coding into two phases. The first one addressed the consistency of the coding scheme, i.e. the definition of the activities, and the second one the coding itself.

In the first phase, a random sample of 40 papers (approximately 24% of the papers) was selected for defining the coding scheme using keywords and a description for each activity. The refinement of this initial coding scheme was done in four rounds. In each round, two authors coded ten papers, discussed the inconsistencies and improved the coding scheme. After the fourth round, no further refinements of the coding scheme were identified, which completed the first phase. In the second phase, the remaining set of papers was coded by one author.

Following Recker et al. (2019), at each refinement round we calculated the agreement between the two authors using Cohen's Kappa (Cohen 1960) as a measure of intercoder reliability. Figure 2 depicts the Kappa analysis over the rounds. Figure 2(a) visualizes the improvement of Kappa over the rounds, with the final round achieving a Kappa of 0.91, indicating almost perfect agreement (Neuendorf 2002, p.145). Figure 2(b) also shows for each round the value of Kappa and the number of codings done by each author (column Rows). Given that the number of sections varies from paper to paper, the number of codings also differs for each round. In the discussions between the authors, we kept track of the number of codes they agreed on. We calculated the number of correct codes for each author, i.e. the number of times their code was eventually chosen. The author with the better coding record was chosen to do the rest of the coding of the dataset in the second phase.
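For illustration, the agreement for a single refinement round can be computed as follows; we show scikit-learn's implementation of Cohen's Kappa, and the two coding lists are hypothetical placeholders rather than our actual coding data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical codings of the same sections by the two authors (one code per section)
coder_a = ["Define Problem", "Define Objectives", "Design Experiment", "Explore Data"]
coder_b = ["Define Problem", "Define Context",    "Design Experiment", "Explore Data"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")
```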

Fig. 2: The graph in (a) shows the Kappa value evolution over the rounds. The table in (b) provides the details of the Kappa analyses: the value over the rounds and the total number of codings per round

Table 3 depicts the final set of activities. There is a one-to-one relation with the content of the Jedlitschka et al. (2008) guideline shown in Table 1. The label of each activity is summarized together with its corresponding keywords. Whether an activity is mandatory or optional is determined by following the definition of required content presented in the Jedlitschka et al. (2008) guideline. Sections considered required in the guideline are mandatory activities, while sections not required are optional. The required contents Abstract and Keywords of the Jedlitschka et al. (2008) guideline are considered as mandatory activities in our approach with the labels Define Abstract and Define Keywords. They were both omitted from Tables 1 and 3 because of space restrictions. Thus, 29 distinct activities are considered, of which 19 are required and 10 are optional.

Table 3 The list of activities considered with the final set of keywords (Define Abstract and Define Keywords omitted for space restrictions)

As a final step of data preparation, we merged consecutive activities of the same type into one. For instance, a sequence such as <Define Goals, Design Experiment, Design Experiment, Interpret Results, Interpret Results, Interpret Results, Summarize Findings> is compressed to <Define Goals, Design Experiment, Interpret Results, Summarize Findings>. This reduced the total number of activities in the event log by 16%.
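This compression step is straightforward to automate; the following sketch shows one way to express it in Python, using the example sequence from the text.

```python
from itertools import groupby

def compress(sequence):
    """Merge consecutive activities of the same type into one occurrence."""
    return [activity for activity, _ in groupby(sequence)]

example = ["Define Goals", "Design Experiment", "Design Experiment",
           "Interpret Results", "Interpret Results", "Interpret Results",
           "Summarize Findings"]
assert compress(example) == ["Define Goals", "Design Experiment",
                             "Interpret Results", "Summarize Findings"]
```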

3.4 Applied Process Mining Techniques

We followed the described procedures and obtained a dataset \(\mathbb{D}\) that codes a set of papers using the data structure of an event log as used in process mining. For this reason, various analysis techniques from process mining can be readily applied. Such analyses can be used to investigate to which extent reporting guidelines are considered in each paper, which patterns of reporting exist, and which changes of patterns over time can be observed.

Next, we describe the analysis techniques that we include in our analysis. Sections 3.4.1 and 3.4.2 describe automatic process discovery and conformance checking, respectively. Section 3.4.3 explains how we use clustering techniques.

3.4.1 Process discovery

Process discovery takes an event log as an input and automatically generates a process model representing the sequences of that event log. Figure 3 describes how process discovery works using a simple artificial example. The starting point is the event log shown in Fig. 3(a). It contains three different sequences of activities. Process discovery algorithms construct a process model from such an event log based on behavioral relations between the activities. All sequences have the same two initial activities (Define Goals, Design Experiment). This pattern is reflected in the output model by including a sequence of these two activities as a mandatory flow. The subsequent behavior is different for the three sequences. Discovery algorithms spot that the first and the second sequence execute the same two activities (Explore Data, Test Hypothesis), but in a different order, and that the third sequence includes a third activity instead (Interpret Results). These observations are reflected in the model by parallel and exclusive gateways, respectively, creating different flow options. The suffix is the same for all three sequences and therefore the final activity Summarize Findings is included as mandatory. Figure 3(b) shows the discovered model.

Fig. 3: Illustration of an event log (a) and the discovered process model (b)

Event logs from practice are far more complex than this illustrative example. This implies the challenge of representing the behavior compactly and, specifically, a trade-off between (i) fitness: the discovered model should allow for the behavior observed in the event log; (ii) precision (avoid underfitting): the discovered model should not allow behavior that was not observed in the event log; (iii) generalization (avoid overfitting): the discovered model should generalize the observed behavior seen in the event log; and (iv) simplicity: the discovered model should be as simple as possible (van der Aalst 2016).

If the sequences are similar in terms of their behavior, the derived process model will have high fitness and a clear and simple structure. However, if the sequences are substantially different in terms of order and activities, the discovered model is often chaotic; it is also referred to as a spaghetti model (van der Aalst 2016). These spaghetti models are hard to analyze and reveal a lack of patterns in execution.

In this study, we use process discovery techniques to investigate Proposition 1, i.e. to which extent common reporting patterns exist in the considered papers on software engineering experiments. More specifically, we use the tool Disco, a widely used commercial process mining tool, to discover a process model from the event log of our dataset (Section 3.3).
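Disco is a commercial tool, so we do not show its internals here. As a tool-agnostic illustration of the kind of information such a discovery step relies on, the following sketch computes the directly-follows relation, i.e. the backbone of the process maps that Disco renders, from coded sequences. The two sequences shown are shortened, hypothetical examples.

```python
from collections import Counter

def directly_follows(log):
    """Count how often activity a is directly followed by activity b across all sequences."""
    dfg = Counter()
    for sequence in log:
        for a, b in zip(sequence, sequence[1:]):
            dfg[(a, b)] += 1
    return dfg

# Hypothetical, shortened reporting sequences
log = [
    ["Define Abstract", "Define Keywords", "Define Problem", "Design Experiment"],
    ["Define Abstract", "Define Keywords", "Define Problem", "Define Objectives"],
]
for (a, b), count in directly_follows(log).most_common(3):
    print(f"{a} -> {b}: {count}")
```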

3.4.2 Conformance Checking

Conformance checking techniques provide insights into the extent of consistency between a process model and the sequences of an event log. They take an event log and a model as input and identify the extent to which the event sequences deviate from the model. Several techniques exist, e.g. replaying each sequence against the process model or calculating an alignment (van der Aalst et al. 2012). Given a sequence and a process model, an alignment maps the event sequence to the process model's best fitting run.

Consider, for instance, the process model in Fig. 3(b). If the event log includes the sequence <Define Goals, Design Experiment, Summarize Findings>, a conformance checking algorithm will report that the activity Interpret Results or the activities Explore Data and Test Hypothesis are missing from the sequence. In the same way, if a sequence like <Define Goals, Design Experiment, Interpret Results, Define Lessons Learned, Summarize Findings> is observed, conformance checking reports that an unexpected activity (Define Lessons Learned) was executed.

In our study, we use conformance checking techniques to investigate Proposition 2, i.e. to analyze to which extent software engineering experiments follow the guidelines proposed by Jedlitschka et al. (2008). To that end, we manually created a process model based on these guidelines. Since a paper is written in a sequential way, the list of activities in Table 3 defines a sequence. Optional activities are modeled within XOR-gateway blocks. Figure 4 shows the corresponding process model. We use this model for checking the conformance between the reporting sequences of individual papers and the guidelines by Jedlitschka et al. (2008). We calculate conformance with the plugin Replay a Log On Petri Net For Conformance Analysis (van der Aalst et al. 2012) of ProM.

Fig. 4: Process model capturing the list of activities depicted in Table 3
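We computed conformance with the ProM plugin mentioned above. As an alternative for readers who prefer a scripted workflow, a comparable fitness value can be obtained with the open-source pm4py library, sketched below under the assumption that the coded event log is exported as XES and the guideline model of Fig. 4 as PNML; the file names are placeholders, and the function names refer to pm4py's simplified interface, which may differ between versions.

```python
import pm4py

# Placeholder file names: the coded reporting sequences as XES,
# the guideline process model of Fig. 4 as a Petri net in PNML.
log = pm4py.read_xes("reporting_sequences.xes")
net, initial_marking, final_marking = pm4py.read_pnml("jedlitschka_guideline.pnml")

# Token-based replay fitness, comparable in spirit to the ProM replay plugin;
# the returned dictionary contains log-level and average trace fitness values.
fitness = pm4py.fitness_token_based_replay(log, net, initial_marking, final_marking)
print(fitness)
```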

Also, we use conformance checking and its evolution over time to evaluate Proposition 3 and, in part, Proposition 5, in combination with potential factors associated with reporting practice, such as citation of guidelines and replication.

3.4.3 Cluster Analysis

Cluster analysis allows the identification of groups of sequences in an event log. Two sequences are put into the same cluster if they are similar in terms of a suitable distance function. Various techniques for calculating distances between sequences have been proposed for social sequence analysis (Abbott 1995; Gabadinho et al. 2011) and process mining (Song et al. 2008; De Koninck et al. 2017).

In the context of our work, we use cluster analysis to address Proposition 4, i.e. to investigate whether there are different recurrent patterns for reporting experiments. It is not clear how many clusters can be expected. If all papers followed the reporting guideline structure of Jedlitschka et al. (2008), we would obtain one cluster containing very similar reporting sequences. If papers reported experiments arbitrarily, we might obtain a high number of rather dissimilar sequence clusters. It is more plausible to expect only a few clusters. In that case, it is interesting to investigate which sequences are representative for each cluster and how far they differ.

We use the TraMineR tool (Gabadinho et al. 2011), an R package for exploring sequence data, for our cluster analysis. For calculating the sequence distance, we used the optimal matching algorithm (Abbott and Tsay 2000).
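The clustering itself was done in R with TraMineR. For readers more familiar with Python, the sketch below illustrates the underlying idea with a plain edit distance between activity sequences, a simple stand-in for optimal matching with unit costs, fed into hierarchical clustering; the sequences and the number of clusters are purely illustrative.

```python
from scipy.cluster.hierarchy import linkage, fcluster

def edit_distance(s, t):
    """Levenshtein distance over activity labels (unit insert/delete/substitute costs)."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(t) + 1)] for i in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    return d[len(s)][len(t)]

# Hypothetical coded sequences
sequences = [
    ["Define Abstract", "Define Problem", "Design Experiment", "Test Hypothesis"],
    ["Define Abstract", "Define Problem", "Design Experiment", "Explore Data"],
    ["Define Abstract", "Define Objectives", "Explore Data"],
]
n = len(sequences)
# Condensed pairwise distance matrix, then average-linkage clustering into two groups
condensed = [edit_distance(sequences[i], sequences[j])
             for i in range(n) for j in range(i + 1, n)]
clusters = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")
print(clusters)
```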

3.5 Propositions and Corresponding Techniques

Table 4 summarizes Propositions 1-5, the corresponding analysis techniques that we apply for investigating them, and the corresponding tools used. In Fig. 5, we indicate the input and output for each of the analysis techniques.

Table 4 Analysis techniques used to investigate Propositions 1-5
Fig. 5: Research strategies and applied methods

4 Results

This section describes the results of our study of reporting practices of software engineering experiments. Section 4.1 provides descriptive statistics of our dataset. Section 4.2 presents the results of analyzing the data using automatic process discovery. Section 4.3 discusses the conformance checking results, which provide insights into how well aligned the articles are with reporting guidelines. Section 4.4 describes the results of clustering articles according to their reporting sequences. Section 4.5 evaluates to which extent citing guidelines is connected with guideline conformance. Finally, Section 4.6 presents observations on how replication studies use reporting guidelines.

4.1 Descriptive statistics

Our event log contains 168 cases (each describing a paper and the sequence of its reporting steps) from the year 2000 until 2020. Figure 6 shows the temporal distribution of these papers for each of the four journals. For every year, there are three or more papers in our analysis.

Fig. 6: Distribution of the papers over time

Table 5 shows log statistics about the activities before (\(\#Activities\)) and after compressing consecutive activities (\(\#Activities_{c}\)). The table provides the number of activities per paper (maximum (Max), minimum (Min), and average (Avg)) and the number of distinct activities in the whole log (maximum (Max), minimum (Min), and average (Avg)). It is interesting to note that we did not encounter any sequence that occurred more than once, which means that every paper's reporting sequence was unique. TSE was the only journal in which not all 29 reporting activities were observed. Three optional activities were missing: Define Experiment Deviation, Define Experiment Preparation, and Define Relevance to Practice. Another interesting observation is that the average number of distinct activities per paper is less than the number of required activities (19) for all journals.

Table 5 Log statistics

Furthermore, several activities are repeated within an average paper, i.e. the number of distinct activities per paper (\(\#Distinct_{Avg}\)) differs from the number of activities per paper (\(\#PerPaper_{{Avg}_{c}}\)). Table 6 shows each activity's frequency in the event log (Log Frequency) and in how many papers each activity appears (Paper Frequency). The corresponding percentages are also presented.

Table 6 Frequency of each activity in the entire event log and per paper

4.2 Process discovery

To check if the papers follow recurring patterns of reporting, we applied automatic process discovery for the complete event log using the tool Disco. Figure 7 shows the process model discovered from the set of all papers. In this model, all possible paths are shown. We observe that the complexity of this model is overwhelming, and that it is difficult to spot patterns of recurring behavior.

Fig. 7: Process model discovered from the complete event log. All possible paths are represented

The two activities that all papers include and that have a clear position in this spaghetti model are the Define Abstract and Define Keywords activities at the top of the model (not readable in the figure). This is probably not due to the guidelines, but due to the fact that paper submission formally enforces the inclusion of an abstract and keywords. Therefore, it is not surprising that they are observed in all papers and in this order.

Once we apply filtering techniques provided by Disco to only show the minimum number of paths for connecting all 29 activities, we obtain the process model shown in Fig. 8. Compared with the spaghetti model in Fig. 7, this model is easier to understand and interpret. The darker the color of an activity, the more often this activity occurred in the event log. The activity Explore Data is the most frequently observed activity, occurring 350 times (due to repetitions in various papers). The thicker the transition arrows, the more often the corresponding path is observed in the log. The most frequent sequence of activities is from activity Explore Data to activity Test Hypothesis with 204 occurrences (due to repetitions in various papers). In this filtered process model, the frequency associated with an activity is greater than or equal to the sum of the frequencies of its outgoing transition arrows, because not all possible arrows with their corresponding frequencies are shown. This process model shows that papers usually start with the definition of the problem (99% of the 168 papers) followed by the definition of the experiment's objectives (97%). We also notice that many activities and many transitions are only observed for a smaller fraction of papers.

Fig. 8: Process model discovered from the complete event log considering the minimum path for connecting all activities

4.3 Conformance Checking

Conformance checking is a group of techniques that facilitates the comparison between the sequences represented in a process model (such as reporting guidelines) and the sequences of papers observed in our event log. We conducted such a conformance check for each paper based on the process model shown above in Fig. 4, which captures the reporting guidelines by Jedlitschka et al. (2008). We used the classical notion of fitness as a measure of conformance (van der Aalst 2016). A sequence fully conforming with the process model has a fitness of 1, while a sequence that does not conform at all has a fitness of 0. We summarize the results for each journal separately and in total. Figure 9 shows the Box plot of this conformance analysis.

Fig. 9: Conformance analysis of each journal independently and all journals together

The bulk of papers range between 0.6 and 0.7 in terms of conformance. Given that the data is normally distributed, we performed a one-way ANOVA test with no assumption of equal variances. The difference between the mean values of the four journals is statistically significant at the 95% confidence level (F = 6.1574, num df = 3, denom df = 74.535, p-value = 0.0008497). The journal with the highest average conformance is ESE. This is not surprising given that it is the journal with the highest affinity with controlled experiments. It also has to be noted that we do not observe drastic differences in conformance between the journals. The Box plot of Fig. 9 also highlights some outliers with either outstanding or very low conformance. The two papers with the highest conformance are from JSS and the third-highest from TSE. All three papers explicitly cited guidelines, with two of them citing the Jedlitschka et al. (2008) guideline. JSS and TSE are also the journals of the two papers with the lowest conformance of below 0.4. These papers did not cite any guidelines.

We also analyzed the evolution of the conformance over the 20 years in which the papers were published. Figure 10(a) shows a Box plot of the conformance of all papers for each year. We observe a slight increase in the average until the year 2008, when Jedlitschka et al. (2008) was published. Figure 10(b) shows the evolution of the average conformance over the years for each journal and also for the event log with all the papers. All journals show a similar evolution without any clear upward or downward trend over the years. More specifically, we do not observe any noticeable change after the year 2008 when Jedlitschka et al. (2008) was published. The peak of the curve for JSS in 2009 stems from the fact that only one experiment paper was published in that year in this journal and that this paper is an outlier with the highest guideline conformance of the whole set of papers. In summary, Fig. 10(a) and (b) show the same range of average conformance between 0.6 and 0.7 that we already observed in Fig. 9. Also, the KPSS test (Kwiatkowski et al. 1992) showed that the conformance time series is stationary at the 0.05 significance level (KPSS Level = 0.16735, Truncation lag parameter = 4, p-value = 0.1), without any clear upward or downward trend.

Fig. 10: Conformance analysis of all papers over the years
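The stationarity check can be reproduced with the statsmodels implementation of the KPSS test, as sketched below; the yearly conformance values shown are hypothetical placeholders rather than our measured averages.

```python
from statsmodels.tsa.stattools import kpss

# Hypothetical yearly average conformance values for 2000-2020 (21 values)
yearly_conformance = [0.63, 0.66, 0.61, 0.68, 0.64, 0.67, 0.65, 0.62, 0.69, 0.66,
                      0.64, 0.67, 0.63, 0.65, 0.68, 0.66, 0.64, 0.67, 0.65, 0.66, 0.64]

# Level-stationarity test (regression="c"); a p-value above 0.05 means the
# null hypothesis of stationarity is not rejected.
statistic, p_value, lags, critical_values = kpss(yearly_conformance, regression="c")
print(f"KPSS level statistic: {statistic:.3f}, p-value: {p_value:.2f}")
```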

4.4 Cluster Analysis

For the cluster analysis, we use the TraMineR tool (Gabadinho et al. 2011). This tool supports clustering based on classical sequence alignment. This means, in essence, that sequences are clustered based on a notion of sequence edit distance. Different numbers of clusters were evaluated and the best result yielded four clusters. Figure 11 shows the four clusters of sequences. The X-axis represents the position (A) in a sequence. It is scaled to 41, which is the maximum number of activities, stemming from the paper with the longest sequence (compare Table 5 above). The Y-axis represents the relative frequency of color-coded activities. Note that the area covered by a color bears the relevant information, not necessarily the position of a color dot. Rows in the diagram should not be interpreted as individual papers. The red color corresponds to missing values, which are appended to fill up shorter sequences to a length of 41. Each cluster is also annotated with the number of sequences it contains: Cluster 1 has 56 sequences, Cluster 2 has 47, Cluster 3 has 45, and Cluster 4 has 20.

Fig. 11: Clusters identified by using TraMineR

All clusters have a low Average Silhouette Width (0.2, 0.09, 0.12, and 0.03, respectively), indicating that there is no clear cluster structure (Kaufman and Rousseeuw 1990). Also, there are 4, 11, 6, and 5 representative sequences, respectively, considering a coverage threshold of 25%. The absence of a unique representative sequence for the clusters underlines that the clusters are unstructured. However, it is possible to spot some commonalities. Given that the average length of a sequence is 22, which is almost half of the longest sequence, all clusters have a large red area indicating missing values. We also observe that all four clusters have a similar start pattern that corresponds to the sequence Define Abstract, Define Keywords followed by Define Problem and Define Objectives.

There are also some noticeable differences. Cluster 1 contains the shortest sequences on average. It is also the cluster with the smallest number of explicit Test Hypothesis activities. The activity Design Experiment is also less frequent than in the other clusters. The average fitness of this cluster with the reporting guidelines is 0.69. In contrast to Cluster 1, Cluster 3 has longer sequences. Most of them range between 20 and 25 activities, while those of Cluster 1 range between 15 and 18. Cluster 3 also has a larger relative frequency of Test Hypothesis and Assess Threats to Validity. Still, the overall proportion of these activities is low. All this suggests that Cluster 3 includes papers with a more structured reporting style. The average fitness is highest for this cluster with 0.73.

Cluster 2 appears to be a rather homogeneous group of papers with a structured reporting style. There is ample attention to Define Task, Design Experiment, and Explore Data together with a clear methodological focus on Test Hypothesis and Assess Threats to Validity. In contrast to the heterogeneous Cluster 4, the papers in Cluster 2 seem to have a clear focus, which results in a detailed, yet compact reporting style. The average fitness of Cluster 2 is 0.70, and for Cluster 4 it is 0.62.

4.5 Guideline Citation Analysis

This section follows up on the question of whether an explicit consideration of guidelines leads to a more structured reporting style. To this end, we analyze to which extent papers cite established guidelines. First, Section 4.5.1 investigates how many papers cite reporting guidelines, including Jedlitschka et al. (2008), and whether this is connected with conformance. Second, Section 4.5.2 analyzes which papers have been citing Jedlitschka et al. (2008).

4.5.1 Guideline Citation of Software Engineering Experiment Papers

One potential explanation for higher conformance with reporting guidelines is whether reporting guideline papers were considered at all. We observed above that the conformance of papers with the guidelines by Jedlitschka et al. (2008) ranges between 0.6 and 0.7. The focus of our analysis here is on three questions: (i) is the Jedlitschka et al. (2008) guideline (G1a) or its previous version (Jedlitschka and Pfahl 2005) (G1b) cited? (ii) is the book by Wohlin et al. (2012) (G1c), which points to Jedlitschka et al. (2008), cited? (iii) does a paper consider any other guideline, i.e. Singer (1999) (G2), Wohlin et al. (2000) (G3), Juristo and Moreno (2001) (G4), or Kitchenham et al. (2002) (G5)?

We analyzed for all 168 papers which guidelines they cite. As a result, we found that 57 papers (34%) do not cite any guideline at all. The remaining 111 papers cite at least one guideline. Table 7 shows the number of citations each guideline received. The percentage is calculated over the total number of papers that could potentially have cited the respective guideline. This means that a paper published in or after the publication year of a guideline could potentially have cited it.

Table 7 Number of citations over the total number of possible citations for each guideline

The most cited guideline is the book by Wohlin et al. (2000), with almost 43% of the potential papers referring to it. Jedlitschka et al. (2008) is cited by 17.5% of the potential papers; however, the book by Wohlin et al. (2012), which recommends Jedlitschka et al. (2008) as a guideline, received almost double the citations. Figure 12 presents a comparison of the number of citations that Wohlin et al. (2000), Jedlitschka et al. (2008), and Wohlin et al. (2012) received over time. We observe that from 2014, when papers started citing Wohlin et al. (2012), the citations of Jedlitschka et al. (2008) and Wohlin et al. (2012) evolved roughly in parallel to each other, while citations of Wohlin et al. (2000) decreased.

Fig. 12: Citation comparison between the Jedlitschka et al. (2008) (G1a), Wohlin et al. (2012) (G1c), and Wohlin et al. (2000) (G3) guidelines

Figure 13 shows a Box plot comparison of the fitness between those papers that cited at least one of the guidelines (YES) and those that did not cite any guideline at all (NO). We can observe a slightly higher fitness for those papers that cited guidelines. We performed a Welch t-test (Ruxton 2006). The difference between the mean values of the two groups is statistically significant at the 95% confidence level (t = − 4.1471, df = 132.02, p-value = 5.987e − 05), showing an association between conformance and guideline citation.

Fig. 13: Conformance checking comparison between all the papers that cited any of the guidelines and the papers that did not cite any of the guidelines
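The comparison can be reproduced with a standard Welch t-test, e.g. using SciPy as sketched below; the fitness values shown are hypothetical placeholders, not our measured data.

```python
from scipy.stats import ttest_ind

# Hypothetical fitness values for papers citing a guideline vs. papers citing none
fitness_citing = [0.71, 0.68, 0.73, 0.66, 0.70, 0.69]
fitness_not_citing = [0.62, 0.64, 0.60, 0.65, 0.63, 0.61]

# equal_var=False selects Welch's t-test (no equal-variance assumption)
t_stat, p_value = ttest_ind(fitness_citing, fitness_not_citing, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```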

Also, 30% of the papers cite only one guideline. Among them, only 7 papers cite either the Juristo and Moreno (2001) or the Kitchenham et al. (2002) guidelines. The remaining papers cite versions of the Jedlitschka et al. (2008) guidelines (G1a, G1b, or G1c) or the Wohlin et al. (2000) guidelines. A pairwise correlation analysis between the Jedlitschka et al. (2008) guideline and the other guidelines (G2, G3, G4, and G5) shows a correlation with G2 and no correlation with the others (Pearson's Chi-squared test at the 0.05 significance level). A correlation between the four guidelines (G2, G3, G4, and G5) was not found either. Thus, it was not possible to identify a pattern in the citation of guidelines.

4.5.2 Citation from Various Types of Papers

It is surprising that Jedlitschka et al. (2008) is highly cited overall, but the share of software engineering experiment papers citing it is rather low at 17.5%. Therefore, we investigate which types of papers cite Jedlitschka et al. (2008).

A search on Google Scholar yielded a list of 300 citations. These citations are distributed as follows: 100 conference papers, 103 journal papers, and 97 other types of publication, such as books, dissertations, and theses. We looked more deeply into the journal and conference papers and classified them as Method (M), Experiment (E), or Survey (S) papers. Table 8(a) shows the statistics of papers per category. We notice that the majority of the papers are from the category Experiment (E). Considering the papers published in one of the four journals, we observed that 35 are experiments. Of these, we analyzed the 21 papers that cite Jedlitschka et al. (2008); the remaining 14 papers fall outside our study's scope (experiments with human participants in which the experiment is the main contribution of the paper). This means that a large fraction of the citations of Jedlitschka et al. (2008) stem from conference papers and articles that are not published in IST, TSE, JSS, or ESE.

Table 8 Number of journal and conference papers per category (a)

4.6 Replication Analysis

Replication is one of the principles of the scientific method (Brown et al. 2016). Replication studies are an important means to further strengthen the trust in the validity of results obtained by the original experiment. It has been emphasized that a replication study has to describe how the original study was performed, report the implementation differences of the replication, and provide a comparison of the results (Carver 2010). It is also important to describe the type of replication (Bettis et al. 2016) and to use a shared terminology (Baldassarre et al. 2014). Due to these specific aspects, replication experiments require a separate structure of reporting. Table 9 shows the number of replication papers that are part of our analysis.

Table 9 Number of replications for each of the journals and their percentage

IST and ESE are the journals with the most replication papers. This is consistent with the results of a recent systematic mapping study on replication (Cruz et al. 2020). In their mapping study, Cruz et al. (2020) emphasize the lack of guidelines for reporting replications. They state that the guideline proposal by Carver (2010) is the only high-level proposal for reporting replications so far. This lack of guidelines might also explain our finding that nine of the replications were internal replications published together with the original study. Twenty of the twenty-five replication papers were published after the year 2010 and therefore could have followed Carver's guidelines (Carver 2010). However, only nine did (45%).

Figure 14 shows to which extent the replication papers conform with Jedlitschka et al. (2008). The fitness of the replication papers is comparable to that of the other papers. The difference between the mean values of the two groups is not statistically significant at the 95% confidence level based on a Welch t-test (Ruxton 2006) (t = − 0.22898, df = 29.114, p-value = 0.8205), showing no association between conformance and replication.

Fig. 14: Conformance checking comparison between papers that report a replication (YES) and papers that report an original study (NO)

4.7 Summary of Results

We summarize the results of our mixed-method study as follows. The descriptive analysis revealed that a consistent reporting style is only followed to a limited extent. The variance is high, with some papers reporting only a fraction of the required reporting activities, others repeating activities, and only a few showing high fitness with the reporting guidelines. The application of process mining techniques confirmed the impression of a rather inconsistent reporting practice. Conformance checking revealed that the reporting sequences of most papers only partially follow the guidelines by Jedlitschka et al. (2008). The clustering analysis suggested that only Cluster 2 appears to consider most of the important activities. The conformance analysis over time showed that there is no consistent improvement of reporting practice over time. These findings are consistent with our citation analysis, which revealed that the highly cited guidelines by Jedlitschka et al. (2008) are often not considered in experiments published in IST, TSE, JSS, and ESE. Furthermore, although replication studies could be expected to show lower fitness due to their special reporting requirements, we observe that they exhibit conformance similar to papers reporting original studies.

5 Discussion

In this section, we first discuss our findings in the light of the formulated propositions. Then, we describe implications for research on software engineering experiments. Finally, we reflect upon threats to validity of our study.

5.1 Propositions and Findings

We formulated several propositions that motivated our study. We discuss the findings for each of them in turn.

Proposition 1 (Patterns)

described the expectation that clear reporting patterns can be identified. We actually found that the process model that we automatically discovered from our event log of all papers is overwhelmingly complex and that no two papers exhibit the same sequence of reporting activities (Section 4.2). We argued in support of Proposition 1 that routinization within author teams (Gersick and Hackman 1990) and social norms enforcing reporting practices (DiMaggio and Powell 1983) would lead to recurring reporting patterns. As patterns apparently exist only to a limited extent, there is doubt whether routinization and social norms are effectively in place.

Proposition 2 (Conformance)

argued that compliance of reporting with established reporting guidelines could be expected. We found that conformance-checking measures fall in a middle range between 0.6 and 0.7 (Section 4.3). Apparently, institutional pressures are not strong enough, with the effect that it is possible to publish experiments in the best software engineering journals without strictly following established guidelines.

Proposition 3 (Evolution)

formulated the expectation that reporting practices would have evolved over time. Our findings suggest that an increasing uptake of guidelines has not happened (Section 4.3). There is at least no evidence of an increase in fitness with established guidelines. This observation holds for the overall period from 2000 to 2020, and the publication of the guideline by Jedlitschka et al. (2008) does not seem to have made a difference either. Following the arguments by Levitt and Nass (1989) on the similarity of textbooks in scientific fields, this could be interpreted as a sign that the discipline has not yet matured and converged in terms of reporting.

Proposition 4 (Contingency)

argued that similar reporting might be observable for subgroups of papers. What we found is that the clusters we identified are largely distinguished by their level of guideline compliance rather than by shared characteristics of the papers themselves (Section 4.4). This suggests that the differences are unsystematic.

Proposition 5 (Factors)

related differences in reporting to various factors. We found that papers citing reporting guidelines conform more closely (Section 4.5) and that whether an experiment is a replication makes no difference in fitness (Section 4.6). Also, using the reporting of effect sizes (Kampenes et al. 2007) as a proxy for quality, the results showed no correlation between quality and conformance with a guideline. These observations suggest that differences in reporting are largely unsystematic; the only explanation for differences that we found is whether established reporting guidelines were explicitly considered or not.

5.2 Implications for Reporting Software Engineering Experiments

Our findings have various implications for further developing reporting practices of software engineering experiments.

First, we were surprised by the observation that the variation in reporting sequences is large and that 57 of the 168 analyzed papers do not refer to any guideline at all. This finding can be interpreted as a sign of low institutional pressure (DiMaggio and Powell 1983) to adhere to experiment reporting guidelines in software engineering journals. This raises the question of why that might be the case. One may speculate that it is connected with the limited coverage of empirical methods in software engineering curricula. The IEEE Software Engineering Body of Knowledge (Version 3) covers experiments and empirical methods on eight pages, albeit only in the recently added 15th and final chapter (Bourque et al. 2014). This is already an improvement over the previous version (Abran et al. 2004), which did not mention empirical methods at all. One direction for improving the situation could be to involve method experts more extensively in reviewing experiment papers, alongside subject-matter experts. Recognized guidelines, such as the ones by Jedlitschka et al. (2008) later endorsed by Wohlin et al. (2012), are available and can also be helpful during the review process. It is surprising that many papers are incomplete even in those parts where the guidelines are easy to comply with, such as a structured introduction and conclusion and an explicit discussion of threats to validity.

Second, our findings also raise the question of whether some potential weaknesses of the guidelines could explain at least a share of the non-conformance. We observe that the description of hypotheses and variables can be found in different sections, although Jedlitschka et al. (2008) suggest always presenting them as part of the experimental planning section. Also, few papers used the execution section to describe deviations, and the ones that did often misinterpreted its purpose, describing the experimental procedure without providing any details about deviations. Such misinterpretations could potentially be addressed by sharpening the section’s title or its description. In defense of Jedlitschka et al. (2008), however, it is also the authors’ responsibility to carefully read and reflect upon the purpose of each reporting activity.

Third, guidelines like the ones by Jedlitschka et al. (2008) should be further evaluated regarding their completeness. Indeed, as our study design focuses largely on the conformance between Jedlitschka et al. (2008) and individual papers, we cannot directly observe to what extent authors felt the need to report experiment-related content that is not covered by these guidelines. We recognize the increasing importance of theorizing for empirical research in software engineering (Hannay and Sjøberg 2007; Wohlin et al. 2015; Ralph 2018; Ralph et al. 2021). In the neighboring field of information systems research, theorizing is given precedence over research design (Recker 2012), and experimental guidelines emphasize the importance of theory for justifying the overall research design (Burton-Jones et al. 2009). The reporting promoted by Jedlitschka et al. (2008) might be too restricted to provide appropriate space and attention for theorizing. A theoretical discussion would be a useful addition to the background section, which in Jedlitschka et al. (2008) only includes research technology, alternative technology, related studies, and relevance to practice. Future research, for instance an interview-based study, can help to develop a deeper understanding of guideline usage and its challenges.

5.3 Threats to Validity

As for any empirical study, threats to validity have to be considered for our study. We focus on conclusion, construct, and external validity. As our design does not single out causality, the usual threats to the internal validity of correlational studies apply. Beyond the reflection that follows, we made a replication package availableFootnote 7, which is meant to help other researchers criticize and build upon our arguments.

Conclusion validity:

is defined as the degree to which the conclusions we reach about the relationships in the study’s data are reasonable (Trochim and Donnelly 2001). Frequent threats to conclusion validity emerge from the erroneous inference of relationships that are not backed by the data, measurements, and analysis results (Trochim and Donnelly 2001). Our study established a correlational argument about the connection between the publication of reporting guidelines for software engineering experiments and the practice of reporting. The various weaknesses of correlational studies in terms of conclusion validity, such as the ones previously mentioned, apply. However, correlational studies such as surveys are appropriate for exploring, describing, and explaining a phenomenon (Recker 2012) such as experimental reporting in software engineering, as in our study. To address potential threats to conclusion validity, we iterated our coding protocol until we achieved high inter-rater reliability in terms of Kappa (0.91). Furthermore, we conducted several in-depth analyses on our sample, which gives us confidence that the conclusions drawn from the data are credible and reasonable. A limitation of our design is the lack of insight into the rationale why authors and reviewers do not adopt or enforce guidelines more strongly. A qualitative research agenda is needed to understand this phenomenon better.
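For reference, inter-rater reliability of the kind reported above can be computed as in the following sketch; it assumes Cohen's kappa as implemented in scikit-learn (the exact kappa variant is an assumption), and the two raters' labels are invented placeholders rather than our coding data.

    # Illustrative inter-rater agreement check (assumed: Cohen's kappa via scikit-learn).
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical codes assigned by two raters to the same six paper fragments.
    rater_1 = ["Goals", "Participants", "Analysis", "Threats", "Conclusions", "Goals"]
    rater_2 = ["Goals", "Participants", "Analysis", "Analysis", "Conclusions", "Goals"]

    print(f"kappa = {cohen_kappa_score(rater_1, rater_2):.2f}")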

Construct Validity:

is relevant for how we constructed our dataset (Wohlin et al. 2012). Although we adopted established procedures for coding research papers in literature reviews, as applied by Recker et al. (2019), we can identify three construct validity threats related to the measurement of conformance between papers and guidelines. The first one is whether the sections of a paper actually describe what their respective titles promise. This threat could be mitigated by reading and coding the full section contents. However, we restricted our coding to the explicit formal structure of sections, which means that relevant content is not visible to our coding procedure if it is not emphasized in a section title. The second threat is that papers might have cited guidelines other than the ones considered in our study, as we focused on guidelines from software engineering. If authors find guidelines from other fields useful, these could serve as a basis for revising software engineering guidelines.

Finally, the third threat relates to the situation in which a reporting guideline is cited but not followed. This issue can be addressed by performing conformance checking on the subset of papers that cite at least one guideline.

External Validity:

is concerned with the generalization of the findings (Trochim and Donnelly 2001). A threat to external validity thus points to limitations in generalizing the results to a larger context. Our study sampled papers reporting experiments from the four most prestigious software engineering journals published between 2000 and 2020. A possible limitation is that our study does not include experiment papers published in other software engineering journals or presented at conferences during the observed period. As our citation analysis of the Jedlitschka et al. (2008) guideline in Section 4.5.2 has shown, this guideline is also cited in experiment papers from other journals and conferences that could potentially fall within the scope of this study. This restricts our findings and conclusions to the selected journals and the chosen time frame. However, we can still argue that our results are relevant for the whole field of software engineering experiments, because highly ranked journals can be regarded as best-in-class examples for other journals and conferences. Another threat to external validity is that we manually selected the papers included in our study. To mitigate this, we followed the guidelines by Kitchenham and Charters (2007) and defined formal criteria to make the manual selection process transparent.

6 Conclusion

This paper has presented the findings of a mixed-method study into the reporting practice of controlled experiments published in major software engineering journals over the period from 2000 to 2020. To this end, we combined coding techniques from qualitative research with process mining techniques. We find that (1) many experimental papers in top software engineering journals do not report all of the information that published guidelines suggest is required, (2) there is no de facto standard sequence for reporting these items, (3) many experimental papers do not cite any reporting guidelines, and (4) these findings, on average, hold for early as well as for recent papers.

Our findings have important implications both for the practice of reporting software engineering experiments and for guideline design. First, we discussed the potential of using guidelines more explicitly during the review process. Second, we suggest partially refining the guidelines, among other things by giving theory a more explicit place. Some of the excellent experiment papers that consider reporting guidelines diligently can serve as role models for future experiment papers in software engineering.

There are several important questions for which our research design did not offer an answer. First, a central point of discussion during the review process was the connection between adherence to guidelines and a paper’s research quality. By focusing only on the four top journals, we assumed that quality had been assured in the review process. The variation in guideline conformance leaves room for doubt and calls for a research strategy that builds on expert judgment of research quality. Second, our analysis does not offer insights into the motives of authors to either adopt or ignore reporting guidelines. A qualitative research strategy is required to investigate this point further. We hope that our appendix and the replication package corresponding to this paper will help to address these points in future research.