A Model-Based Approach for Specifying Changes in Replications of Empirical Studies in Computer Science

Context: The need to replicate empirical studies in Computer Science (CS) is widely recognized among the research community as a means to consolidate acquired knowledge by generalizing results. It is essential to report the changes of each replication to understand the evolution of experimental validity across a family of studies. Unfortunately, the lack of proposals for this purpose undermines these objectives. Objective: The main goal of our work is to provide researchers in CS, and in other research areas, with a systematic, tool-supported approach for reporting changes in the replications of their empirical studies. Method: Applying Design Science Research (DSR), we have developed and validated a composite artifact consisting of (i) a metamodel of the relevant concepts of replications and their changes; (ii) templates and linguistic patterns for reporting those concepts; and (iii) a proof-of-concept model-based software tool that supports the proposed approach. For its validation, we have carried out a multiple case study including 9 families of empirical studies from CS and Agrobiology. The 9 families encompass 23 replication studies and 92 replication changes, for which we have analyzed the suitability of our proposal. Results: The multiple case study revealed some initial limitations of our approach related to threats to experimental validity and context variables. After several improvement iterations, all 92 replication changes could be properly specified, including their qualitatively estimated effects on experimental validity across the family of experiments and the corresponding visualization. Conclusions: Our proposal for the specification of replication changes seems to fit the needs not only of replications in CS, but also of those in other research areas. Nevertheless, further research is needed to improve it and to disseminate its use among the research community.


Introduction
As in most research areas, empirical studies, especially controlled experiments, can be used in Computer Science to rigorously evaluate technologies, methods, and tools, and to help guide further research by revealing existing problems and difficulties [17]. However, for their results to be generalizable, reported experiments must be replicated in different contexts, at different times, and under different conditions [15] by means of so-called families of experiments. As Basili et al. introduced in 1999 [8], a family of experiments consists of a baseline or original study, followed by a set of replications that answer the same research questions as the original study. Later, in 2018, Santos et al. [57] proposed the following premises for considering a series of experiments as a family: (i) access to the raw data is guaranteed; (ii) researchers know the exact setup of each experiment; and (iii) at least three experiments evaluate the effects of at least two different technologies (or methods or tools) on the same response variable. Nowadays, it is widely accepted in the research community that the knowledge obtained from a family of experiments is more robust and reliable than that obtained from a single isolated experiment, whose results can only be considered preliminary [64,68].
Let us consider for a moment a novice researcher in Computer Science who decides to replicate an original study or a previous replication of an original study. According to Carver [16], she needs to carefully review the entire family of previous studies in order to acquire the necessary knowledge to properly adapt the experimental settings, improve its design (if possible) to increase its experimental validity, or avoid making the same mistakes as in previous studies. To alleviate this situation, several initiatives have recently emerged, the most relevant being Open Science [43], which promulgates that making available the datasets, the analyses, and a preprint version of an experiment and its related software provides valuable knowledge, allowing not only any interested party to audit it, but also others to build directly upon the previous work through reuse and repurposing [5].
A complementary approach to increase the visibility and reproducibility of empirical studies in Computer Science is promoting the availability of artifacts such as laboratory or replication packages [8,66]. According to [64], replication packages should include not only datasets, analyses, and experimental material, but also guidelines to conduct a replication and a summary of the evolution of the experiment across the family [67], promoting traceability among replications.
Despite all these efforts, Shepperd et al. [62] stated that although there is consensus that replications are essential to consolidate the findings of empirical studies, there is a need for better reporting of both original and replicated studies. To report a replication of a controlled experiment, Computer Science researchers usually take as a reference seminal works such as those by Wohlin et al. [72], Jedlitschka et al. [33], or Juristo and Moreno [35], complemented with Carver's replication guidelines [16]. The recommendations in the aforementioned works are meant for controlled experiments in general, not for the specification of the changes that often arise during replications to address threats to experimental validity and thus improve the original study and the validity of its results [14].
To the best of our knowledge, there is a lack of specific proposals to specify the changes introduced between replications of the same family of experiments. In this situation, researchers either report the experimental setting of the replication without highlighting the changes from the original study or previous replications [24,31,44,58], or report the changes in an ad-hoc manner, describing them in narrative text [1,4,49,50]. This lack of detail in the specification of changes leads to problems when carrying out new replications, since the replicator has difficulties not only in deciding which aspects of the experimental setup are best suited to the new environment, but also in avoiding mistakes made in previous experiments that threatened their validity [5,16]. On the other hand, proper knowledge of the changes enables better meta-analysis of the family, since changes impact issues such as the definition of moderator variables, the definition of the aggregated family design, or the choice of the most appropriate analysis, among others [10].
The proposal described in this article aims to specify replication changes in a structured, systematic way, identifying aspects such as the rationale for each change or its effect on experimental validity, which can be quantified and used to visualize the evolution of the entire family of experiments, thus supporting decision making for new replications [16]. By specifying changes explicitly, our proposal helps to reduce not only the so-called tacit knowledge [63], but also the experimenter bias [59] that occurs when replicators have to interact with the original experimenters to request missing information about the family, specifically about its changes.
In this work, we focus on the specification of replications and their corresponding changes in empirical studies in general and controlled experiments in particular. For that purpose, we have adopted Design Science Research (DSR) as our research methodology, creating and evaluating an artifact designed to solve an identified problem [70]. In our case, the developed artifact consists of (i) a metamodel, developed through an iterative and incremental process, that represents the relevant concepts about replications and their changes; (ii) templates for reporting the information included in the metamodel, thus promoting reusability, avoiding redundancies, and tracing the effects of changes on experimental validity across the family of experiments; and (iii) CAESAR, a proof-of-concept model-driven software tool developed to provide an initial evaluation of and support for our proposal, including the aforementioned templates and the automatic visualization of the effect of the changes on experimental validity across the family of experiments.
The rest of the paper is organized as follows. Section 2 summarizes a brief state of the art on families of experiments, replications, and changes in Computer Science. Section 3 presents the three components of the developed artifact, i.e. the metamodel, the templates, and the tool support. Section 4 reports a multiple case study carried out to evaluate the artifact, encompassing 9 families of experiments from Computer Science and Agrobiology. Related work is discussed in Section 5 and, finally, concluding remarks and future work are presented in Section 6.

Background
According to [2], a replication can be defined as the repetition of an experiment, either following the original experiment as closely as possible, or with deliberate changes to the original experiment's settings, in order to achieve, or ensure, greater validity in the research carried out. Despite the importance of replications, and although their practice has increased in recent years [18,20], the number of replications in Computer Science in general, and in Software Engineering in particular, remains low [67]. Among the causes of this situation are (i) the tacit knowledge not made explicit in replication reports [63]; (ii) the lack or incompleteness of laboratory packages [67]; (iii) the lack of agreement on common terminology and criteria for reporting replications [16]; and, last but not least, (iv) the effort and resources needed to carry out an experiment [20].
Several taxonomies have been proposed to classify the different types of replications. There is wide agreement on using the term internal replication for replications carried out by the same experimenters at the same site as the original study, whereas external replication is used when the experimental team and site differ from the original ones. With respect to the process of carrying out a family of experiments, once the original experiment concludes, it is advisable to carry out internal replications to confirm preliminary results and adjust experimental settings. Then, external replications can be carried out to generalize the provisional results from the internal replications [13].
Nevertheless, there is a lack of agreement on the terminology used in other replication taxonomies. For example, according to the degree to which the original experiment protocol is followed, Shull et al. [64] classify replications as exact and conceptual, whereas Juristo and Vegas [38] use closed and differentiated instead. Basili et al. refer to strict replications when the original study is duplicated as accurately as possible [8], whereas Gómez et al. [28] classify replications as literal, operational, and conceptual depending on the changes carried out and their purpose. Other taxonomies, such as those proposed in [2] and [6], use some of the terms commented above. In [42], de Magalhães et al. compare different taxonomies, concluding that any attempt to establish a replication typology must be made with care since, as stated in [27], there are authors who use the same term for different types of replications and, conversely, different terms to refer to the same type of replication.
In our opinion, the main concept underlying the different proposed taxonomies is that of the changes between the original study and its replications. In particular, in human-oriented experiments such as those common in Software Engineering, each change in the experimental settings, even if the protocol of the original experiment is followed, can eventually produce different results [23]. As a consequence, specifying the changes and their motivation allows the comparison of results and increases knowledge by analyzing the conditions under which the results were obtained, thus encouraging further replication.
Last but not least, and although several authors [16,62] have reported their relevance, only a few published replications report their related changes. When included, changes are reported in narrative text, e.g. [1,32,60], or in a simple, non-standard tabular form, with one row per change including only a few properties, such as the elements affected by the change (population, experimental design, etc.) or the situation after the change, e.g. [4,49,50].

Model-based proposal for specifying replication changes
In this section, we present our model-based proposal for the specification of replication changes in empirical studies, encompassing a metamodel, a set of templates, and a proof-of-concept software tool. Figure 1 shows the proposed metamodel for replication changes using a UML class diagram. As can be seen, the classes at the top model the context information necessary for understanding replication changes, whereas the classes at the bottom model the structure and properties of the changes themselves. The enumerations on the right of the diagram correspond to the categorical domains of some class attributes. With respect to the replication context, two types of empirical studies are considered: original studies and replications. Both are identified by an acronym, take place at a given site on a given date, and usually have a report that can be accessed via a URL. In the case of original studies, a specification of their goal (which should include at least the cause and effect constructs and the population under study, for example using a well-known template such as GQM [7]) and their description are also included, since they are supposed to be shared by all their replications, thus providing the context needed for understanding related replications and their changes. For replications, their type (internal, external), their purposes (confirm results, generalize results, or overcome limitations), and their changes are recorded, together with their base study, i.e. the study they replicate, which can be an original study or a previous replication, as modeled by the abstract superclass EmpiricalStudy.
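As an illustration, the study hierarchy just described might be sketched as follows in Python; the class and attribute names are taken from the description of Figure 1, but all other details (types, defaults) are assumptions rather than the actual metamodel implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class ReplicationType(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"

class Purpose(Enum):
    CONFIRM_RESULTS = "confirm results"
    GENERALIZE_RESULTS = "generalize results"
    OVERCOME_LIMITATIONS = "overcome limitations"

@dataclass
class EmpiricalStudy:
    """Abstract superclass shared by original studies and replications."""
    acronym: str                      # public unique identifier
    site: str
    date: str
    report_url: Optional[str] = None  # URL of the published report, if any

@dataclass
class OriginalStudy(EmpiricalStudy):
    goal: str = ""         # e.g. a GQM-style goal statement
    description: str = ""  # shared context for all its replications

@dataclass
class Replication(EmpiricalStudy):
    type: ReplicationType = ReplicationType.INTERNAL
    purposes: list = field(default_factory=list)  # list[Purpose]
    base_study: Optional[EmpiricalStudy] = None   # original study or a previous replication
    changes: list = field(default_factory=list)   # replication changes
```

Because base_study is typed as EmpiricalStudy, a replication can be based either on the original study or on a previous replication, mirroring the abstract superclass in the metamodel.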

Metamodel for Replication Changes
Regarding replication changes, they are identified by a descriptive name and must describe narratively what changes from the base study (base situation) to the replication (replication situation). The purpose of the changes must also be recorded, as well as their impacts (if any) on experimental validity (modeled by the ChangeImpact class), following the validity taxonomy described by Wohlin et al. in [72] and modeled by the Validity enumeration. For each impact of a change, its rationale, i.e. why the change affects a given validity type, must be recorded together with its effect, using a 7-point linear scale modeled by the Effect enumeration. Note that associating a value on a linear scale ranging from substantially (-3), moderately (-2), and slightly (-1) decreases to their positive counterparts, with does not affect (0) as the central point, gives an idea of how the changes in the different experiments of a family increase or decrease the original study's experimental validity, and it can be easily visualized graphically, as will be shown in Section 3.3.
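As a sketch of this part of the metamodel, the Validity and Effect enumerations and the ChangeImpact class could be rendered as follows in Python; the class and attribute names follow the description above, while everything else (value spellings, types) is an illustrative assumption, not the paper's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum

class Validity(Enum):
    """Wohlin et al.'s four validity types [72]."""
    CONCLUSION = "conclusion"
    INTERNAL = "internal"
    CONSTRUCT = "construct"
    EXTERNAL = "external"

class Effect(IntEnum):
    """7-point linear scale for a change's effect on a validity type."""
    SUBSTANTIALLY_INCREASES = 3
    MODERATELY_INCREASES = 2
    SLIGHTLY_INCREASES = 1
    DOES_NOT_AFFECT = 0
    SLIGHTLY_DECREASES = -1
    MODERATELY_DECREASES = -2
    SUBSTANTIALLY_DECREASES = -3

@dataclass
class ChangeImpact:
    """One impact of a replication change on one validity type."""
    validity: Validity
    effect: Effect
    rationale: str  # why the change affects this validity type
```

Because Effect is an integer-valued enumeration, effects can be summed directly, which is what makes the accumulated visualization described in Section 3.3 straightforward.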

Change Dimensions
For classifying the different types of changes in replications, we have followed and expanded the dimensions for experimental configuration proposed by Gómez et al in [28], resulting in the classification described below.
Operationalization change In a controlled experiment, the cause construct is operationalized as one or more treatments, whereas the effect construct is operationalized as one or more metrics, which are measured following measurement procedures. Any change related to any of the above items (modeled as the OperationalizationElement enumeration), e.g. changing the duration of a treatment, should be an instance of the OperationalizationChange class.
Population change This class encompasses any change related to the experimental subjects or objects, e.g. changing the experience level or the average age of the experimental subjects with respect to those in the base study.
Protocol change As defined by Juristo and Gómez in [34], the experimental protocol is the setup of the experimental design, experimental material, experimental guides, measuring instruments, and data analysis techniques, which are the elements modeled by the ProtocolElement enumeration in our metamodel. Any change affecting one of those elements, such as changing the tasks that subjects have to perform or using Bayesian instead of frequentist statistics, should be an instance of the ProtocolChange class.
Experimenter change These are changes related to the experimenters and their roles in the replication when compared to the base study. As modeled by the ExperimenterRole enumeration, the roles considered are designer, trainer, monitor, measurer, and analyst.
Context change This class, not present in [28], models any change related to the context in which the replication is carried out compared to the context in which the base study took place. For example, in human-oriented experiments using students as subjects, changing the time of year when the study is conducted from before final exams to after final exams is a change of this kind. In technology-oriented experiments, moving from running the software on a real machine to running it on a virtual machine would also be a context change.
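The five change dimensions above can be sketched as subclasses of an abstract ReplicationChange class; the enumeration literals follow the text, while the defaults and field types are illustrative assumptions rather than the actual metamodel implementation.

```python
from dataclasses import dataclass, field
from enum import Enum

class OperationalizationElement(Enum):
    TREATMENT = "treatment"
    METRIC = "metric"
    MEASUREMENT_PROCEDURE = "measurement procedure"

class ProtocolElement(Enum):
    DESIGN = "experimental design"
    MATERIAL = "experimental material"
    GUIDES = "experimental guides"
    MEASURING_INSTRUMENTS = "measuring instruments"
    DATA_ANALYSIS = "data analysis techniques"

class ExperimenterRole(Enum):
    DESIGNER = "designer"
    TRAINER = "trainer"
    MONITOR = "monitor"
    MEASURER = "measurer"
    ANALYST = "analyst"

@dataclass
class ReplicationChange:
    """Abstract superclass for all change dimensions."""
    name: str
    base_situation: str          # what held in the base study
    replication_situation: str   # what holds in the replication
    purpose: str = ""
    impacts: list = field(default_factory=list)  # list of ChangeImpact

@dataclass
class OperationalizationChange(ReplicationChange):
    element: OperationalizationElement = OperationalizationElement.TREATMENT

@dataclass
class PopulationChange(ReplicationChange):
    pass  # changes to experimental subjects or objects

@dataclass
class ProtocolChange(ReplicationChange):
    element: ProtocolElement = ProtocolElement.DESIGN

@dataclass
class ExperimenterChange(ReplicationChange):
    role: ExperimenterRole = ExperimenterRole.DESIGNER

@dataclass
class ContextChange(ReplicationChange):
    pass  # changes to the context in which the replication is carried out
```

For instance, switching from frequentist to Bayesian statistics would be recorded as a ProtocolChange whose element is data analysis techniques.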

Templates for Replications Changes
Templates have been successfully applied in Computer Science, Software Engineering, and related areas. Examples include the well-known GQM template proposed in 1994 by Basili et al. [7], Wieringa's proposal for specifying the problem under study in DSR, and many others, such as templates for requirements [21,22], process performance indicators [51,52], or metamorphic testing relations [61].
Templates help visualize information in a standard form, which can be easily adopted by practitioners, especially by novice researchers. In order to improve their usability, we have augmented the templates with linguistic patterns (L-patterns) where possible, in a manner similar to [21,22,51,52]. L-patterns are pre-written, parametrized sentences that can be used for filling in some template fields in an easier, less error-prone way.
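As an illustration of the idea, an L-pattern can be thought of as a parametrized sentence instantiated by substituting its named fields; the pattern wording below is a hypothetical example, not one of the paper's actual L-patterns.

```python
# Hypothetical L-pattern: a pre-written sentence with named parameters.
L_PATTERN = ("The {element} was changed from {base_situation} in the base "
             "study to {replication_situation} in the replication, in order "
             "to {purpose}.")

def fill_pattern(pattern: str, **fields: str) -> str:
    """Instantiate an L-pattern by substituting its named parameters."""
    return pattern.format(**fields)

sentence = fill_pattern(
    L_PATTERN,
    element="experimental design",
    base_situation="a between-subjects design",
    replication_situation="a crossover design",
    purpose="control for subject variability",
)
```

Filling a fixed sentence skeleton rather than writing free text is what makes the resulting specifications uniform across replications and easy to compare.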

Proof-of-concept Software Tool: CAESAR
Applying a model-driven development approach, we have built a proof-of-concept tool called CAESAR (ChAngE SpecificAtion for Replications) using the Grails framework (https://grails.org/) and the metamodel described in Section 3.1. CAESAR is deployed for demonstration at https://caesarus.herokuapp.com?lang=en, providing a simple interface for manipulating instances of the entities in the metamodel and displaying their information using the templates described in the previous section, as shown in Figures 4 and 5.
As can be seen in Figure 4, the context template for replications has been augmented with information corresponding to the effects of their changes (expressed on a 7-point linear scale from -3 to +3, as commented in Section 3.1) on the four types of experimental validity proposed in [72], which can be used by experimenters when reporting their replications according to their preferences.

Figure 5: Example of change template in CAESAR for one of the changes in [11]

Apart from validating the metamodel displayed in Figure 1 and adding computed information to the templates, the main added value of building this tool is the possibility of visualizing the evolution of experimental validity along a family of experiments. For that purpose, a template for original studies has also been developed in the tool, in which all the studies in an associated family of experiments are queried and the effects of their changes are displayed cumulatively from the original study (for which all types of validity are assumed to start at zero) as a line graph for the four aforementioned types of experimental validity, as shown in Figures 6 and 7.
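The computation behind these cumulative line graphs can be reproduced with a short sketch: starting from zero for every validity type at the original study, the effects of each replication's change impacts are summed and accumulated. The Python below is an illustrative reconstruction under assumed data shapes, not CAESAR's actual code.

```python
VALIDITY_TYPES = ("conclusion", "internal", "construct", "external")

def validity_series(replications):
    """Cumulative validity effects for a family of experiments.

    replications: the family's replications in chronological order, each
    given as a list of (validity_type, effect) pairs on the -3..+3 scale.
    Returns, per validity type, a series starting at 0 (the original study)
    with one accumulated point per replication, ready to plot as lines.
    """
    series = {v: [0] for v in VALIDITY_TYPES}
    for impacts in replications:
        totals = dict.fromkeys(VALIDITY_TYPES, 0)
        for validity, effect in impacts:
            totals[validity] += effect
        for v in VALIDITY_TYPES:
            series[v].append(series[v][-1] + totals[v])
    return series

# Hypothetical family: two replications, each with its change impacts.
family = [
    [("internal", +2), ("external", -1)],   # first replication
    [("internal", +1), ("construct", -2)],  # second replication
]
print(validity_series(family)["internal"])  # prints [0, 2, 3]
```

Plotting one line per validity type over these series yields graphs of the kind shown in Figures 6 and 7, where an overall downward line flags a validity type that new replications should try to improve.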
This kind of visualization provides valuable information about a family of experiments that is not usually reported and that can drive changes in new replications. For example, note how in the case of the family of experiments described in [3] and displayed in Figure 6, three types of experimental validity increase over time (especially internal validity), whereas in the family described in [36,39,40] and displayed in Figure 7, internal validity substantially decreases, mainly because of the lack of resources in follow-up replications, as commented later in Section 4.1.1. In the latter case, the graph clearly indicates that experimenters carrying out a new replication should try to increase internal validity whenever possible. CAESAR has been successfully used in the validation of our proposal, which is discussed in the next section.

Multiple Case Study for Artifact Evaluation
As commented in [72], experiments in Computer Science may be classified as human-oriented or technology-oriented depending mainly on the nature of experimental subjects. In order to evaluate the suitability of our artifact, we conducted a multiple case study involving both human and technology-oriented families of experiments in Computer Science.
Since we had the opportunity to meet experimenters from a different discipline, Agrobiology, we decided to include some families of experiments from that discipline in the multiple case study, with a twofold purpose. On the one hand, to check whether our artifact could also be applied to a completely different type of experiment, using plants as subjects, which share some characteristics not only with human-oriented experiments, but also with technology-oriented ones. On the other hand, to identify reported information that could be incorporated into the templates and thus improve them.
The research method followed for such evaluation was based on the case study research process proposed by Runeson et al in [54,55]. An overview of the phases of the multiple case study, which are detailed in the next sections, is depicted in Figure 8.

Design and Data Collection
In the design phase, the families of experiments used to evaluate our proposal were selected by applying the criterion of having a relevant number of replications and changes, in order to cover as many different options in the proposed templates as possible and to answer the following research questions (RQs):

RQ 1 (Expressiveness) Can the proposed templates be used to specify reported replication changes properly? Do they need to be augmented to include more reported information? Is the usually reported information sufficient to specify all the information included in the templates? Are they suitable for disciplines other than Computer Science?
RQ 2 (Usability) Do researchers find the proposed templates useful? Do they find them easy to use? Is the terminology used understood by researchers outside Computer Science community?
For each case study, a brief description of the selected families, the protocol followed, and the data collected is presented below. Note that the collected data, i.e. the specifications of all the selected replication studies and their changes using a LaTeX version of the proposed templates, are publicly available at Zenodo [19], although the meeting minutes are not included for the sake of the privacy of the participating researchers.

Human-Oriented Case (HOE-Case)
For this first case study, we selected three families of human-oriented experiments dealing with (i) the effect of mindfulness on conceptual modeling performance (Mind) [9][10][11], (ii) requirements elicitation (Req) [3], and (iii) code testing techniques (Test) [36,39,40], due to our familiarity with these topics. This case study was designed as a self-evaluation case study, so all the replications and changes were specified by ourselves after a close reading of the corresponding reports. All the limitations and issues found were registered for further analysis and potential artifact improvement.

Technology-Oriented Case (TOE-Case)
To evaluate our proposal with other types of Computer Science experiments, in which the subjects are not human beings, two families of experiments on automated software testing (Testing) [47] and software product line testing (SPL) [56] were also selected, at the suggestion of the researchers who agreed to participate and to whom we had direct access. In this case study, the selected replication changes were also specified by ourselves, but several meetings were then held with the researchers who carried out the experiments to validate our specifications and obtain feedback from them, recording all the relevant information.

Agrobiology (Agrobio-Case)
In order to evaluate our proposal in other areas of knowledge, we selected four families of experiments belonging to Agrobiology, dealing with soil decontamination (Soil) [30,45,53], harvesting systems (Harvest) [48], extraction of olive oil components (Olive) [26], and the influence of diet on cholesterol accumulation (Diet) [46]. These families were selected at the suggestion of the Agrobiology researchers who agreed to participate and to whom we had direct access.
Several meetings were held where we followed a protocol consisting of the following steps: (i) we explained the templates to ensure that the researchers understood how to use them; (ii) we asked each researcher to select one of her experiments with at least one replication and, if possible, already published; (iii) for the selected experiment, we asked each researcher to fill in the corresponding templates for all the experiment replications and their associated changes, providing us with as much feedback as possible; and (iv) we asked the researchers whether they had found any limitations, usability problems, or unknown terminology using the templates, recording all the details. By the time we conducted this case study, an earlier version of the CAESAR tool was already available, and some researchers agreed to use it for reporting their replications, providing very valuable feedback.

Analysis and Report
In this section, the results of the three case studies, which are summarized in Table 1, are analyzed and reported.

HOE-Case analysis
When this first case study was conducted, the metamodel of our artifact was at an early stage. This is the reason why we designed it as a self-evaluation case study and why most of the evolution of the metamodel (and therefore of the templates and the CAESAR tool) is based on the findings obtained during its conduction, which are summarized in Table 1: some changes impact more than one validity type, which resulted in an evolution of the structure for modeling changes, impacts, and effects, and in the addition of the rationale attribute to the ChangeImpact class; 11 out of 33 (33%) changes could not be entirely specified with the early version of the metamodel and templates because they impacted more than one validity type; in general, the purpose of changes is seldom reported; and impacts on validity and modified dimensions are not always straightforward to identify. As a matter of fact, only 76% of the changes in the selected families of experiments could be specified with the initial version of the metamodel, but 100% with the evolved version. One of the main evolutionary changes in the early metamodel was motivated by the fact that some of the replications in the Mind family were based not on the original study but on a previous replication. Since replications in the initial metamodel were associated with original studies only, we had to include the abstract class EmpiricalStudy to allow replications based on replications, in a similar way to the Composite design pattern [25]. As shown in Figure 9, in this evolutionary change we also introduced some missing information, such as date, site, and report, and an acronym as a public unique identifier to facilitate the referencing of empirical studies and their traceability.

On the other hand, during the specification of the replication changes in this case study, we observed that some of them affected more than one type of experimental validity, so we evolved the metamodel accordingly, as shown in Figure 10. Sometimes, the impact of a change on a specific type of experimental validity was not obvious, so we also decided to add the rationale attribute to the new ChangeImpact class, letting the experimenters register why they thought a given change increased or decreased a given validity type. In a subsequent iteration, we considered that this effect could be subjectively quantified using a 7-point linear scale, modeled by the value-based Effect enumeration and the corresponding effect attribute in the ChangeImpact class, as previously commented in Section 3.1. It is worth mentioning that in most of the reported replications the purposes of the changes were rarely stated, making it difficult to identify their effects on validity and the modified dimensions.

TOE-Case analysis
The main goal of this case study was to confirm the results of the previous one and to consolidate the evolutionary changes to the artifact. Using the evolved templates, we were able to specify all the replication changes, although it took us some time to agree on the dimensions modified by some changes.
Overall, considering that most of the limitations of the initial artifact had been identified in the first case study, the participating researchers found our approach satisfactory, providing very positive feedback.

Agrobio-Case analysis
After validating our approach for the specification of changes in Computer Science experiments, the main goal of the third case study was to test it in a different research area, with users other than ourselves. One of the first issues identified during this case study was the need for context in order for replication changes to be understood, due to their high specificity. Clearly, that context had to be provided by the original study, since it is shared by all its replications. As shown in Figure 9, in the early version of the metamodel only the goal of the original study was registered, whereas in the evolved metamodel a description is added to provide such context. See the description of the original study of the Soil-2018 replication in Figure 11 for a clear example.
In families of experiments in Agrobiology, it is common practice to change the growing medium of the plants from Petri dishes to culture chambers, to greenhouses, and finally to natural soils. It is also common to repeat the same experiment in different seasons to confirm results. These kinds of changes, which we were not able to classify in any of the change dimensions proposed by Gómez et al. [28], were the reason to expand their proposal with the context dimension. With this new change dimension, we were also able to classify some changes in the HOE-Case in which the experiments were held with students at different moments of the academic year, e.g., at the beginning of the course or during the examination period, and that we had not been able to completely specify before.
Regarding the user experience of the participating researchers, they found the templates and L-patterns very useful for reporting their replications, helping them to focus on the essential information. Those who agreed to use CAESAR, despite its being a proof-of-concept tool, found it very useful for generating reports of their experiments and showed great interest in using it in the future. Nevertheless, some differences in the terminology used in the templates were detected. In particular, the concept of change dimension was foreign to them, which seemed reasonable considering that it is a recent concept proposed by a Computer Science researcher [28]. Other concepts foreign to them were those of experimental validity and validity threats, which was very surprising to us considering the importance they are given in our discipline. After several meetings, we found out that they handle a list of usual experimental risks with their corresponding mitigation actions, but they use the term avoiding experimental errors instead of minimizing validity threats, as we do in Computer Science. On the other hand, they preferred the term repetition to replication.

Figure 11: Specification of the Soil-2018 replication [45] in the CAESAR tool

Threats to Validity
Runeson et al [55] provide a detailed description of threats to the validity of case studies. In this section, we describe these threats and the actions taken to mitigate them in our multiple case study.

Construct validity
This validity is concerned with the degree to which the operationalization actually reflects what is to be investigated. According to [73], using multiple sources of evidence during data collection increases construct validity, as has been our case. It is also recommended in [73] that a draft of the case study report be reviewed by key informants. In our case, the participating researchers have not only followed the evolution of the artifact across the whole evaluation process, but have also reviewed a draft report of our multiple case study.

Internal validity
This validity is of concern when causal relations are examined. The main hypothesis of our study is that the proposed artifact is useful for the specification and reporting of replications, not only in Computer Science, but also in other areas of knowledge. The only potential threat to internal validity is that differences in terminology could affect the use of the artifact by the participating researchers, especially those from research areas other than Computer Science. To mitigate this threat, we explained the use of the templates and assisted the Agrobiology researchers during the Agrobio-Case, apart from investigating the differences and similarities between their terminology and that used in Computer Science, in order to improve our communication with them.

External validity
This validity is concerned with the generalization of results to cases that have common characteristics. The fact that the templates were first used by ourselves could limit the generalization of the findings. To overcome this threat, we invited other researchers from our own research area and from other research areas to participate in the study, thus reducing the potential bias. Nevertheless, further research is needed to validate the artifact in more replications, both within Computer Science and in other disciplines. To this end, the CAESAR tool is available to the scientific community, so other researchers can report their replication changes using it and provide feedback for further improvements.

Reliability
This aspect is concerned with obtaining similar results when the study is carried out by other researchers. Since the specification of all replications with their corresponding changes is available at Zenodo [19], the study can be repeated by other researchers, who can then compare their results with ours.

Related work
Regarding replications in Computer Science, several literature reviews show the relevance of the topic [12,18,20,42]. These reviews are systematic mapping studies dealing with general aspects such as types of replications, conceptual frameworks, or addressed research topics. However, to the best of our knowledge, there is a lack of literature reviews on more specific aspects of replications, such as the reporting of changes. Despite the recommendations by Carver [16], Shepperd et al [62], Juristo and Vegas [37,38], Gómez et al [28], and Baldassarre et al [6] to report replication changes, no concrete proposals have been provided on how to specify them in Computer Science.
The works most closely related to ours are [2], [65], and [69], which propose a tabular form to summarize the experiments that compose a family. In [2], replications are reported including their motivation, their changes (in unstructured narrative text), the confirmation or non-confirmation of results in previous experiments, other characteristics such as subjects, tasks, and materials, and whether hypotheses or research questions changed from previous experiments. In [65], a tabular replication summary is used, including information on, among others, experimental design, laboratory package, material preparation, replication operation, data analysis, and experimenter evaluation. Finally, in [69], the specification of each experiment is summarized including elements such as factors, treatments, response variables, design, experimental objects, and participants.
After reviewing these proposals, we consider that all of them provide an in-depth overview of the configuration of each experiment in a family. However, to identify replication changes, the table content corresponding to the different replications must be compared, whereas in our proposal, apart from being model-based and having tool support, each change is specified separately in an explicit manner, thus reducing tacit knowledge as recommended in [41].
In a recent survey closely related to the approach presented in this article, 137 articles published between 2013 and 2018 reporting at least one replication were analyzed [18]. Most of the replication changes were defined in controlled experiments, confirming that this is the most frequently applied empirical strategy in the research area [29]. In general, most changes, also referred to as adjustments or differences, were described with respect to an original experiment using natural language or tabular forms; only about 25%-30% of the studies reported the purpose of the changes and their effect on experimental validity.

Conclusions and Future work
In the context of replications of empirical studies in Computer Science, and applying DSR, we have developed and evaluated a composite artifact to systematically specify and report replication changes. We have developed the artifact following a model-based approach, generating (i) a metamodel formalizing all the relevant concepts related to replication changes; (ii) templates for reporting replication contexts and replication changes using the information in the metamodel, together with L-patterns to facilitate their use; and (iii) a proof-of-concept tool based on the metamodel that supports the management of the proposed templates. We have also evaluated the artifact by means of a multiple case study in which 92 replication changes, corresponding to 23 replication studies in 9 families of experiments in the areas of Computer Science and Agrobiology, were specified by ourselves and by other researchers and subsequently analyzed. The evaluation revealed some initial limitations of our approach, but they were overcome after several improvement iterations. One of the most relevant improvements was the qualitative estimation of the effect of changes on experimental validity, which allows visualizing the evolution of the validity of a family of experiments, thus supporting decision-making.
Our model-based proposal for the specification of replication changes seems to fit the needs not only of Computer Science, but also of other research areas such as Agrobiology, and has received very positive feedback from the participating researchers of both disciplines.
As future work, our aim in the medium term is to apply our templates not only to report changes of already conducted replications, but also to use them during replication design, as a means of analyzing and documenting the purpose and effects of replication changes before they are performed. This design-oriented approach requires a more advanced version of the CAESAR tool, including a virtual assistant, probably a chatbot, that guides the researcher through the process and suggests, for example, potential effects of changes on experimental validity depending on the modified dimension or on other criteria that might be identified from experience. In the short term, we want to provide Computer Science researchers with a consolidated LaTeX package for using the proposed templates in their articles when they need to report replication changes, thus disseminating our artifact in the community.