1 Introduction

Model-based development [1, 2] is a widely appraised and promising methodology for tackling the complexity of modern software-intensive systems, notably embedded systems in various domains such as transportation, telecommunications, or industrial automation [3]. It promotes the use of models in all stages of development as a central means of abstraction and as a starting point for automation, e.g., for simulation, analysis or software production, with the ultimate goal of increasing productivity and quality.

Consequently, model-based development strongly depends on good tool support to fully realize its manifold promises [4]. Research on model-based development often reports on novel methods and techniques for model management and processing, which are typically embodied in a tool. In addition to theoretical and conceptual foundations, some form of evidence is required concerning the effectiveness of these tools, which typically calls for an experimental evaluation [5,6,7].

In turn, experimental results should be replicable in order to increase the validity and reliability of the outcomes observed in an experiment [9, 10]. Therefore, both the tool and the experimental subject data, essentially the models used in the experiments, should be made available following the so-called FAIR principles of Findability, Accessibility, Interoperability, and Reusability [11, 12], aiming at the replicability of experimental results.

In this paper, we investigate to which degree recent research on tools for model-based development of embedded systems meets the requirements for replicability of experimental results. We focus on tools for MATLAB/Simulink (referred to as Simulink, for short), which has emerged as a de facto standard for automatic control and digital signal processing. In particular, we strive to answer the following research questions:

RQ1: Are the experimental results of evaluating tools supporting model-based development with Simulink replicable?

RQ2: From where do researchers acquire Simulink models used as experimental subjects and what are their basic characteristics?

RQ3: Does the replicability of experimental results correlate with the impact of a paper?

We conduct a systematic literature review in order to compile a list of relevant papers from which we extract and synthesize the data to answer these research questions. Starting from an initial set of 942 papers that matched our search queries on the digital libraries of IEEE, ACM, ScienceDirect and dblp, we identified 65 papers which report on the development and evaluation of a tool supporting Simulink, and for which we did an in-depth investigation. Details of our research methodology, including the search process, paper selection and data extraction, are presented in Sect. 2.

In a nutshell, we found that models used as experimental subjects and prototypical implementations of the presented tools, both of which are essential for replicating experimental results, are accessible for only a minor fraction (namely 22% and 31%) of the investigated papers. We further found the results of none of these papers to be fully replicable (RQ1), achieving only partial replicability for 6%. The models come from a variety of sources, e.g., from other research papers, industry partners of a paper’s authors, open-source projects, or examples provided along with Simulink or any of its toolboxes. Interestingly, the smallest fraction of models (only 3%) is obtained from open-source projects, and the largest one (about 18%) is provided by industrial partners (RQ2). While we think that, in general, the usage of industrial models strengthens the validity of experimental results, such models are often not publicly available due to confidentiality agreements. These findings are confirmed by other research papers which we investigated during our study. Finally, we found that papers with better replicability are also cited more often (RQ3), confirming the results of [13, 14]. Our results are presented in detail in Sect. 3.

While we do not claim our results to represent a complete image of how researchers adopt the FAIR principles of good scientific practice and deal with the replicability of experimental results in our field of interest (see Sect. 4 for a discussion of major threats to validity), we see our findings as an alarming signal. Given that tools are still listed among the major obstacles to a more widespread adoption of model-based principles in practice [15], we need to overcome this “replicability problem” in order to make scientific progress. We are strongly convinced that this can only be achieved as a community effort. The discussion in Sect. 5 is meant to serve as a starting point for this, primarily based on the lessons learnt from our study. Finally, we review related work in Sect. 6, and Sect. 7 concludes the paper.

This paper is a revised version of our previous conference paper [16], providing the following extensions:

  • In our previous work, we only investigated the replicability in principle of the experimental results presented in the research papers compiled by our systematic literature search. In this extended version, for each research paper which was deemed to be replicable in principle, we follow up with an actual attempt to replicate the results in a replication study (C1.4).

  • In the extended version, we are not only interested in the origins of the models used as experimental subjects, but also in their basic characteristics such as size and active maintenance span (C2.3).

  • We answer the new RQ3, which analyzes whether the replicability of experimental results correlates with the impact of a paper. The rationale behind this is to seek evidence whether carefully dealing with replicability of experimental results may positively influence the impact of the underlying research.

Fig. 1 Digital libraries and corresponding search strings used to obtain an initial selection of research papers

2 Research methodology

We conduct a systematic literature review in order to compile a list of relevant papers from which we extract the data to answer our research questions RQ1 and RQ2. Our research methodology is based on the basic principles described by Kitchenham [17]. Details of our search process and research paper selection are described in Sect. 2.1. To answer RQ3, we reuse the relevant papers identified by the systematic literature review together with their replicability classification from RQ1, and compare the replicability of a paper to its impact. Section 2.2 is dedicated to our data extraction policy, structured along a refinement of our overall research questions.

2.1 Search process and research paper selection

2.1.1 Scope

We focus on research papers in the context of model-based development that report on the development of novel methods, techniques or algorithms for managing or processing Simulink models. Ultimately, we require that these contributions are prototypically implemented within a tool whose effectiveness has been evaluated in some form of experimental validation. Tools we consider to fall within our scope support typical tasks in model-based development, such as code generation [18] or model transformation [19], clone detection [20], test generation [21] and prioritization [22], model checking [23] and validation [24], model slicing [25] and fault detection [26]. In contrast, we exclude model-based solutions that use Simulink for solving a specific problem in a particular domain, such as solar panel array positioning [27], motor control [28], or wind turbine design [29].

2.1.2 Databases and search strings

As illustrated in Fig. 2, we used the digital libraries of ACM, IEEE, ScienceDirect, and dblp to obtain an initial selection of research papers for our study. These platforms are highly relevant in the field of model-based development and have been used in systematic literature reviews on model-based development such as [30] and [31]. By using these four different digital libraries, we are confident that we capture a good snapshot of relevant papers.

According to the scope of our study, we developed the search strings shown in Fig. 1. We used IEEE’s and ACM’s search features to select publications based on keywords in their abstracts. Some of the keywords are abbreviated using the wildcard symbol (*). Since the wildcard symbol is not supported by the query engine of ScienceDirect [32], we slightly adapted these search strings for querying ScienceDirect. The same applies to dblp [32], where we also included the keyword “method” to obtain more results. To compile a contemporary and timely representation of research papers, we filtered all papers by publication date and kept those published between January 1, 2015, and February 24, 2020. With these settings, we found 625 papers on IEEE, 88 on ACM, 214 on ScienceDirect and 15 on dblp.

Fig. 2 Overview of the search process and research paper selection. Numbers of included research papers are shown for each step. After the initial query results obtained from the digital libraries of ACM, IEEE, ScienceDirect and dblp, the study has been performed by two researchers, referred to as R1 and R2

Using the bibliography reference manager JabRef, these 942 papers were first screened for clones. Then, we sorted the remaining entries alphabetically and deleted all duplicates sharing the same title. As illustrated in Fig. 2, 912 papers remained after the elimination of duplicates.
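The following sketch illustrates such a title-based de-duplication step. It is not the authors’ JabRef workflow; the file names, the naive entry splitting and the title-normalisation rule are assumptions for illustration only.

    import re

    def dedup_bib(in_path="merged.bib", out_path="deduplicated.bib"):
        """Keep only the first entry per normalised title in a merged BibTeX export."""
        text = open(in_path, encoding="utf-8").read()
        # Naive split: every entry starts with "@type{key,"; good enough for a rough screening.
        entries = ["@" + chunk for chunk in text.split("@") if chunk.strip()]
        seen, kept = set(), []
        for entry in entries:
            match = re.search(r'\btitle\s*=\s*[{"](.+?)[}"]\s*,?\s*$', entry,
                              flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
            # Normalise the title: drop braces, collapse whitespace, ignore case.
            key = re.sub(r"[{}\s]+", " ", match.group(1)).strip().lower() if match else entry
            if key not in seen:
                seen.add(key)
                kept.append(entry)
        with open(out_path, "w", encoding="utf-8") as out:
            # Rough counterpart of the alphabetical sorting step described above.
            out.write("\n".join(sorted(kept, key=str.lower)))
        return len(entries), len(kept)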

2.1.3 Inclusion and exclusion criteria

From this point onwards, the study was performed by two researchers, referred to as R1 and R2 in the remainder of this paper (cf. Fig. 2).

Of the 912 papers (all written in English), R1 and R2 read the title and abstract to see whether a paper falls within our scope. Both researchers had to agree on a paper being in scope in order to include it. R1 and R2 classified 92 papers as in scope, with an inter-rater reliability, measured in terms of Cohen’s kappa coefficient [17, 33], of 0.86. To foster consistent handling, R1 and R2 classified the first 20 papers together in a joint session and reviewed differences again after 200 papers.
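For reference, Cohen’s kappa for two raters can be computed as follows. This is a generic sketch of the textbook formula, not the authors’ tooling, and the example labels are made up.

    from collections import Counter

    def cohens_kappa(ratings1, ratings2):
        """Cohen's kappa for two raters over the same items (labels such as "yes"/"no"/"unsure")."""
        assert len(ratings1) == len(ratings2) and ratings1
        n = len(ratings1)
        p_observed = sum(a == b for a, b in zip(ratings1, ratings2)) / n
        counts1, counts2 = Counter(ratings1), Counter(ratings2)
        p_chance = sum(counts1[label] * counts2[label]
                       for label in set(counts1) | set(counts2)) / (n * n)
        return (p_observed - p_chance) / (1 - p_chance)

    # Example: two raters agreeing on 2 of 3 "in scope" decisions yield kappa of about 0.4.
    print(cohens_kappa(["in", "in", "out"], ["in", "out", "out"]))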

Next, R1 and R2 read the abstracts and checked whether a paper mentions some form of evaluation of a presented tool. Because such hints may be only briefly mentioned in the abstract, we included papers where either R1 or R2 gave a positive vote. As a result of this step, the researchers identified 79 papers to be in scope and with some kind of evaluation.

We then excluded all papers for which we could not obtain the full text. Our university’s subscription and the social networking site ResearchGate provided us with 45 full-text papers. In addition, we found 5 papers on personal pages and obtained 28 papers in personal correspondence. We did not manage to obtain the full text of 3 papers in any of these ways. In sum, 76 papers remained after this step.

Finally, we read the full text to find out whether there was indeed an evaluation, as indicated in the abstract, and whether Simulink models were used in that evaluation. We excluded 10 full papers without such an evaluation and one short paper which we considered to be too unclear about its evaluation. For this last step, R1 and R2 resolved all differences in classification: the papers concerned were read a second time to decide jointly about their inclusion or exclusion. We did this so that R1 and R2 could work with one consistent set for the data extraction. After all inclusion and exclusion steps, R1 and R2 collected 65 papers which were to be analyzed in detail in order to extract the data for answering our research questions.

2.2 Refinement of research questions and data extraction

In order to answer our research questions, R1 and R2 extracted data from the full text of all the 65 papers selected in the previous step. To that end, we refined our overall research questions into criteria which are supposed to be answered in a straightforward manner, typically by a classification into “yes”, “no”, or “unsure”.

2.2.1 RQ1: Are the experimental results of evaluating tools supporting model-based development with Simulink replicable?

To answer RQ1, we start with an investigation of the accessibility of models and tools, which are basic prerequisites for replicating experimental results, followed by full replication studies provided these basic prerequisites are fulfilled.

Accessibility of the models We assume that the effectiveness of a tool supporting model-based development can only be evaluated using concrete models serving as experimental subjects. These subjects, in turn, are a necessary precondition for replicating experimental results. They should be accessible as digital artifacts for further inspection. In terms of Simulink, this means that a model should be provided as a *.mdl or *.slx file. Models that are only depicted in the paper may be incomplete, e.g., due to parameters that are not shown in the main view of Simulink, sub-systems which are not shown in the paper, etc.

The aim of C1.1 is to find out whether all models which are required for the sake of replication are accessible:

C1.1: Are all models accessible?

The accessibility of models can only be checked if the paper provides some hint of how to access them. For a given paper, we thus read its evaluation section closely, and also looked at footnotes, the bibliography, as well as the very start and end of the paper. In addition, we did a full-text search for the keywords “download”, “available”, “http” and “www.”. Next, we checked whether the given links indeed worked and whether models could be found there. For all papers, a positive answer to C1.1 requires that each of the models used in the paper’s evaluation falls into one of the following categories:

  • There is a non-broken hyperlink to an online resource where the Simulink model file can be obtained from.

  • There is a known model suite or benchmark comprising the model, such as the example models provided by Simulink.

  • The model is taken from another, referenced research paper. In this case, we assume it to be accessible without checking the original paper.

Accessibility of the tool Next to the models, the actual tool being presented in the research paper typically serves as the second input to replicate the experimental results. In some cases, however, we expect that the benefits of a tool can be shown “theoretically”, i.e., without any need for actually executing the tool. To that end, before dealing with accessibility issues, we assess this general need in C1.2:

C1.2: Is the tool needed for the evaluation?

We read the evaluation section to understand whether the tool needs to be executed in order to emulate the paper’s evaluation. For those papers for which C1.2 is answered with “yes”, we continue with C1.3. All papers answered with “no” are treated in C1.3 as if they provided their tool.

Similarly to our investigation of the accessibility of models, we also assess if a paper provides access to their presented tool:

C1.3: Is the tool accessible?

In contrast to the accessibility of models, which we assume to be described mostly in the evaluation section, we expect that statements on the accessibility of a tool presented in a given research paper may be spread across the entire paper. This means that the information could be “hidden” anywhere in the paper, without us being able to find it in a screening process. To reduce oversights, we did full-text searches for the keywords “download”, “available”, “http” and “www.”. If a tool was named in the paper, we also did full-text searches for its name. A tool was deemed accessible if a non-broken link to some download option was provided. If third-party tools are required, we expected some reference on where they can be obtained. We considered Simulink and any of its toolboxes as pre-installed and thus accessible by default.

Replicability studies For those papers where we determined the models and tools to be publicly available, we also investigate whether the experiments are actually replicable by us or not:

C1.4: Are the experiments replicable?

Following the general meaning of replicability used throughout this paper, an experiment was deemed replicable if its results can be replicated by researchers other than the authors of the paper. To that end, all replication studies were conducted by two graduate students of our department, R2 and R3, both holding a Bachelor’s degree in computer science. As the replication studies were only done for 8 papers, we report on our experience for each of these studies from a qualitative point of view.

2.2.2 RQ2: From where do researchers acquire Simulink models used as experimental subjects and what are their basic characteristics?

Next to the accessibility of models as part of RQ1, we are interested in where researchers acquire Simulink models for the sake of experimentation and what their basic characteristics are, such as size and active maintenance span. By collecting these insights, researchers in need of models for analysis or tool validation may emulate successful ways of obtaining models. In order to learn more about the context of a model or to get an updated version, it may be useful to contact the model creator, which motivates C2.1:

C2.1: Are all model creators mentioned in the paper?

By the term “creator” we do not necessarily mean an individual person. Instead, we consider model creation on a more abstract level, which means that a model creator could also be a company named in the paper or another referenced research paper. If the creators of all models were named, we answered C2.1 with “yes”.

Next to the model creator, C2.2 dives more deeply into investigating a model’s origin:

C2.2: From where are the models obtained?

C2.2 is one of our sub-research questions which cannot be answered by our usual “yes/no/unsure” scheme. Possible answers were “researchers designed model themselves”*, “generator algorithm”, “mutator algorithm”, “industry partner”*, “open source”, “other research paper”*, “Simulink standard example”*, “multiple” and “unknown”. The categories marked with an asterisk (*) were also used in [31]. As opposed to us, they also used the category “none”, which we did not have to consider due to our previous exclusion steps. The category “multiple” was used whenever two or more of these domains were used in one paper. Note that even if C2.1 was answered with “no”, we may still be able to answer this question. For example, if a model was acquired from a company which is not named in the paper (e.g., due to a non-disclosure agreement), we may still be able to classify it as coming from an industry partner.

Finally, with C2.3 we give an outline about the basic characteristics of the models that were used in the experiments:

C2.3: What are the basic characteristics of the experimental models?

With this, we are interested in what kinds of models researchers use: Are the models small toy examples created in a one-shot manner for the sake of illustration, or are they bigger and actively maintained over a certain period of time?

Fig. 3 A sample Simulink model showing Inport/Outport, Add and Subsystem blocks connected by signal lines. It computes the stopping distance from a car’s velocity by summing up reaction distance and braking distance. The details of the computation of reaction distance and braking distance are abstracted from in this view

Simulink models are block diagrams used to model dynamical systems, where computing blocks are connected by signal lines (see Fig. 3 for a sample model). Blocks of various kinds (e.g., Sum, Logic, Switch, etc.) can apply transformations on their incoming signals, thereby producing modified outgoing signals. Inport and Outport blocks are specific blocks connecting a model with its surrounding context. Another special kind of block is the Subsystem block, which can be used to modularize a Simulink model in a hierarchical manner. For more details, we refer the reader to Simulink introductions, e.g., [34]. We used basic MATLAB scripts to compute the following model metrics in order to characterize the models used as experimental subjects (an illustrative sketch of an equivalent computation is given after the list):

  • Number of blocks,

  • Number of unique block types,

  • Number of subsystems,

  • Length of active maintenance span (creation date to last save date of a model).

We also counted how many models were provided in the replication packages.
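The sketch below approximates such a metric computation; it is not the authors’ MATLAB scripts. It assumes the standard *.slx package layout, in which the block diagram is stored as simulink/blockdiagram.xml inside a ZIP archive; the location of the creation and last-save dates varies across Simulink releases, so that part is best-effort, and the property names used here are assumptions.

    import zipfile
    import xml.etree.ElementTree as ET
    from collections import Counter

    def slx_metrics(path):
        """Rough metrics for a *.slx file: block count, unique block types, subsystems, dates."""
        with zipfile.ZipFile(path) as archive:
            root = ET.fromstring(archive.read("simulink/blockdiagram.xml"))
        block_types = Counter(block.get("BlockType") for block in root.iter("Block"))
        # "Created" / "LastModifiedDate" model properties, if present, approximate the
        # active maintenance span (creation date to last save date).
        dates = {p.get("Name"): p.text for p in root.iter("P")
                 if p.get("Name") in ("Created", "LastModifiedDate")}
        return {
            "blocks": sum(block_types.values()),
            "unique_block_types": len(block_types),
            "subsystems": block_types.get("SubSystem", 0),
            "dates": dates,
        }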

2.2.3 RQ3: Does the replicability of experimental results correlate with the impact of a paper?

With this RQ, we are interested in whether a paper’s impact may be influenced by its handling of replicability. We thus investigate whether papers that rank better in terms of replicability (using our results of C1.1, C1.3 and C1.4) have a higher relative impact than the other papers. While our analysis cannot show an actual cause-and-effect relationship between replicability and impact, there are other studies [13, 14] which let us hypothesize about a possible connection between the two.

We compute a paper’s relative impact with the normalized citation score of Waltman et al. [35], and use a paper’s citation count as the basis of this score. We collected the citation counts of each included paper from Google Scholar and ResearchGate.
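One plausible reading of this normalization is sketched below under an assumed data layout: a paper’s citation count divided by the average citation count of its peer group, where the peer group consists of the papers from the same publication year (cf. Sect. 4). This is an illustration, not the authors’ implementation.

    from statistics import mean

    def relative_impact(papers):
        """papers: list of dicts with 'year' and 'citations'; returns one score per paper."""
        by_year = {}
        for paper in papers:
            by_year.setdefault(paper["year"], []).append(paper["citations"])
        year_average = {year: mean(counts) for year, counts in by_year.items()}
        return [paper["citations"] / year_average[paper["year"]]
                if year_average[paper["year"]] else 0.0
                for paper in papers]

    # Example: a paper with 12 citations whose same-year peers average 6 citations
    # receives a relative impact of 2.0; an uncited paper receives 0.0.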

To strengthen our confidence in our computed value of the relative impact (see threats to validity), we compared it to the Scopus Field-Weighted Citation Impact whenever Scopus provides it for a paper. We did not use Scopus for our computation because it did not list impacts for 4 papers and because its computation of the Field-Weighted Citation Impact is opaque. Finally, we group our included papers into 5 groups according to our classification of C1.1/C1.3/C1.4: no replication package, only software provided, only models provided, both provided, and replicable. Comparisons then take place based on the average relative impact of the groups.

3 Results

In this section, we synthesize the results of our study. All paper references found, raw data extracted and calculations of results synthesized can be found in the replication package of this paper [36]. The package includes all the Simulink models we found during our study.

Table 1 Detailed summary of replicability in principle
Table 2 List of papers whose experiments we examined for replicability

3.1 Are the experimental results of evaluating tools supporting model-based development with Simulink replicable? (RQ1)

First we summarize the results for the criteria C1.1 through C1.4, before we draw our conclusions for answering the overall research question RQ1. Table 1 provides details for C1.1 through C1.3, while Table 2 shows the results for C1.4.

It can be seen that R1 and R2 generally had a high inter-rater reliability and agreed that most papers did not make their models accessible. Almost all papers’ evaluations required the tool to be executed. Finally, we were unsure for the majority of the papers whether they provide access to their tools.

While answering C1.3, R1 and R2 first used the additional category “no”, but this produced an unsatisfactory inter-rater reliability of only 0.42. To remedy this, we revised the answers, merging the categories “no” and “unsure”, acknowledging that R1 and R2 interpreted “no” and “unsure” too differently.

C1.4: Are the experiments replicable?

Table 2 summarizes those papers for which the models and tools were determined to be publicly available by R1 or R2, and for which R2 and R3 attempted to fully replicate the experimental results. The replication studies were conducted during October and November 2020 with two different hardware/software setups. Our main setup was a desktop PC with 64 GB RAM and an AMD Ryzen 7 3800X CPU, running Windows 10 and MATLAB R2020b. In case of a replication failure, we retried on a server with 1 TB RAM and four Intel Xeon E7-4880 CPUs, running SUSE Leap 15 and MATLAB R2019a. When parts of the replication package were inaccessible, we contacted the authors to provide us with access to models and tools for replication purposes. We will use the numbers listed in the first column of Table 2 to refer to each of the papers.

Paper No.1 provided a Docker container for all its files. Still, R2 and R3 were not able to replicate the experiments because they lacked knowledge of the Zélus tool; more explicit instructions on how to handle it would have been needed. The authors of the paper acknowledge that the equivalence between the Zélus blocks and Simulink blocks is hard to prove.

Paper No.2 offers detailed documentation on how to perform the experiments; however, documentation on how to edit the configuration file for experimental replication was missing. Both R2 and R3 tried to map the configuration parameters to those given in the paper, but the replication still failed for the paper’s RQ2. A more detailed description would be necessary, as well as instructions on how to acquire the CyFuzz data for comparison, which was not provided by the authors.

Paper No.3 could not be replicated because the link to the tool was not accessible at the time of conducting our replication studies.

Paper No.4 was not replicable. At the time of conducting our replication studies, we were unable to gain access to the necessary QVtrace package. Similarly, no access to the author’s Google Drive was granted to us.

Fig. 4 Are studies of model-based development replicable in principle?

Fig. 5 C2.1: Are all model creators mentioned in the paper?

Paper No.5 offers all software components and models, including minimal documentation. While trying to replicate the experiments, we received numerous warnings and errors and were able to generate simulation times for only three out of four models. It was unclear how to deal with the simulation times, since the paper and documentation do not offer explicit instructions on how to aggregate the results such that they are comparable to those offered by the authors.

Paper No.6 is a revised version of paper No.5. It adds two more models to the experimental evaluation. In contrast to the first version of the paper, the provided ReadMe claims that scripts to execute the experiments are now included. However, we could only find one of the scripts, which was not executable. Thus, we were able to obtain simulation times for four out of six models, which were again not comparable with the authors’ data.

The online tool described in paper No.7 was not accessible. The authors instead provided us with a similar offline Docker container. Even though the documentation was very detailed, the important parameter \(\tau \) was missing. Furthermore, the models mentioned in the paper were not accessible to us anymore. Nevertheless, we were able to generate models with the tool, making these experiments the closest to replication among those we examined.

Paper No.8 was not replicable, as implementation files were missing. As their model “Gearbox” and the Reactis tool are publicly available, only the very first step of their experiments could be replicated.

Aggregation and summary of the results To answer RQ1, we first assess whether the experiments described in the studied papers are replicable in principle. Therefore, we combine C1.1 and the revised answers of C1.3. For those papers where there was no tool needed (C1.2), C1.3 was classified as “yes”. The formula we used is “If C1.1 = ’no’ then ’no’ else C1.3”. This way, on average, 6 (R1: 5, R2: 7) papers have been classified as replicable in principle, 50.5 (R1: 52, R2: 49) as not replicable, and 8.5 (R1: 8, R2: 9) for which we were unsure (see Fig. 4), with Cohen’s kappa of 0.67. In sum, 8 papers were classified to be replicable in principle by at least one of the researchers.
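The combination rule quoted above can be restated compactly as follows; this is an illustrative sketch, with the C1.2 convention applied first and the labels “yes”/“no”/“unsure” as used throughout this section.

    def replicable_in_principle(c1_1, c1_2, c1_3):
        """Combine the per-paper answers into the 'replicable in principle' classification."""
        if c1_2 == "no":          # no tool needed: treat the tool as provided
            c1_3 = "yes"
        return "no" if c1_1 == "no" else c1_3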

Fig. 6 C2.2: From where are the models obtained?

Table 3 Basic characteristics of the models used as experimental subjects in the papers which were found by our systematic literature search

We were not able to fully replicate any of the experiments of those 8 papers, and achieved partial replicability for four of them. Three tools were not completely available to us due to denied access, timeouts and a missing implementation. One paper could not be replicated due to incomplete documentation. Four software setups were closely examined and principally functional, but the experiments could not be fully replicated due to incomplete documentation, errors or broken links to the models used as experimental subjects.


3.2 From where do researchers acquire Simulink models used as experimental subjects and what are their basic characteristics? (RQ2)

C2.1 Are all model creators mentioned in the paper? As can be seen in Fig. 5, of the 65 papers investigated in detail, on average, 44 (R1: 43, R2: 45) papers mention the creators of all models. In contrast, no such information could be found for an average of 20.5 (R1: 22, R2: 19) papers. Finally, there was one paper for which R2 was not sure, leading to an average value of 0.5 (R1: 0, R2: 1) for this category. In sum, this question was answered with an inter-rater reliability of 0.79.

C2.2 From where are the models obtained? As shown in Fig. 6, there is some variety in the models’ origins. Only 3% used open-source models, 8% used models included in Simulink or one of its toolboxes, 12% cited other papers, 13% built their own models, and 18% obtained models from industry partners. A quarter of all papers used models coming from two or more different sources. For 19% of the papers, we could not figure out where the models come from. This mostly coincides with those papers for which we answered “no” in C2.1. For some papers, we were able to classify C2.2 even though we answered C2.1 with “no”. For example, we classified the origin of a model of [44] as “industry partner” based on the statement “a real-world electrical engine control system of a hybrid car”, even though no specific information about this partner was given. C2.2 was answered with a Cohen’s kappa of 0.68.

Fig. 7 Overview of the most frequently used kinds of blocks in the Simulink models. The horizontal axis lists Simulink block types, and the vertical axis shows how many models used this type of block

An interesting yet partially expected perspective arises from combining C2.2 and C1.1. None of the models obtained from an “industry partner” are accessible. Three papers which we classified as “multiple” in C2.2 did provide industrial models, though: [23] provides models from a “major aerospace and defense company” (name not revealed due to a non-disclosure agreement), while [40] and [41] use an open-source model of electro-mechanical braking provided by Bosch in [45]. Finally, [46] and [47] use models for an advanced driver assistance system by Daimler [48], which can be inspected on the project website.

C2.3 What are the basic characteristics of experimental models? Table 3 lists basic characteristics of the models used as experimental subjects in the papers investigated by our study. For all of the characteristics, the standard deviation was high, and the distributions of the measures are right-skewed (median smaller than mean). Most of the papers used only a few models for their experiments, as the median is only three. We found the models themselves to be not mere toy examples, as there were 26 subsystems and 236 blocks per model in the median. A further indication of this is that the median model uses 12 different block types. Fig. 7 gives an impression of the 25 most frequently used blocks. The three most frequently used kinds of blocks are the Inport, Outport and Subsystem blocks; these are used for the sake of modularizing a model into different parts. All this shows some degree of sophistication in the models. Certainly, a “degree of sophistication” is not objectively measurable, but Simulink experts we talked to in private conversation reported that typical industrial models often have 1000-10000 blocks; most of the models we found in this study are smaller. Finally, we determined the maintenance spans of the models, which show the highest degree of deviation. There are many models with a maintenance span of “zero” days, but we also observe models with maintenance spans of multiple years.


3.3 Does the replicability of experimental results correlate with the impact of a paper? (RQ3)

To compute the relative impact of a paper, we collected the citation counts on Google Scholar and ResearchGate. Google Scholar provided slightly higher counts, with 8.6 average citations versus 5.8 on ResearchGate. Because of the complete citation record and the generally higher values, we used the results of Google Scholar for further computations of the relative impact. A histogram of the relative impact of the papers is shown in Fig. 8.

Fig. 8 Histogram of the relative impact of the papers. An average paper has a relative impact of 1.00, which means that it is cited as often as the average of its peers. One paper (the maximum) is cited 5.93 times as often as its peers, and some (the minima) were never cited and thus have a relative impact of 0.00

We compared our computed relative impacts with the Scopus Field-Weighted Citation Impact by computing the correlation between the two measures with SPSS. Spearman’s rank correlation coefficient is 0.838 with significance \(p<0.001\), and Kendall’s rank correlation coefficient is 0.668 with significance \(p<0.001\). We also applied Wilcoxon’s signed-rank test, which does not reject the hypothesis that the median of the differences between our impact and the Scopus impact is zero (\(p=0.074 > \alpha = 0.05\)). Finally, the average value of our relative impact is 1.0 vs. 1.08 for the Scopus Field-Weighted Citation Impact. Given these similarities between the two measures, we concluded that our relative impact can be used for further analysis.
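For readers without SPSS, the same comparison can be sketched with SciPy; the two input vectors are assumed to be paired per paper, and this is an illustration rather than the exact analysis script.

    from scipy.stats import spearmanr, kendalltau, wilcoxon

    def compare_impacts(our_impact, scopus_impact):
        """Rank correlations and signed-rank test between two paired impact vectors."""
        rho, p_rho = spearmanr(our_impact, scopus_impact)
        tau, p_tau = kendalltau(our_impact, scopus_impact)
        # H0 of the Wilcoxon signed-rank test: the median of the paired differences is zero.
        w_stat, p_w = wilcoxon(our_impact, scopus_impact)
        return {"spearman": (rho, p_rho), "kendall": (tau, p_tau), "wilcoxon": (w_stat, p_w)}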

Next, we grouped the papers according to our findings of C1.1 and C1.3 into the 4 groups: “nothing provided”, “only models provided”, “only software provided”, “both provided”. A paper was counted as providing an artifact if at least one researcher found that the models were provided, or at least one researcher found that the software was provided or not necessary; two “unsure” ratings did not count as provided. The average relative impacts are presented in Table 4. The 4 papers achieving partial replicability (part of “both provided”) even reached an average relative impact of 2.072. The average relative impacts of the 4 groups grow monotonically in the order listed in the table. These findings signal a positive correlation between a paper’s replicability and its relative impact. Papers that only provided their software scored higher than papers that only provided their models. This does not imply a cause-and-effect relationship, though (cf. Sect. 4).

Table 4 Average relative impacts grouped by our results of C1.1 and C1.3

4 Threats to validity

There are several reasons why the results of this study may not be representative. The major threats to validity, as well as our countermeasures to mitigate these threats, are discussed in the remainder of this section.

Our initial paper selection is based on selected databases and search strings. The initial query result may not include all papers relevant to our scope. We tried to remedy this threat by using four different yet well-known digital libraries, which is sufficient according to [49, 50], and by using wild-carded and broad keywords in the search strings.

For the first two inclusion/exclusion steps, we only considered the titles and abstracts of the papers. If papers do not describe their topic and evaluation there, they could have been missed.

It turned out to be more difficult than originally expected to find out whether a paper provides a replication package or not. One reason for this is that just scanning the full text of a paper for occurrences of Simulink or the name of the tool is not very useful here, since there are dozens of matches scattered all over the paper. In fact, almost every sentence could “hide” a valuable piece of information. We tried to remedy this problem by searching the *.pdf files for the keywords mentioned in Sect. 2.2. We also merged our answers of “no” and “unsure” for C1.3 in reflection of this problem.

We were very strict in rating the accessibility, i.e., we expected not to have to contact a paper’s authors for getting model or tool access. This may have lowered the number of papers we deemed to be replicable in principle.

More generally, the data collection process was done by two researchers, each of which may have overlooked important information or misinterpreted a concrete criterion. We tried to mitigate these issues through intermediate discussions. Furthermore, we calculated Cohen’s kappa coefficient to better estimate the reliability of our extracted data and synthesized results.

Furthermore, the replication attempts for the experiments of the 8 papers in C1.4 were only conducted by two researchers. The researchers did not have complete expert knowledge in the fields of the analyzed papers, which could have caused difficulties in reproducing the experiments. This, together with our strict grading of replicability, may have led to a lower number of papers deemed replicable in principle (only 8 of 65) and of papers we could fully replicate (none of the 8).

We did not investigate the papers’ Simulink versions or hardware setups for our assessment of replicability in principle. This is because, in many cases, newer versions of Simulink or our own hardware setups would produce equivalent results.

Our methodology section does not present a separate quality assessment, which is typical in systematic literature review studies [51]. Thus, our results could be different if only a subset of high-quality papers, e.g., those published in the most prestigious publication outlets, were considered. Nonetheless, a rudimentary quality assessment (paper’s language, experimental evaluation instead of “blatant assertion” [5]) was done in our inclusion/exclusion process.

For the analysis of the Simulink models in C2.3, we used self-written MATLAB scripts, which could be faulty. In particular, we cannot rule out that the maintenance span was computed incorrectly for some models, as many models had a maintenance span of 0 days. This could occur naturally for a very short-lived model or for a model created and saved instantaneously by an automatic script. Another possibility is that the date feature of Simulink is buggy for some models.

As already indicated, there is a threat to conclusion validity for answering RQ3, since there is not necessarily a cause-and-effect relationship between the replicability of experimental results and the relative impact of a paper. Another possible cause could be the reputation or impact factor of the publication venue. If this is the case, however, then our results may point to higher replication standards at these venues. Moreover, our computed relative impact score is based on only 65 papers grouped by their publication year into 5 peer groups of size 9, 20, 15, 15 and 6. This is why we compared our relative impact with the Scopus metric for average, correlation and distribution and found it to be highly similar. Another mitigating factor is that the papers were manually selected in our systematic literature review. This ensures that papers are compared only with highly relevant peers in each peer group.

5 Discussion

Limited accessibility of both models and tools Although generally accepted, the FAIR guiding principles of good scientific practice are hardly adopted by current research on tools for model-based development with Simulink. For the 65 papers which have been selected for an in-depth investigation in our systematic literature review, we found that only 22% of the models and 31% of the tools required for replicating experimental results are accessible. Thus, future research that builds on published results, such as larger user or field studies, studies comparing their results with established results, etc., is hardly possible, which ultimately limits scientific progress in general.

Difficulties regarding replicability We found none of the 8 thoroughly examined papers to be fully replicable. This was largely due to insufficient documentation of the experimental setups, parameters and tools, or to missing parts of implementations, tools or models. We suggest providing Docker containers or similar for ease of experimental evaluation. These can come pre-installed and pre-configured; this way, the replicator simply has to download the container and start a script.

Replicability and relative impact We can confirm the finding that papers publishing their data are cited more often [13, 14, 52]. Our results further show that publishing software, or both data and software, has a higher correlation with citation counts than publishing only the experimental datasets.

Open-source mindset rarely adopted One general problem is that the open-source mindset seems to be rarely adopted in the context of model-based development. Only 3% of the papers considered by our study obtained all of their models from open-source projects. In contrast, 18% of the studied papers obtained the models used as experimental subjects from industry partners; the accessibility of these models is severely limited by confidentiality agreements.

Selected remarks from other papers These quantitative findings are also confirmed by several authors of the papers we investigated during our study. We noticed a number of remarks w.r.t. the availability of Simulink models for evaluation purposes. Statements like “To the best of our knowledge, there are no open benchmarks based on real implementation models. We have been provided with two implementation models developed by industry. However, the details of these models cannot be open.” [53]; “Crucial to our main study, we planned to work on real industrial data (this is an obstacle for most studies due to proprietary intellectual property concerns).” [46]; “[...] most public domain Stateflows are small examples created for the purpose of training and are not representative of the models developed in industry.” [54]; or “Such benchmarking suites are currently unavailable [...] and do not adequately relate to real world models of interest in industrial applications.” [24] discuss the problem of obtaining real-world yet freely accessible models from industry. Other statements such as “[...] as most of Simulink models [...] either lack open resources or contain a small-scale of blocks.” [55] or “[...] no study of existing Simulink models is available [...].” [38, 56] discuss the lack of accessible Simulink models in general.

Reflection of our own experience In addition, the findings reflect our own experience when developing several research prototypes supporting model management tasks, e.g., in terms of the SiDiff/SiLift project [57, 58].

Likewise, we made similar observations in the SimuComp project. Companies want to protect their intellectual property and do not want their (unobfuscated) models to be published.

In contrast to the lack of available models, we do not have any reasonable explanation for the limited accessibility of tools. Most of the tools presented in research papers are not meant to be integrated directly into productive development environments, but merely serve as experimental research prototypes, which should not be affected by confidentiality agreements or license restrictions.

Suggestions based on the lessons learnt from our study. While the aforementioned problems are largely a matter of fact and cannot be changed in a short-term perspective, we believe that researchers could do a much better job in sharing their experimental subjects. Interestingly, 12% of the studies obtain their experimental subjects from other papers, and 13% of the papers state that such models have been created by the authors themselves. Making these models accessible is largely a matter of providing adequate descriptions.

However, such descriptions are not always easy to find within the research papers which we considered in our study. Often, we could not find online resources for models or software. It should be made clear where to find replication packages. In some cases, a link to the project’s website was provided, but we could not find the models there. To prevent this, we suggest direct links to downloadable files or very prominent links on the website. The web resource’s language should also match the paper’s language: e.g., the project site of [59] is in German. Four papers referenced pages that did not exist anymore, e.g., a private Dropbox account. These issues can easily be addressed by a more thorough archiving of a paper’s replication package.

We also suggest naming or citing the creators of the models, so that they can be contacted for a current version or context data of a model. In this respect, the results of our study are rather promising. After all, model creators have been mentioned in 68% of the studied papers, even if the models themselves were not accessible in a considerable number of these cases.

Towards larger collections of Simulink models Our study not only reveals the severity of the problem; it may also be considered a source of information about publicly available models and tools. In fact, we found a number of papers that did publish their models. This includes models with a degree of sophistication, e.g., models with more than a few hundred blocks. These models could be reused, updated or upgraded in other studies. We provide all digital artifacts that were produced in this work (BibTeX files of all papers found, exported spreadsheets and retrieved models or paper references) online for download at [36]. Altogether, we downloaded 517 Simulink models. We also found 32 referenced papers from which models were drawn. These models could be used by other researchers in their evaluations. Further initiatives for providing a corpus of publicly available models, including a recent collection of Simulink models, will be discussed in the next section.

6 Related work

The only related secondary study we are aware of was conducted by Elberzhager et al. [31] in 2013. They conducted a systematic mapping study in which they analyzed papers about quality assurance of Simulink models. This is a sub-scope of our inclusion criteria (see Sect. 2.1). One of their research questions was “How are the approaches evaluated?”. They reviewed where the models in an evaluation come from and categorized them into “industry example”, “Matlab example”, “own example”, “literature example” and “none”. We include more categories (see Sect. 2.2), apart from “none”. All papers that would fall into their “none” category were excluded by us beforehand. Compared to their findings, we categorized 2 papers as using open-source models, one as using a generator algorithm and 16 as drawing on multiple domains. Furthermore, we found 11 papers where the domain was not specified at all. They also made an observation related to our RQ1: “In addition, examples from industry are sometimes only described, but not shown in detail for reasons of confidentiality.”

The lack of publicly available Simulink models inspired the SLforge project to build the only large-scale collection of public Simulink models [38, 60] known to us. To date, however, this corpus has only been used by the same researchers in order to evaluate different strategies for testing the Simulink tool environment itself (see, e.g., [61]). Another interesting approach was taken by Sanchez et al. [62]. They used Google BigQuery to find a sample of the largest available Simulink models on GitHub. In sum, they downloaded 70 Simulink models larger than 1 MB from GitHub. Some authors created datasets of Simulink models as benchmarks. For example, Bourbouh et al. [63] compiled a set of 77 Stateflow models to demonstrate the effectiveness of their tool. Another benchmark of Simulink models was created as part of the Applied Verification for Continuous and Hybrid Systems (ARCH) workshop [64]. Regarding works describing characteristics of Simulink models, Dajsuren et al. [65, 66] reported coupling and cohesion metrics for ten industrial Simulink models. They measured the inter-relation of subsystems as well as the inter-relation of blocks within subsystems.

In a different context focusing on UML models only, Hebig et al. [67] have systematically mined GitHub projects to answer the question when UML models, if used, are created and updated throughout the lifecycle of a project. A similar yet even more restricted study on the usage of UML models developed in Enterprise Architect has been conducted by Langer et al. [68].

Apart from individual studies, there is an increasing community effort towards the adoption of open science principles within the field of software and systems engineering. One of the goals of such efforts is to create the basis for replicating experimental results. Most notably, a number of ACM conferences and journals have established formal review processes in order to assess the quality of digital artifacts associated with a scientific paper, according to ACM’s “Artifact Review and Badging” policy. Artifacts may receive three different kinds of badges, referred to as “Artifacts Evaluated”, “Artifacts Available”, and “Results Validated”. In terms of our study, we investigated the accessibility of digital artifacts in C1.1 and C1.3, which is an essential prerequisite for receiving an “Artifacts Available” badge. The notion of replicability used in this paper, and particularly addressed in C1.4, is one of the two possible forms in which experimental results may receive a “Results Validated” badge. As a natural side effect of conducting our replication studies, we also investigated the documentation, completeness, consistency and exercisability of artifacts, which are the requirements for receiving an “Artifacts Evaluated” badge.

7 Conclusion

In this paper, we investigated to which degree the principles of good scientific practice are adopted by current research on tools for model-based development, focusing on tools supporting Simulink. To that end, we conducted a systematic literature review and analyzed a set of 65 relevant papers on how they deal with the accessibility of experimental replication packages.

We found that only 31% of the tools and 22% of the models used as experimental subjects are accessible. Given that both artifacts are needed for a replication study, only 9% of the tool evaluations presented in the examined papers can be classified as replicable in principle. We found none of those papers to be fully replicable and only 6% of them to be partially replicable. Moreover, only a minor fraction of the models is obtained from open-source projects, but some of those open-source models show a degree of sophistication and could be useful for other experimental evaluations. Altogether, we see this as an alarming signal w.r.t. making scientific progress on better tool support for model-based development processes centered around Simulink. Giving access to models and tools could also potentially result in a higher impact in the scientific community; this may serve as another motivation to take more care with replicability.

While both tools and models are essential prerequisites for replication and reproducibility studies, the latter may also serve as experimental subjects for evaluating other tools. In this regard, our study may serve as a source of information about publicly available models. Other researchers in this field have even started to curate and analyze a much larger corpus of Simulink models [60, 69]. Open-source models were found to be highly diverse in almost all metrics applied, with some being complex enough to be representative of industrial models.