1 Introduction

Many domains are subject to a vast and ever-growing number of rules and constraints stemming from sources including laws, legislation, regulations, standards, guidelines, contracts and best practices. One example is compliance in the corporate and financial sector. The Sarbanes–Oxley Act of 2002 (SOX) [55] is a federal law that defines rules in reaction to major corporate accounting scandals in the USA (e.g., Enron and WorldCom). Basel III [4] has been established in response to weaknesses in financial regulation responsible for the financial crisis in 2007/2008. Another example of heavily regulated domains is the construction industry. Compliance rules in this domain are often related to occupational safety and health. For example, certain precautions and safe practices are required if a lead contamination is present or to be presumed in buildings built before 1978 that undergo renovation (cf. United States Environmental Protection Agency’s Lead-Based Paint Renovation, Repair and Painting Rule [83]). A third example is the healthcare sector. Processes in hospitals must comply with state-of-the-art medical knowledge and treatment procedures (e.g., Rovani et al. [71]).

From cooperations with industry partners (e.g., Tran et al. [80]), their customers and other company representatives at conferences and workshops, we were able to gain valuable insights into how compliance rules are currently handled in practice. Most often, compliance documents are transformed to internal policies first. These are often described in natural language, but there is also a shift toward structured approaches like the Semantics of Business Vocabulary and Business Rules (SBVR) standard [60]. Later, these internal policies are incorporated into business process models (e.g., BPMN [59]) or other behavioral models (e.g., UML activity diagrams), and/or they are hard-coded in a programming language. That often leads to consistency problems and to poor maintainability and traceability between compliance specifications, internal policies, models and the source code. This is especially the case when compliance specifications change frequently. Additionally, practitioners report that it often takes a long time until new compliance specifications are actually supported by their software. Often the compliance rule has long been obsolete before the implementation is ready (cf. [20, 48]). Consequently, the industry shows a strong interest in approaches that are applicable in practice. Such approaches should support a comprehensible, fast and accurate adoption of compliance specifications as well as their automated enactment and verification. All modeling languages that we study in this article are well suited for automated computer-aided compliance checking or monitoring. Nonetheless, companies are still often reluctant to expose their customers or employees to such approaches. In discussions with industry partners (cf. [79, 81]), uncertainty regarding how understandable these approaches are became evident. This uncertainty was stated as one of the major reasons for the reluctance in practical adoption.

1.1 Problem statement

Most existing work on design time verification and runtime monitoring focuses on technical contributions rather than empirical contributions. From the perspective of a potential end user who has to implement compliance specifications, the understandability of an offered formal specification language appears to be a major interest. To the best of our knowledge, there are no empirical studies that investigate and compare the understandability of representative languages with respect to the formal modeling of compliance specifications. In particular, the following representative specification languages are considered in this empirical study:

  • Linear temporal logic (LTL) was proposed in 1977 by Pnueli [65]. LTL is a popular way for defining compliance rules according to Reichert and Weber [66]. In general, LTL is a widely used specification language commonly applied in model checking (cf. Cimatti et al. [12] for NuSMV, Blom et al. [9] for LTSmin, Holzmann [42] for SPIN) and runtime monitoring by non-deterministic finite automata (cf. De Giacomo and Vardi [23] and De Giacomo et al. [25]).

  • Event processing language (EPL) is the query language of the open-source complex event processing engine Esper. EPL is well suited as a representative for CEP query languages as it supports common CEP query language concepts, such as leads-to (sequence, followed-by) and every (each) operators, that are present in many CEP query languages and engines (e.g., Siddhi and TESLA [15]). Several existing studies on compliance monitoring make use of EPL (cf. Awad et al. [2], Holmes et al. [41] and Tran et al. [82]).

  • Property specification patterns (PSP) are a collection of recurring temporal patterns proposed by Dwyer et al. [27, 28]. This pattern-based approach abstracts underlying technical and formal languages, most notably LTL and CTL (Computation Tree Logic; cf. Clarke et al. [13]). Numerous existing approaches are based on PSP. Among them are the Compliance Request Language proposed by Elgammal et al. [29] and the declarative business process approach Declare proposed by Pešić et al. [61].

In previous controlled experiments carried out by Czepa and Zdun [17], the understandability of already existing formal specifications in those languages was studied. Those experiments can be seen as the first step toward studying the understandability of those languages. To further study the understandability of these languages, it is crucial to consider the modeling itself as well.

1.2 Research objectives

This empirical study has the research objective to investigate the understandability construct of representative languages with regard to the modeling of compliance specifications. The understandability construct focuses on the degree of correctness achieved and on the time spent on modeling compliance specifications.

The experimental goal using the goal template of the Goal Question Metric proposed by Basili et al. [5] is stated as follows:

Analyze LTL, PSP and EPL for the purpose of their evaluation with respect to their understandability related to modeling compliance specifications from the viewpoint of the novice and moderately advanced software engineer, designer or developer in the context/environment of the Software Engineering 2 Lab and the Advanced Software Engineering Lab courses at the Faculty of Computer Science of the University of Vienna.

Based upon the stated goal, questions concerning understandability were generated as shown in Table 1.

Table 1 Questions based upon the goal

The understandability is measured by three dependent variables, namely the syntactic correctness and semantic correctness achieved in trying to formally model compliance specifications as well as the response time. Correctness and response time are commonly used to measure the construct understandability, for example, in empirical studies by Feigenspan et al. [31] and Hoisl et al. [40]. The study design enables a more fine-grained analysis of the correctness by differentiating between syntactic and semantic correctness as suggested by numerous existing studies, such as Ferri et al. [32], Hindawi et al. [39] and Harel and Rumpe [37].

Besides the main research goal, which focuses on understandability, this work addresses subjective aspects, namely the perceived ease of application and the perceived correctness, which are measures of self-assessment and not directly related to the understandability construct.

1.3 Guidelines

This work follows the guidelines for reporting experiments in empirical software engineering by Jedlitschka et al. [45]. These guidelines integrate among others the “Preliminary guidelines for empirical research in software engineering” by Kitchenham et al. [50] and standard books on empirical software engineering by Wohlin et al. [86] and Juristo and Moreno [47]. The “Robust Statistical Methods for Empirical Software Engineering” article by Kitchenham et al. [49] had a strong impact on the statistical evaluation of the data in this article.

2 Background

This section provides a brief introduction to the specification languages used in this study. Readers already familiar with one or more of the discussed approaches may consider skipping parts of this section. Examples of compliance specifications formalized in all three representations are available in “Appendix A.” These examples are based on the experimental tasks (cf. Sect. 3.3) of this experiment.

2.1 Linear Temporal Logic (LTL)

Propositional logic is not expressive enough to describe temporal properties, so a logic called linear temporal logic (LTL) for reasoning over linear traces with the temporal operators \({\mathcal {G}}\) (or \(\square \)) for “globally” and \({\mathcal {F}}\) (or \(\lozenge \)) for “finally” was proposed by Pnueli [65]. Additional temporal operators are \({\mathcal {U}}\) for “until,” \({\mathcal {W}}\) for “weak until,” \({\mathcal {R}}\) for “release” and \({\mathcal {X}}\) (or \(\circ \)) for “next.” The meaning of these operators is described in Table 2. LTL formulas are composed of the aforementioned temporal operators, atomic propositions (the set AP) and the Boolean operators \(\wedge \) (for “and”), \(\vee \) for “or,” \(\lnot \) for “not,” \(\rightarrow \) for “implies” (cf. Baier and Katoen [3]). The weak-until operator \(\psi ~{\mathcal {W}}~\phi \) is defined as \(({\mathcal {G}}~\psi ) \vee (\psi ~{\mathcal {U}}~\phi )\).

Table 2 Informal meanings of LTL operators

An LTL formula is inductively defined as follows: For every \(a \in AP\), a is an LTL formula. If \(\psi \) and \(\phi \) are LTL formulas, then so are \({\mathcal {G}}\psi \) (or \(\square \psi \)), \({\mathcal {F}}\psi \) (or \(\lozenge \psi \)), \(\psi ~{\mathcal {U}}~\phi \), \(\psi ~{\mathcal {R}}~\phi \), \({\mathcal {X}}\psi \) (or \(\circ \psi \)), \(\psi \wedge \phi \), \(\psi \vee \phi \) and \(\lnot \psi \).

The semantics of LTL over infinite traces is defined as follows: LTL formulas are interpreted over infinite words over the alphabet \(2^{AP}\). The alphabet consists of all possible propositional interpretations of the propositional symbols in AP. \(\pi (i)\) denotes the state of the trace \(\pi \) at time instant i. \(\pi , i \vDash \psi \) means that a trace \(\pi \) at time instant i satisfies the LTL formula \(\psi \), and is defined as follows:

  • \(\pi , i \vDash a\), for \(a \in AP\) iff \(a \in \pi (i)\).

  • \(\pi , i \vDash \lnot \psi \) iff \(\pi , i \nvDash \psi \).

  • \(\pi , i \vDash \psi \wedge \phi \) iff \(\pi , i \vDash \psi \) and \(\pi , i \vDash \phi \).

  • \(\pi , i \vDash \psi \vee \phi \) iff \(\pi , i \vDash \psi \) or \(\pi , i \vDash \phi \).

  • \(\pi , i \vDash {\mathcal {X}} \psi \) iff \(\pi , i+1 \vDash \psi \).

  • \(\pi , i \vDash {\mathcal {F}} \psi \) iff \(\exists j \ge i\), such that \(\pi , j \vDash \psi \).

  • \(\pi , i \vDash {\mathcal {G}} \psi \) iff \(\forall j \ge i\), we have \(\pi , j \vDash \psi \).

  • \(\pi , i \vDash \psi ~{\mathcal {U}}~\phi \) iff \(\exists j \ge i\), such that \(\pi , j \vDash \phi \) and \(\forall k, i \le k < j\), we have \(\pi , k \vDash \psi \).

  • \(\pi , i \vDash \psi ~{\mathcal {R}}~\phi \) iff \(\forall j \ge i\), if \(\pi , j \nvDash \phi \), then \(\exists k, i \le k < j\), such that \(\pi , k \vDash \psi \).

For the definition of the semantics of LTL over finite traces, we refer the interested reader to the work of De Giacomo and Vardi [23] and De Giacomo et al. [25].
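The infinite-trace semantics listed above can be made executable for ultimately periodic (“lasso”) traces, that is, a finite prefix followed by a loop that repeats forever. The following Python sketch is an illustration under that assumption and is not part of the tooling used in this study; the tuple encoding of formulas and the names `Lasso` and `holds` are hypothetical.

```python
from typing import Iterator, List, Set, Tuple, Union

# A formula is an atomic proposition (str) or a tuple (operator, operand, ...).
Formula = Union[str, Tuple]


class Lasso:
    """Infinite trace of the form prefix . loop^omega (assumption: loop is non-empty)."""

    def __init__(self, prefix: List[Set[str]], loop: List[Set[str]]):
        if not loop:
            raise ValueError("loop must contain at least one state")
        self.states = [frozenset(s) for s in prefix + loop]
        self.loop_start = len(prefix)

    def succ(self, i: int) -> int:
        """Position that follows i; the last position jumps back to the loop start."""
        return i + 1 if i + 1 < len(self.states) else self.loop_start

    def future(self, i: int) -> Iterator[int]:
        """Positions i, i+1, ... in temporal order; later positions only repeat."""
        for _ in range(len(self.states)):
            yield i
            i = self.succ(i)


def holds(f: Formula, t: Lasso, i: int = 0) -> bool:
    """Decide whether trace t satisfies formula f at position i."""
    if isinstance(f, str):                      # atomic proposition
        return f in t.states[i]
    op, *args = f
    if op == "not":
        return not holds(args[0], t, i)
    if op == "and":
        return holds(args[0], t, i) and holds(args[1], t, i)
    if op == "or":
        return holds(args[0], t, i) or holds(args[1], t, i)
    if op == "X":                               # next
        return holds(args[0], t, t.succ(i))
    if op == "F":                               # finally
        return any(holds(args[0], t, j) for j in t.future(i))
    if op == "G":                               # globally
        return all(holds(args[0], t, j) for j in t.future(i))
    if op == "U":                               # psi U phi
        for j in t.future(i):
            if holds(args[1], t, j):
                return True                     # phi reached, psi held before
            if not holds(args[0], t, j):
                return False                    # psi broken before phi
        return False                            # phi never occurs
    if op == "R":                               # psi R phi == not(not psi U not phi)
        return not holds(("U", ("not", args[0]), ("not", args[1])), t, i)
    if op == "W":                               # psi W phi == G psi or (psi U phi)
        return holds(("G", args[0]), t, i) or holds(("U", args[0], args[1]), t, i)
    raise ValueError(f"unknown operator: {op!r}")


if __name__ == "__main__":
    idle = Lasso(prefix=[], loop=[set()])          # nothing ever happens
    print(holds(("W", ("not", "q"), "p"), idle))   # True
    print(holds(("U", ("not", "q"), "p"), idle))   # False
```

On the trace in which nothing ever happens, the weak-until formula holds while the until formula does not, because \({\mathcal {U}}\) additionally requires its second operand to occur eventually.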

In model checking, LTL formulas commonly have two possible truth value states, namely true and false. When monitoring a compliance specification in a running system, it may be of interest not only whether the specification is satisfied or violated but also whether further state changes are possible that could resolve or cause a violation of it. That is, the runtime state of a specification is either temporary or permanent. Consequently, an LTL specification at runtime is either temporarily satisfied, temporarily violated, permanently satisfied or permanently violated (cf. Bauer et al. [6, 7]). Several existing studies make use of the concept of four LTL truth value states (cf. Pešić et al. [62], De Giacomo et al. [24] and Maggi et al. [54]).

2.2 Event Processing Language (EPL)

In this section, the event processing language (EPL) [30] and its application to runtime monitoring of compliance specifications are discussed. An EPL-based specification consists of an initial truth value, which is either temporarily satisfied or temporarily violated, and one or more query–listener pairs. A query–listener pair causes a truth value change in the specification as soon as a matching event pattern is observed in the event stream. Consequently, an EPL-based compliance specification always consists of EPL queries that are composed of EPL operators and listeners that cause truth value changes to temporarily satisfied, temporarily violated, permanently satisfied or permanently violated, as already discussed for LTL in Sect. 2.1. The truth value state of the specification is updated by a positive match of the related expression in the event stream. Based on the notation suggested by Czepa et al. [18, 19], the short notation \(\texttt {<EPL query> ==> <truth value>}\) is used for an EPL query–listener pair responsible for changing the truth value of a compliance rule. Further truth value changes are not possible once a permanent state, namely either permanently violated or permanently satisfied, has been reached. According to the EPL reference [30], the semantics is given as shown in Table 3.
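The interplay of an initial truth value and query–listener pairs can be sketched as a small state machine. The Python code below is only an illustration of that idea: it does not use Esper or real EPL syntax, each query is replaced by a plain predicate over the events observed so far, and the names `Truth`, `Monitor` and `precedence_monitor` as well as the rule that the first matching pair wins are assumptions.

```python
from enum import Enum
from typing import Callable, List, Sequence, Tuple


class Truth(Enum):
    TEMPORARILY_SATISFIED = "temporarily satisfied"
    TEMPORARILY_VIOLATED = "temporarily violated"
    PERMANENTLY_SATISFIED = "permanently satisfied"
    PERMANENTLY_VIOLATED = "permanently violated"

    @property
    def permanent(self) -> bool:
        return self in (Truth.PERMANENTLY_SATISFIED, Truth.PERMANENTLY_VIOLATED)


# Stand-in for an EPL query: a predicate over the events seen so far.
Query = Callable[[Sequence[str]], bool]


class Monitor:
    """Initial truth value plus (query, target truth value) pairs."""

    def __init__(self, initial: Truth, pairs: List[Tuple[Query, Truth]]):
        if initial.permanent:
            raise ValueError("the initial truth value must be a temporary state")
        self.state = initial
        self.pairs = pairs
        self.history: List[str] = []

    def observe(self, event: str) -> Truth:
        """Feed one event; a matching query triggers its listener."""
        self.history.append(event)
        if self.state.permanent:          # permanent states are never left
            return self.state
        for query, target in self.pairs:  # assumption: first match wins
            if query(self.history):
                self.state = target
                break
        return self.state


def precedence_monitor(first: str, second: str) -> Monitor:
    """Hypothetical rule: `second` must not occur before `first` has occurred."""
    return Monitor(
        initial=Truth.TEMPORARILY_SATISFIED,
        pairs=[
            # `second` observed although `first` has not been seen yet
            (lambda h: h[-1] == second and first not in h[:-1],
             Truth.PERMANENTLY_VIOLATED),
            # `first` observed while the rule is still undecided
            (lambda h: h[-1] == first, Truth.PERMANENTLY_SATISFIED),
        ],
    )
```

Feeding the events `"c"`, `"a"`, `"b"` into `precedence_monitor("a", "b")` moves the state from temporarily satisfied to permanently satisfied at `"a"`, after which the later `"b"` no longer changes it.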

Table 3 Semantics of EPL operators

2.3 Property specification patterns (PSP)

Dwyer et al. proposed the property specification patterns (PSP) [27, 28], a collection of recurring specification patterns. For each pattern, there exist transformation rules to underlying formal representations, including LTL and CTL. The patterns are categorized into Occurrence Patterns and Order Patterns as shown in Tables 4 and 5, respectively. Figure 1 shows the area of effect of available scopes, whereas Table 6 discusses their meaning.

Table 4 Intents of occurrence patterns
Table 5 Intents of order patterns
Fig. 1 Available scopes for property specification patterns (shaded areas indicate the extent over which the pattern must hold)

The available runtime states of PSP specifications are no different from those of LTL and EPL specifications (cf. Sects. 2.1 and 2.2), namely temporarily satisfied, temporarily violated, permanently satisfied and permanently violated.
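The transformation from a pattern to an underlying LTL formula can be illustrated with a small lookup table. The Python sketch below covers five patterns and only the global scope, using the LTL mappings commonly given for these patterns in the pattern catalog; the function name `psp_to_ltl`, the ASCII operator spelling (`G`, `F`, `W`, `!`, `->`) and the restriction to the global scope are assumptions made for illustration.

```python
# Global-scope LTL templates; p is the constrained event, s the second event.
GLOBAL_SCOPE_TEMPLATES = {
    "absence": "G (!{p})",                # p never occurs
    "existence": "F ({p})",               # p occurs at least once
    "universality": "G ({p})",            # p holds at every position
    "precedence": "(!{p}) W ({s})",       # s precedes p
    "response": "G (({p}) -> F ({s}))",   # s responds to p
}

BINARY_PATTERNS = {"precedence", "response"}


def psp_to_ltl(pattern: str, p: str, s: str = "") -> str:
    """Return an LTL formula string for a pattern with global scope."""
    key = pattern.lower()
    if key not in GLOBAL_SCOPE_TEMPLATES:
        raise ValueError(f"unsupported pattern: {pattern!r}")
    if key in BINARY_PATTERNS and not s:
        raise ValueError(f"pattern {pattern!r} needs a second event")
    return GLOBAL_SCOPE_TEMPLATES[key].format(p=p, s=s)


if __name__ == "__main__":
    print(psp_to_ltl("precedence", "use", "check"))   # (!use) W (check)
    print(psp_to_ltl("response", "alarm", "ack"))     # G ((alarm) -> F (ack))
```

The modeler only chooses a pattern and fills in the events, while the temporal operators stay hidden in the template, which is the kind of abstraction the pattern-based approach aims for.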

3 Experiment planning

This section describes the outcome of the experiment planning phase, and it provides all information that is required for a replication of the study.

3.1 Goals

The primary goal of the experiment is measuring the construct understandability of representative languages that are suitable for modeling compliance specifications. This construct is defined by the syntactic correctness, semantic correctness and response time of the answers given by the participants.

This study differentiates between syntactic and semantic correctness as it enables a more fine-grained analysis. This is in line with Chomsky [11], who stressed that the study of syntax must be independent from the study of semantics. Numerous existing studies differentiate between syntactic and semantic correctness (cf. Ferri et al. [32], Hindawi et al. [39] and Harel and Rumpe [37]). In particular, an LTL formula can be syntactically correct without capturing the desired meaning. For example, the specification “activity 2 must not happen unless activity 1 has already happened” is not covered at all in a semantic way by the syntactically correct formula “\({\mathcal {F}}~activity_1~\wedge ~{\mathcal {F}}~activity_2\).” In contrast, the formula “\(\lnot ~activity_2~{\mathcal {W}}~activity_1\)” is both syntactically and semantically correct; weak until is used because the specification does not require activity 1 to occur at all.

Table 6 Meaning of scopes

In addition to the understandability construct, the experiment aims at studying the perceived ease of application of the languages and the perceived correctness of the formalized compliance specifications.

3.2 Experimental units

All 215 participants of the experiment are students who enrolled in the courses “Software Engineering 2 Lab (SE2)” and “Advanced Software Engineering Lab (ASE)” at the Faculty of Computer Science, University of Vienna, Austria. Two kinds of participants can be differentiated:

  • 149 participants of the bachelor-level course SE2 are used as proxies for novice software engineers, designers or developers.

  • 66 participants of the master-level course ASE are used as proxies for moderately advanced software engineers, designers or developers.

Using students as proxies for non-expert users is not an issue according to Kitchenham et al. [50]. Other studies even suggest that students can be used as proxies for experts under certain circumstances (cf. Höst et al. [43], Runeson [72], Svahnberg et al. [78] and Salman et al. [73]). As an incentive for participation and proper preparation, up to 10 bonus points (\(10 \%\) of total course points) were awarded based on the participant’s performance in the experiment. All participants were randomly allocated to experiment groups.

3.3 Experimental material and tasks

In total, the experiment comprised five distinct tasks stemming from three different domains, as shown in Table 7. Tasks 1 and 2 are related to compliance in the context of lending, Task 3 focuses on compliance regarding hospital processes, and Tasks 4 and 5 are based on compliance specifications in the construction industry. Each task was presented to the participants by stating first the context, then the specification and last the available elements that are to be used during formal modeling of the specification. For an example of how experimental tasks were presented to the participants, see Fig. 2. The full experimental material is available online (cf. Czepa et al. [22]). For sample solutions of all experimental tasks, see “Appendix A.” It is important to note that these sample solutions show just one way to model the compliance specifications. In the grading process, each proposed solution was carefully assessed under constant consideration that the sample solution might not be the only way to correctly formalize the specification.

Table 7 Experimental tasks
Fig. 2 Sample task as presented to the participants

3.4 Hypotheses, parameters and variables

PSP abstracts underlying formal representations, such as LTL formulas, by high-level patterns with the intention to facilitate reuse and to enable ease of use. That is, the pattern representations are assumed to provide a better understandability than their underlying LTL formulas. EPL-based constraints are composed of an initial truth value and one or more query–listener pairs that change the truth value state. In contrast to LTL where meaning is encoded in a formula, different concerns, namely defining the initial truth value and change criteria for the truth value, are separated from each other in EPL-based constraints. This separation of concerns is assumed to facilitate the understandability of EPL-based constraints as opposed to LTL formulas where this separation is not present.

Consequently, we hypothesized that PSP, as a highly abstract pattern language, is easier to understand than LTL and EPL and that EPL, due to separation of concerns, is easier to understand than LTL. Accordingly, the following hypotheses for the controlled experiment were formulated:

  • \(H_{0,1}\): There is no difference in terms of understandability between PSP and LTL.

  • \(H_{A,1}\): PSP has a higher level of understandability than LTL.

  • \(H_{0,2}\): There is no difference in terms of understandability between PSP and EPL.

  • \(H_{A,2}\): PSP has a higher level of understandability than EPL.

  • \(H_{0,3}\): There is no difference in terms of understandability between EPL and LTL.

  • \(H_{A,3}\): EPL has a higher level of understandability than LTL.

The construct understandability is measured by three interval-scaled dependent variables, namely:

  • the syntactic correctness achieved in trying to formally model the compliance specifications,

  • the semantic correctness achieved in trying to formally model the compliance specifications,

  • the response time, which is the time it took to complete the experimental tasks.

In addition, there are hypotheses that are concerned with the participants’ opinion on the languages under investigation, namely:

  • \(H_{0,4}\): There is no difference in terms of perceived correctness between PSP and LTL.

  • \(H_{A,4}\): PSP has a higher level of perceived correctness than LTL.

  • \(H_{0,5}\): There is no difference in terms of perceived correctness between PSP and EPL.

  • \(H_{A,5}\): PSP has a higher level of perceived correctness than EPL.

  • \(H_{0,6}\): There is no difference in terms of perceived correctness between EPL and LTL.

  • \(H_{A,6}\): EPL has a higher level of perceived correctness than LTL.

  • \(H_{0,7}\): There is no difference in terms of perceived ease of application between PSP and LTL.

  • \(H_{A,7}\): PSP has a higher level of perceived ease of application than LTL.

  • \(H_{0,8}\): There is no difference in terms of perceived ease of application between PSP and EPL.

  • \(H_{A,8}\): PSP has a higher level of perceived ease of application than EPL.

  • \(H_{0,9}\): There is no difference in terms of perceived ease of application between EPL and LTL.

  • \(H_{A,9}\): EPL has a higher level of perceived ease of application than LTL.

The dependent variables associated with these hypotheses are ordinal scaled since the data were gathered by agree–disagree scales. In accordance with the results of a study by Revilla et al. [68], each scale had five categories.

3.5 Experiment design and execution

According to Wohlin et al. [86], “it is important to try to use a simple design and try to make the best possible use of the available subjects.” For that reason, a completely randomized experiment design with one alternative per experimental unit was used. That is, each participant is randomly assigned to exactly one experiment group. This assignment took place fully automated in an unbiased manner.
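A completely randomized design with one alternative per experimental unit can be sketched in a few lines. The Python code below is a minimal illustration and not the procedure actually used in this study, which is not described in detail here; the function name `assign_groups`, the independent draw per participant and the optional seed are assumptions.

```python
import random
from typing import Dict, Optional, Sequence


def assign_groups(
    participants: Sequence[str],
    groups: Sequence[str] = ("LTL", "PSP", "EPL"),
    seed: Optional[int] = None,
) -> Dict[str, str]:
    """Assign every participant to exactly one group by an independent random draw."""
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    return {participant: rng.choice(groups) for participant in participants}


if __name__ == "__main__":
    # Hypothetical participant identifiers.
    allocation = assign_groups([f"P{i:03d}" for i in range(1, 11)], seed=42)
    for participant, group in allocation.items():
        print(participant, group)
```

Independent draws generally lead to groups of unequal size; a design that needs balanced groups would instead shuffle the participants and deal them out in turn.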

Preparation documents were distributed to the participants one week before the experiment run. In these documents, the basics of the approaches were discussed, and the participants were encouraged to prepare for the experiment by applying the assigned behavioral constraint representation before the experiment session. To avoid bias, all three preparation documents are similar in length and depth. The approaches were presented in an approachable manner to the participants, as suggested by existing research on teaching undergraduate students in theoretical computer science, formal methods and logic (cf. Habiballa and Kmeť [34], Knobelsdorf and Frede [51], Carew et al. [10] and Spichkova [77]). The training material used is available online (cf. Czepa et al. [22]).

3.6 Procedure

To ensure a smooth procedure and to avoid unnecessary stress, the preparation document informed the participants about the procedure on the experiment day in as much detail as possible. Seating arrangements were made to limit chances for misbehavior, and the participants were instructed how to find a suitable seat. The participants were allowed to use printouts of the preparation material and notes at their own discretion. After a brief discussion of the contents and structure of the experiment document by the experimenters, the participants started working on the experimental tasks. The duration of the experiment was limited to 90 min. For organizational reasons, the experiment was done on paper, and time record keeping was the responsibility of each participant (please see Sect. 5.2 for a discussion of this potential threat to validity). After experiment execution, the answers given were evaluated. For that purpose, a method proposed by Lytra et al. [53] was applied, which comprises the independent evaluation of the answers by three experts, and a discussion of large differences in grading until a consensus is achieved. The attempted formalization in each experimental task was graded independently by the first, second and third author, who are experts in the investigated languages. To mitigate the risk of grading bias, the participants’ answers were graded in random order by each of the experts. Figures 3 and 4 depict the grading process schematically from the individual and overall perspective, respectively. This evaluation of more than a thousand distinct answers comprising approximately 17,000 constraints took about half a year alongside the authors’ normal responsibilities such as teaching and other research. All other given answers, which are related to previous knowledge, time records and agree–disagree scale responses, were digitized and double-checked subsequently.
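The decision step of the grading procedure, in which three independent grades either stand or trigger a discussion, can be sketched as follows. This Python code is a hypothetical illustration: the threshold of 20 points for a “large difference” and the use of the mean when the graders roughly agree are assumptions, since neither detail is specified in the description above.

```python
from statistics import mean
from typing import Optional, Sequence


def combine_grades(grades: Sequence[float], max_spread: float = 20.0) -> Optional[float]:
    """Combine three independent grades (0-100) for one answer.

    Returns the mean grade if the graders roughly agree, or None if the
    spread is large and the answer has to be discussed until consensus.
    Both the threshold and the averaging are assumptions of this sketch.
    """
    if len(grades) != 3:
        raise ValueError("exactly three independent grades are expected")
    spread = max(grades) - min(grades)
    if spread > max_spread:
        return None  # large difference: discussion required
    return mean(grades)


if __name__ == "__main__":
    print(combine_grades([60, 70, 65]))   # 65
    print(combine_grades([20, 80, 50]))   # None
```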

Fig. 3 Individual grading procedure

4 Analysis

This section is concerned with the treatment and statistics of the data.

4.1 Data set preparation

To preserve the integrity of the acquired data, it was necessary to drop potentially unreliable items. In total, the data of eight participants were not considered in the statistical evaluations. Table 8 summarizes all dropped participants including the reasons for non-consideration.

4.2 Descriptive statistics

In this section, the acquired data (cf. Czepa et al. [22]) are analyzed with the help of descriptive statistics.

Fig. 4 Overall grading procedure

Table 9 shows the number of observations, central tendency and dispersion of the dependent variables syntactic correctness, semantic correctness and response time per group. In the bachelor-level course Software Engineering 2, the sample size is relatively large and evenly distributed (49 : 47 : 49). In the master-level course Advanced Software Engineering, there are fewer than half as many observations. Unfortunately, the number of participants of the group with the smallest number of observations, namely PSP, was further diminished by the exclusion of three participants (cf. Sect. 4.1). In consequence, the distribution in the ASE course is 21 : 17 : 24. The median and mean correctness values of the LTL groups in both SE2 and ASE are smaller than those of the other two groups. In SE2, the mean syntactic correctness of the LTL group is \(56.52 \%\), thus about \(5 \%\) less than in the EPL group (\(61.82 \%\)) and about \(12 \%\) less than in the PSP group (\(68.64 \%\)), and the mean semantic correctness of the LTL group is \(28.49 \%\), so about \(10 \%\) below the EPL group (\(38.20 \%\)) and \(22 \%\) below the PSP group (\(50.19 \%\)). In ASE, the mean syntactic correctness of the LTL group is \(57.01 \%\), thus about \(8 \%\) less than in the PSP group (\(65.13 \%\)) and about \(15 \%\) less than in the EPL group (\(71.91 \%\)). While the PSP group overall achieved a higher syntactic and semantic correctness than the EPL group in SE2, this ranking is reversed in the ASE course, where EPL participants overall achieved a higher syntactic and semantic correctness than their colleagues of the PSP group. The mean syntactic correctness achieved by the PSP group (\(68.64 \%\)) is about \(7 \%\) higher than in the EPL group (\(61.82 \%\)) in SE2, whereas the EPL group achieved an about \(7 \%\) higher mean syntactic correctness (\(71.91 \%\)) than the PSP group (\(65.13 \%\)) in ASE. In SE2, the mean semantic correctness of the PSP group (\(50.19 \%\)) is about \(12 \%\) higher than in the EPL group (\(38.20 \%\)). In ASE, the mean semantic correctness is about \(3 \%\) higher in the EPL group (\(49.71 \%\)) than in the PSP group (\(46.93 \%\)). The mean and median response times are overall shorter in the SE2 course than in the ASE course. In SE2, the mean response time of the LTL group (43.49 min) is slightly shorter than in the EPL group (44.87 min) and a few minutes shorter than in the PSP group (48.68 min). In ASE, the mean response time of the LTL group (52.32 min) is 3–4 min shorter than in the PSP group (55.99 min) and 6–7 min shorter than in the EPL group (58.82 min).

Skew is a measure of the shape of a distribution. A positive skew value indicates a right-tailed distribution (e.g., more cases of low correctness than high correctness), a negative skew value indicates a left-tailed distribution (e.g., more cases of high correctness than low correctness), and a skew value close to zero indicates a symmetric distribution. Differences in skew are, for example, present

  • between the semantic correctness distributions of LTL (0.75 indicating that the mass of the distribution is concentrated at lower levels of correctness) and PSP (\(-\,0.08\) indicating a rather symmetric distribution) in SE2,

  • between the syntactic correctness distributions of LTL (\(-0.15\) indicating a curve that is slightly leaned to the right) and EPL (\(-0.9\) indicating a distribution with only few measurements in lower correctness ranges) in ASE,

  • between the semantic correctness distributions of LTL (0.6 indicating higher densities in lower correctness ranges) and EPL (\(-0.37\) indicating higher densities in higher correctness ranges) in ASE, and

  • between the response time distributions of LTL (0.42 indicating a left-leaning curve) and PSP (\(-0.61\) indicating a right-leaning curve) in ASE.

Table 8 Summary of dropped participants
Table 9 Number of observations, central tendency and dispersion of the dependent variables semantic/syntactic correctness and response time per group and course

Kurtosis is another measure for the shape of a distribution which focuses on the general tailedness. Positive (excess) kurtosis values indicate heavier tails than those of a normal distribution, typically together with a sharper peak, whereas negative kurtosis values indicate lighter tails and a flatter shape. The most severe difference in kurtosis is present between the syntactic correctness distributions of the LTL group (1.22) and PSP group (\(-1.02\)).
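Both shape measures can be computed from central moments. The Python sketch below uses the plain moment-based (population) coefficients as an assumption; the statistics software used for Table 9 may apply a different, bias-corrected estimator, so values can differ slightly. The function names and the sample data are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def central_moment(xs: Sequence[float], k: int) -> float:
    """k-th central moment: mean of (x - mean)^k."""
    m = fmean(xs)
    return fmean([(x - m) ** k for x in xs])


def skewness(xs: Sequence[float]) -> float:
    """Moment coefficient of skewness, m3 / m2^(3/2)."""
    m2 = central_moment(xs, 2)
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return central_moment(xs, 3) / m2 ** 1.5


def excess_kurtosis(xs: Sequence[float]) -> float:
    """Excess kurtosis, m4 / m2^2 - 3 (zero for a normal distribution)."""
    m2 = central_moment(xs, 2)
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant data")
    return central_moment(xs, 4) / m2 ** 2 - 3.0


if __name__ == "__main__":
    # Hypothetical correctness scores with many low and few high values.
    scores = [10, 15, 20, 20, 25, 30, 35, 60, 90]
    print(round(skewness(scores), 2), round(excess_kurtosis(scores), 2))
```

A sample with many low and few high values, like the hypothetical scores above, yields a positive skew, which matches the reading of positive skew as a right-tailed distribution.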

So far, the dependent variables were analyzed on the basis of separating between course groups, which reflects the participants’ academic level of progression. Next, the dependent variables are investigated focusing on participants with industrial working experience. Table 10 summarizes the descriptive statistics of the dependent variables when focusing on participants with industrial working experience of one year and above. Based on the demographic data collected (cf. “Appendix D”), we consider this subset of participants to be close to the population of industrial practitioners with basic to modest experience. The mean syntactic correctness in the LTL group (\(58.65 \%\)) is about \(8 \%\) lower than in the PSP (\(66.79 \%\)) and EPL (\(66.01 \%\)) groups. The PSP group achieved the highest degree of semantic correctness (\(48.58 \%\)), closely followed by the EPL group (\(44.46 \%\)). The LTL group achieved \(30.51 \%\) semantic correctness, which is noticeably lower than in the two other groups. Present differences in skew and kurtosis are indications of differences in central location and distribution shape.

Table 10 Number of observations, central tendency and dispersion of the dependent variables semantic/syntactic correctness and response time per group of participants with working experience \(\ge 1\) year
Fig. 5 Participants’ perceived correctness

For additional descriptive statistics of the dependent variables syntactic correctness, semantic correctness and response time, we refer the interested reader to “Appendix B.”

With regard to the stacked bar chart (cf. Bryer and Speerschneider [44]) in Fig. 5a showing the perceived correctness in SE2, the share of strongly agree responses to the statement “I think that my transformation of the requirement to the constraint language is correct” is \(2 \%\) higher in PSP (\(37 \%\)) than in the other two groups, and the share of (strongly) disagree answers is \(22 \%\) in PSP while it is higher in LTL (\(37 \%\)) and EPL (\(33 \%\)). With \(41 \%\), the share of neutral answers is largest in PSP. In ASE (cf. Fig. 5b), the participants appear to be overall slightly more confident regarding the correctness of their formalizations. The largest share of (strongly) agree responses is again present in the PSP group (\(51 \%\)), followed by LTL (\(46 \%\)) and EPL (\(44 \%\)). According to the stacked bar charts in Fig. 5, the perceived correctness of PSP appears to be slightly higher than in the other experiment groups in SE2, while EPL has a slightly lower perceived correctness than the other languages in ASE. According to Fig. 5c, a large share (\(44 \%\)) of participants with industry experience in the PSP group is undecided whether the given answer is correct. The percentage of neutral answers of participants with industry experience is lowest in the EPL group (\(30 \%\)) and only slightly higher in the LTL group. The largest share of (strongly) agree responses of participants with industry experience is present in the EPL group (\(42 \%\)), followed by LTL (\(38 \%\)) and PSP (\(34 \%\)).

Fig. 6 Participants’ perceived ease of application

Figure 6 contains stacked bar charts of the participants’ perceived ease of application of the tested languages. Interestingly, there appears to be a strong similarity between the perceived correctness and perceived ease of application responses in SE2 regarding the ranking of the approaches (cf. Figs. 6a, 5a). PSP with \(25 \%\) (strongly) agreeing and \(42 \%\) (strongly) disagreeing appears to be slightly easier to apply than EPL with \(21 \%\) (strongly) agreeing and \(48 \%\) (strongly) disagreeing, and LTL with \(17 \%\) (strongly) agreeing and \(48 \%\) (strongly) disagreeing is perceived as slightly more difficult to apply than EPL. In ASE (cf. Fig. 6b), the application of PSP is perceived to be even easier than in SE2. Interestingly, EPL is perceived to be similarly easy to apply as PSP. As in SE2, LTL is ranked last in perceived ease of application. Figure 6c focuses on industry participants and reveals striking differences between the groups. The perceived ease of application is rated highest in the EPL group with \(33 \%\) (strongly) agreeing and \(38 \%\) (strongly) disagreeing, which means that there is still a shift toward a negative rating. The strongest shift toward low ease of application is present in the LTL group with only \(7 \%\) (strongly) agreeing and \(52 \%\) (strongly) disagreeing. In between are the results of the PSP group with \(22 \%\) (strongly) agreeing and \(49 \%\) (strongly) disagreeing.

4.3 Statistical inference

Before applying any statistical test, its model assumptions must be tested and met. For a discussion of whether or not the normality assumption is violated by the acquired data, see “Appendix C.” Since there is uncertainty regarding normality, a core assumption of parametric testing, nonparametric testing is the preferable approach.

Table 11 Cliff’s d of syntactic/semantic correctness and response time in SE2, one-tailed with confidence intervals calculated for \(\alpha = 0.05\) (cf. Cliff [14] and Rogmann [70]), adjusted p-values (cf. Benjamini and Hochberg [8]) [level of significance: * for \(\alpha = 0.05\), ** for \(\alpha = 0.01\), *** for \(\alpha = 0.001\)] and effect size magnitudes (cf. Kitchenham et al. [49])

Standard nonparametric tests like Kruskal–Wallis cannot be applied if distribution shapes differ apart from their central location (cf. descriptive statistics in “Appendix B”), so Cliff’s delta (cf. Cliff [14] and Rogmann [70]), a robust nonparametric test, is applied. Table 11 summarizes the test results for the bachelor-level course SE2. To take multiple testing into account, the p-values are adjusted based on the method proposed by Benjamini and Hochberg [8]. There is a highly significant result with a medium effect size magnitude, indicating that PSP provides a higher syntactic correctness than LTL. After p-value adjustments, no such result is present in the remaining syntactic correctness tests. All semantic correctness test results are highly significant with medium- to large-sized effects. There is no significant difference between the response times. Consequently, \(H_{0,1}\) is rejected on the basis of syntactic and semantic correctness whereas \(H_{0,2}\) and \(H_{0,3}\) can only be rejected based on semantic correctness.
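To make the effect size measure concrete, the following minimal sketch computes the point estimate of Cliff’s delta as the difference between the proportion of pairs in which an observation of the first group exceeds an observation of the second group and the proportion of pairs in which it is smaller. It is an illustrative assumption, not the implementation used in this study (which relied on the R package orddom); the confidence intervals and p-values reported in the tables are not reproduced here, and the scores are hypothetical.

```python
def cliffs_delta(xs, ys):
    """Point estimate of Cliff's delta: P(x > y) - P(x < y) over all pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    # Ties contribute to neither count, so they pull delta toward 0.
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical correctness scores in percent, not data from the study.
group_a = [70, 80, 90, 60]
group_b = [50, 65, 55, 75]

print(cliffs_delta(group_a, group_b))  # 0.625
```

The estimate ranges from \(-1\) (every observation of the first group is smaller) to \(1\) (every observation is larger), and swapping the groups flips the sign.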

Table 12 Cliff’s d of syntactic/semantic correctness and response time in ASE, one-tailed with confidence intervals calculated for \(\alpha = 0.05\) (cf. Cliff [14] and Rogmann [70]), adjusted p-values (cf. Benjamini and Hochberg [8]) [level of significance: * for \(\alpha = 0.05\), ** for \(\alpha = 0.01\), *** for \(\alpha = 0.001\)] and effect size magnitudes (cf. Kitchenham et al. [49])

In the master-level course ASE (cf. Table 12), there is a large-sized difference in syntactic correctness between EPL and LTL. Regarding semantic correctness, there are large-sized effects between PSP/LTL and EPL/LTL, indicating that the former outperforms the latter mentioned approach. As in SE2, there are no significant differences regarding the response times. Consequently, \(H_{0,1}\) can only be rejected on the basis of semantic correctness, whereas \(H_{0,3}\) is rejected based on both types of correctness.

Table 13 contains the test results for participants with industry experience. There is no significant difference in terms of syntactic correctness and response time. Similarly to ASE, there is no significant difference in semantic correctness between PSP and EPL, while there are significant differences with large-sized effects when comparing PSP against LTL and EPL against LTL.

Table 13 Cliff’s d of syntactic/semantic correctness and response time for participants with industry experience \(\ge 1\) year, one-tailed with confidence intervals calculated for \(\alpha = 0.05\) (cf. Cliff [14] and Rogmann [70]), adjusted p-values (cf. Benjamini and Hochberg [8]) [level of significance: * for \(\alpha = 0.05\), ** for \(\alpha = 0.01\), *** for \(\alpha = 0.001\)] and effect size magnitudes (cf. Kitchenham et al. [49])
Table 14 Cliff’s d of perceived correctness and ease of application in SE2 and ASE, one-tailed with confidence intervals calculated for \(\alpha = 0.05\) (cf. Cliff [14] and Rogmann [70]), adjusted p-values (cf. Benjamini and Hochberg [8]) [level of significance: * for \(\alpha = 0.05\), ** for \(\alpha = 0.01\), *** for \(\alpha = 0.001\)] and effect size magnitudes (cf. Kitchenham et al. [49])

Tables 14 and 15 summarize the test results regarding perceived correctness and perceived ease of application. Almost all test results are not significant with two exceptions: (1) A significant test result (\(p = 0.0316\)) with a medium-sized effect is present in SE2 between PSP and LTL with regard to perceived correctness. Consequently, \(H_{0,4}\) can be rejected in SE2. That is, PSP participants are significantly more confident that the formalization is correct than LTL participants at the bachelor level while such an effect is not measurable at the master level or within the sample of industry participants. (2) Participants with industry experience rate the ease of application of EPL significantly higher than of LTL (\(p = 0.0023\)). Consequently, \(H_{0,9}\) can be rejected for participants with industry experience.
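The p-value adjustment according to Benjamini and Hochberg [8] referenced in the table captions can be sketched as follows. This is a minimal illustration under the assumption of the standard step-up procedure (each sorted p-value is multiplied by the number of tests divided by its rank, and monotonicity is then enforced from the largest p-value downward); the input values are hypothetical and not taken from the tables.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the
    # adjusted values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values of four tests, not data from the study.
raw = [0.01, 0.04, 0.03, 0.20]
print([round(p, 4) for p in benjamini_hochberg(raw)])  # [0.04, 0.0533, 0.0533, 0.2]
```

An adjusted p-value is then compared against the chosen level of significance in the same way as an unadjusted one.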

Table 15 Cliff’s d of perceived correctness and ease of application for participants with industry experience, one-tailed with confidence intervals calculated for \(\alpha = 0.05\) (cf. Cliff [14] and Rogmann [70]), adjusted p-values (cf. Benjamini and Hochberg [8]) [level of significance: * for \(\alpha = 0.05\), ** for \(\alpha = 0.01\), *** for \(\alpha = 0.001\)] and effect size magnitudes (cf. Kitchenham et al. [49])

The statistics software R (cf. Footnote 7) was used for all statistical analyses. In particular, the following libraries were used in the course of the performed statistical evaluations: biotools [75], car [33], ggplot2 [85], mvnormtest [76], mvoutlier [63], orddom [70], psych [67] and usdm [58].

5 Discussion

This section discusses the results and threats to validity of the study.

5.1 Evaluation of results and implications

The experimental goal was stated as: Analyze LTL, PSP and EPL for the purpose of their evaluation with respect to their understandability related to modeling compliance specifications from the viewpoint of the novice and moderately advanced software engineer, designer or developer in the context/environment of the Software Engineering 2 Lab and the Advanced Software Engineering Lab courses at the Faculty of Computer Science of the University of Vienna. Due to the large number of participants with industry experience, it became possible to consider a third population, namely participants with industry experience, who function as proxies for industrial practitioners with basic to modest industry experience. Based upon the stated goal, questions concerning understandability were generated. The understandability construct focuses on the degree of syntactic and semantic correctness achieved and on the time spent on modeling compliance specifications. The results per question are summarized in Table 16. By differentiating between syntactic and semantic correctness, it became possible to reveal that differences in understandability in formal modeling of compliance specifications predominantly lie in semantic correctness. Almost all test results regarding semantic correctness are highly significant with large-sized effects. Interestingly, no significant difference in semantic correctness is present between the pattern-based PSP approach and the CEP-based EPL language in the master-level course ASE and in the subset of participants with industry experience. That might imply that more experienced users are able to cope equally well with both approaches. Aside from that, the results suggest that the pattern-based PSP approach is more understandable than EPL and LTL and that EPL provides a higher level of understandability than LTL. In terms of syntactic correctness, PSP seems to be more understandable than LTL for less experienced users, while EPL seems to be more understandable than LTL for more experienced users. This study did not reveal any significant differences in response time. Regarding perceived correctness and perceived ease of application, there are two significant test results, which imply that transformations to PSP are perceived to be more correct than LTL transformations by less experienced users, and that more experienced users with industry experience find EPL easier to apply than LTL.

Overall, the results imply that the pattern-based PSP approach has advantages with regard to understandability. Therefore, the pattern-based approach seems to be particularly well suited for modeling compliance specifications. Moreover, the results indicate that EPL is more understandable than LTL. This could be important in cases where the set of available PSP patterns is not sufficient to model a compliance specification. In such cases, the compliance specification could be encoded in EPL for runtime verification or an extension of the pattern catalog could take place. In this regard, EPL specifications could be used to aid the creation of new patterns with underlying LTL formalizations by checking the plausibility of the LTL formula (cf. Czepa et al. [18, 19]).

Moreover, the results are overall in line with two controlled experiments on the understandability of already existing formal specifications in LTL, EPL and PSP carried out by Czepa and Zdun [17]. The results of these controlled experiments with 216 participants in total suggested that existing specifications in PSP are significantly easier to understand than existing specifications in EPL and LTL. Moreover, the results implied that existing specifications in EPL are significantly easier to understand than existing specifications in LTL. The correctness of understanding was evaluated by letting the participant decide whether a truth value is the correct truth value of a specification, given a specific trace. In contrast to the current study, which focuses on the formal modeling of compliance specifications, no major differences between novice and moderately advanced users were found in understandability of existing specifications. Interestingly, the response times between the experimental groups were significantly different in most cases, an effect which appears to be absent during modeling (cf. Sect. 4.3).

Table 16 GQM summary

5.2 Threats to validity

In the following, all known threats that might have an impact on the validity of the results of this study are discussed.

5.2.1 Threats to internal validity

Threats to internal validity are unobserved variables that might have an undesired impact on the result of the experiment by disturbing the causal relationship of independent and dependent variables. There exist several threats to internal validity, which must be discussed:

  • History effects refer to events that happen in the environment resulting in changes in the conditions of a study. The short duration of the study limits the possibility of changes in environmental conditions, and none were observed. The occurrence of such effects prior to the study cannot be entirely ruled out. However, in such a case, it would be extremely unlikely that the scores of one experiment group are more affected than another, because of the random allocation of participants to groups.

  • Maturation effects refer to the impact the passage of time has on an individual. Like history effects, maturation effects are rather problematic in long-term studies. Since the duration of the experiment was short, maturation effects are considered to be of minor importance, and none were observed.

  • Testing effects comprise learning effects and experimental fatigue. Learning effects were avoided by testing each person only once. Experimental fatigue is concerned with happenings during the experiment that exhaust the participant either physically or mentally. The short time frame of the experiment session limits chances of fatigue. Neither were any signs of fatigue observed nor were there any reports by participants indicating fatigue.

  • Instrumental bias occurs if the measuring instrument (i.e., a physical measuring device or the actions/assessment of the researcher) changes over time during the experiment. Since the answers given in the experiment tasks were evaluated manually, this is a serious threat to validity. It might be the case that the experience gained in scoring some answers had an influence on subsequent evaluations. This threat was mitigated by evaluating the results in no specific prescribed order, and in case of substantial differences in grading, a discussion took place until consensus was achieved.

  • Selection bias is present if the experimental groups are unequal before the start of the experiment (e.g., severe differences in previous experience). Selection bias is likely to be more threatening in quasi-experimental research. By using an experimental design with the fundamental requirement to randomly assign participants to the different groups of the experiment, it became possible to avoid selection bias to a large extent. In addition, the investigation of the composition of the groups did not reveal any major differences between them (cf. “Appendix D”).

  • Experimental mortality is more likely to occur in long-lasting experiments since the chances for dropouts increase (e.g., participants leaving town). Due to the short time frame of this study, experimental mortality did not occur.

  • Diffusion of treatments is present if at least one group is contaminated by the treatments of at least one other group. Since the participants share the same social group and interact outside the research process as well, a cross-contamination between the groups cannot be entirely ruled out.

  • Compensatory rivalry is present if participants of a group put in extra effort when the impression arises that the treatment of another group might lead to better results than their own treatment. This threat was mitigated by clarifying that different degrees of difficulty will be considered and compensated in the calculation of bonus points.

  • Demoralization could occur if a participant is assigned to a specific group that she/he does not want to be part of. No indications of demoralization such as increased dropout rates or complaints regarding group allocation were observed.

  • Experimenter bias refers to undesired effects on the dependent variables that are unintentionally introduced by the researcher. All participants received a similar training and worked on the same set of tasks. A manual evaluation of the given answers regarding their correctness was performed. To mitigate the threat of experimenter bias in that regard, the first, second and third author performed the evaluation of all tasks individually. Differentiating between semantic and syntactic correctness overall simplified the evaluation process by enabling a separation of concerns. A potential threat in that regard could be falsely classifying defects. Therefore, after the completion of all individual evaluations, in case of substantial differences in grading, a discussion took place until consensus was achieved.

5.2.2 Threats to external validity

The external validity of a study focuses on its generalizability. In the following, potential threats that hinder a generalization are discussed. Different types of generalizations must be considered:

  • Generalizations across populations: By statistical inference, generalizations from the sample to the immediate population are made. The initial study design considered two populations, namely computer science students that enrolled in the course SE2 as proxies for novice software engineers, designers or developers, as well as computer science students that enrolled in the course ASE as proxies for moderately advanced software engineers, designers or developers. Due to the large number of participants with industry experience, it became possible to consider a third population, namely participants with industry experience, who function as proxies for industrial practitioners with basic to modest industry experience. The results of this study show interesting discrepancies between these populations. In particular, there are no significant differences in understandability between PSP and EPL for more advanced users while a significant difference is measurable when testing less experienced users. In general, this study does not intend to claim generalizability to other populations without further empirical evidence. For example, it might be plausible that leading experts working in the software industry or as business administrators perform similarly to ASE participants or the subset of participants with industry experience, but this study can neither support nor reject such claims.

  • Generalizations across treatments: The treatments are equivalent to specific tested languages. Treatment variations would likely be related to changing the contents, amount or difficulty of experiment tasks or the amount of training provided. The experiment design attempts to be as general as possible by using compliance specifications stemming from different domains and applying a moderate amount of training.

  • Generalizations across settings/contexts: The participants of this study are students who enrolled in computer science courses at the University of Vienna, Austria. The majority of the students are Austrian citizens, but there is a large presence of foreign students as well. It would be interesting to repeat the experiment in different settings/contexts to evaluate the generalizability in that regard. For example, repeating the experiment with native English speakers might lead to different and presumably better results.

  • Generalizations across time: It is hard to foresee whether the results of this study will hold over time. For example, if teaching of a specific tested language is intensified in the computer science curricula at the University of Vienna, then the students would bring in more expertise, which likely would have an impact on the results.

5.2.3 Threats to construct validity

There are potential threats to the validity of the construct that must be discussed:

  • Inexact definition and construct confounding: This study has a primary focus on the construct understandability, which is measured by the dependent variables syntactic correctness, semantic correctness and response time. This construct is exact and adequate, and the dependent variables syntactic correctness and semantic correctness make an even more fine-grained analysis possible than in existing studies that measure correctness by a single variable (cf. Feigenspan et al. [31] and Hoisl et al. [40]).

  • Mono-method bias: Due to organizational reasons, keeping time records was the personal responsibility of each participant. The participants were carefully instructed how to record start and end times, and we did not detect any irregularities (e.g., overlapping time frames or long pauses) in those records. Nonetheless, this measuring method leaves room for measuring errors, and an additional or alternative measuring method (e.g., direct observation by experimenters or performing the experiment with an online tool that handles record keeping) would reduce this threat. However, these methods would have influenced the overall study design and potentially could have introduced other threats to validity (e.g., prolonged experiment execution potentially leading to an exposure of the experiment task contents or technical problems during experiment execution). To avoid mono-method bias in evaluating the syntactic and semantic correctness, the grading was not performed by a single but by three experimenters individually.

  • Reducing levels of measurements: Both correctness variables and the response time are continuous variables. That is, the levels of measurements are not reduced. The Likert scales used in this study offer 5 answer categories rather than 7 or 11, because the latter mentioned would produce data of lower quality according to Revilla et al. [68].

  • Treatment-sensitive factorial structure: In some empirical studies, a treatment might sensitize participants to develop a different view on a construct. The actual level of understandability based on the task solutions provided was measured, so the participants’ view on this construct appears to be irrelevant.

5.2.4 Threats to content validity

Content validity is concerned with the relevance and representativeness of the elements of a study for the measured construct:

  • Relevance: The tasks of this study are based on realistic scenarios stemming from three different domains in which compliance is highly relevant (cf. Elgammal et al. [29], Rovani et al. [71], and United States Environmental Protection Agency [83]).

  • Representativeness: In the formal modeling of the compliance specifications, the use of all core temporal LTL operators and EPL operators was required, which means that the construct understandability was measured comprehensively. The use of each PSP pattern was required two or more times (cf. sample solutions of experimental tasks in “Appendix A”). Unfortunately, it was not possible to test all available pattern–scope combinations. However, the majority of specifications are based on the global scope (cf. Dwyer et al. [27, 28]), which is as well reflected in the realistic specifications used in the tasks of this experiment (cf. experimental tasks in Table 7 and sample solutions in “Appendix A”). That is, a representative subset of PSP was tested.

5.2.5 Threats to conclusion validity

Thorough statistical investigations of model assumptions were performed before applying the most suitable statistical test with the greatest statistical power, given the properties of the acquired data. That course of action is considered to be highly beneficial to the conclusion validity of this study. The decision to retain outliers might be a threat to conclusion validity, but all outliers appear to be valid measurements, so deleting them would pose a threat to conclusion validity as well.

6 Related work

We are not aware of any empirical studies evaluating the understandability related to the formal modeling of compliance specifications in particular. There exists, however, related work focusing on similar issues.

Related studies in the field of business process management are concerned with declarative workflows (cf. van der Aalst [1]), which use graphical patterns with underlying formal representations in LTL (cf. Montali [56]) or event calculus (cf. Montali et al. [57]). Haisjackl and Zugal [35] investigated differences between textual and graphical declarative workflows in an empirical study with 9 participants. The descriptive statistics of this study indicate that the graphical representation is advantageous in terms of perceived understandability, error rate, duration and mental effort. The lack of hypothesis testing and the small number of participants are severe threats to the validity of this study. Zugal et al. [87] investigated the understandability of hierarchies on the basis of the same data set. The results of their research indicate that hierarchies must be handled with care. While information hiding and improved pattern recognition are considered to be positive aspects of hierarchies since the mental effort for understanding a process model is lowered, the fragmentation of processes by hierarchies might lower overall understandability of the process model. Another important finding of their study is that users appear to approach declarative process models in a sequential manner even if the user is definitely not biased by previous experiences with sequential/imperative business process models. They conclude that the abstract nature of declarative process models does not seem to fit the human way of thinking. Moreover, they observed that the participants of their study tried to reduce the number of constraints to consider by putting away sheets that describe irrelevant sub-processes or by using the hand to hide parts of the process model that are irrelevant. Like in the previously discussed study, it must be assumed that the validity of this study is strongly limited by the extremely small sample size. Haisjackl et al. [36] investigated the users’ understanding of declarative business process models, again on the same data set. As in the previously mentioned study, they point out that users tend to read such models sequentially despite the declarative nature of the approach. The larger a model, the more often hidden dependencies are overlooked, which indicates that increasing numbers of constraints lower understanding. Moreover, they report that single constraints are overall well understood, but there seem to be problems with understanding the precedence constraint. As the authors point out, this kind of confusion could be related to the graphical arrow-based representation of the constraints where subtle differences decide on the actual meaning. That is, the arrow could be confused with a sequence flow as present in flow-driven, sequential business processes. As previously stated for the other two studies that are based on the same data set, the validity of this study is possibly strongly affected by the small sample size. De Smedt et al. [26] tried to improve the understandability of declarative business process models by explicitly revealing hidden dependencies. They conducted an experiment with 95 students. The result suggests that explicitly showing hidden dependencies enables a better understandability of declarative business process models. Pichler et al. [64] compared the understandability of imperative and declarative business process modeling notations. The results of this study are in line with Zugal et al. [87] and suggest that imperative process models are significantly more understandable than declarative models, but the authors also state that the participants had more previous experience with imperative process modeling than with declarative process modeling. The small sample size (28 participants) is a threat to validity of this study. Rodrigues et al. [69] compared the understandability of textual and graphical BPMN [59] business process descriptions with 32 students and 41 practitioners. They conclude that experienced users understand a process better if it is presented by a graphical BPMN process model whereas for inexperienced users there is no difference in understandability between the textual and graphical process descriptions. Jost et al. [46] compared the intuitive understanding of process diagrams with 103 students. They conclude that UML activity diagrams provide a higher level of understandability than BPMN diagrams and EPCs.

Software architecture compliance, which focuses on the alignment of software architecture and implementation, and requirements engineering are also related to this study. Czepa et al. [21] compared the understandability of three languages for behavioral software architecture compliance checking, namely the natural language constraint (NLC) language, the cause–effect constraint (CEC) language and the temporal logic pattern-based constraint (TLC) language, in a controlled experiment with 190 participants. The NLC language simply refers to using the English language for documenting software architectures. CEC is a high-level structured architectural description language that abstracts EPL. It supports the nesting of cause parts, which observe an event stream for a specific event pattern, and effect parts, which can contain further cause–effect structures and truth value change commands. TLC is a high-level structured architectural description language based on PSP. Interestingly, the statistical inference of this study suggests that there is no difference in understandability of the tested languages. This could indicate that the high-level abstractions employed bring those structured languages closer to the understandability of unstructured natural language architecture descriptions. Moreover, it might also suggest that natural language leaves more room for ambiguity, which is detrimental for its understanding. Potential limitations of that study are that its tasks are based on common architectural patterns/styles (i.e., a participant possibly recognizes the meaning of a constraint more easily by having knowledge of the related architectural pattern) and the rather small set of involved patterns (i.e., only very few patterns of PSP were necessary to represent the architecture descriptions). A controlled experiment carried out by Heijstek et al. [38] with 47 participants focused on finding differences in understanding of textual and graphical software architecture descriptions. Interestingly, participants who predominantly used textual architecture descriptions performed significantly better, which suggests that textual architectural descriptions could be superior to their graphical counterparts. An eye-tracking experiment with 28 participants by Sharafi et al. [74] on the understandability of graphical and textual software requirement models did not reveal any statistically significant difference in terms of correctness of the approaches. The study also reports that the response times of participants working with the graphical representations were slower. Interestingly though, the participants preferred the graphical notation. Hoisl et al. [40] conducted a controlled experiment on three notations for scenario-based model tests with 20 participants. In particular, they evaluated the understandability of a semi-structured natural language scenario notation, a diagrammatic scenario notation and a fully structured textual scenario notation. According to the authors, the purely textual semi-structured natural language scenario notation is recommended for scenario-based model tests, because the participants of this group were able to solve the given tasks faster and more correctly. That is, the study might indicate that a textual approach outperforms a graphical one for scenario-based model tests, but the validity of the experiment is limited by the small sample size and the absence of statistical hypothesis testing.

7 Conclusion and future work

The main goal of this empirical study was to test and compare the understandability of representative approaches for the formal modeling of compliance specifications. The experiment was conducted with 215 participants in total. Major differences were found especially in the semantic correctness of the approaches. Since formalizations in the property specification patterns (PSP) were overall more correct than in linear temporal logic (LTL) and event processing language (EPL), there is evidence that the pattern-based PSP approach provides a higher level of understandability. More advanced users, however, seem able to cope equally well with PSP and EPL. That is, for more advanced users, these approaches can be used interchangeably, depending on which fits best to a concrete domain or task. Moreover, EPL provides a higher level of understandability than LTL. Therefore, EPL is well suited to situations that demand runtime verification and in which the set of available patterns in PSP is not sufficient to model a compliance specification, as well as to aiding the creation of new patterns with underlying LTL formalizations (cf. Czepa et al. [18, 19]).

Moreover, the results are overall in line with two controlled experiments with 216 participants in total on the understandability of already existing formal specifications in LTL, EPL and PSP (cf. Czepa and Zdun [16]). In contrast to the current study, which focuses on the formal modeling of compliance specifications, no major differences between novice and moderately advanced users were found in the understandability of existing specifications. Interestingly, in those experiments the response times differed significantly between the experimental groups in most cases, an effect which appears to be absent during modeling.

Opportunities for further empirical research include the consideration of an extended set of representations, for example, event calculus (cf. Kowalski and Sergot [52]) or Declare (cf. Pešić and van der Aalst [61]), and the study of the understandability construct in different settings with other user groups (e.g., business administrators or professional software engineers). Moreover, besides the understandability construct, additional metrics such as changeability (i.e., “Is one representation easier to change when taking new/amended compliance specifications into account?”) and verifiability (i.e., “Are there differences between the representations when it comes to assessing whether a given compliance specification is fully covered?”) could be investigated.