Modeling compliance specifications in linear temporal logic, event processing language and property specification patterns: a controlled experiment on understandability

Mature verification and monitoring approaches, such as complex event processing and model checking, can be applied for checking compliance specifications at design time and runtime. Little is known about the understandability of the different formal and technical languages associated with these approaches. This uncertainty regarding understandability might be a major obstacle for the broad practical adoption of those techniques. This article reports a controlled experiment with 215 participants on the understandability of modeling compliance specifications in representative modeling languages, namely linear temporal logic (LTL), the complex event processing-based event processing language (EPL) and property specification patterns (PSP). The formalizations in PSP were overall more correct. That is, the pattern-based approach provides a higher level of understandability than EPL and LTL. More advanced users, however, seemingly are able to cope equally well with PSP and EPL in modeling compliance specifications.

Enron and WorldCom). Basel III [4] has been established in response to weaknesses in financial regulation responsible for the financial crisis in 2007/2008. Another example of heavily regulated domains is the construction industry. Compliance rules in this domain are often related to occupational safety and health. For example, certain precautions and safe practices are required if a lead contamination is present or to be presumed in buildings built before 1978 that undergo renovation (cf. United States Environmental Protection Agency's Lead-Based Paint Renovation, Repair and Painting Rule [83]). A third example is the healthcare sector. Processes in hospitals must comply with state-of-the-art medical knowledge and treatment procedures (e.g., Rovani et al. [71]).
From cooperations with industry partners (e.g., Tran et al. [80]), their customers and other company representatives at conferences and workshops, we were able to gain valuable insights into how compliance rules are currently handled in practice. Most often, compliance documents are transformed to internal policies first. They are often described in natural language, but there is also a shift toward structured approaches like the Semantics of Business Vocabulary and Business Rules (SBVR) standard [60]. Later, these internal policies are taken into account in business process models (e.g., BPMN [59]) or other behavioral models (e.g., UML activity diagrams), and/or they are hard-coded in a programming language. That often leads to consistency problems and to poor maintainability and traceability between compliance specifications, internal policies, models and the source code. This is especially the case when compliance specifications change frequently. Additionally, practitioners report that it often takes a long time until new compliance specifications are actually supported by their software. Often the compliance rule has long been obsolete before the implementation is ready (cf. [20,48]). Consequently, the industry shows a strong interest in approaches that are applicable in practice. Such approaches should support a comprehensible, fast and accurate adoption of compliance specifications as well as their automated enactment and verification. All modeling languages that we study in this article are well suited for automated computer-aided compliance checking or monitoring. Nonetheless, companies are still often reluctant to expose their customers or employees to such approaches. In discussions with industry partners (cf. [79,81]), uncertainty regarding how understandable these approaches are became evident. This uncertainty was stated as one of the major reasons for the reluctance in practical adoption.

Problem statement
Most existing work on design time verification and runtime monitoring focuses on technical contributions rather than empirical contributions. From the perspective of a potential end user who has to implement compliance specifications, the understandability of an offered formal specification language appears to be of major interest. To the best of our knowledge, there are no empirical studies that investigate and compare the understandability of representative languages with respect to the formal modeling of compliance specifications. In particular, the following representative specification languages are considered in this empirical study:
- Linear temporal logic (LTL) was proposed in 1977 by Pnueli [65]. LTL is a popular way for defining compliance rules according to Reichert and Weber [66]. In general, LTL is a widely used specification language commonly applied in model checking (cf. Cimatti et al. [12] for NuSMV, Blom et al. [9] for LTSmin, Holzmann [42] for SPIN) and runtime monitoring by non-deterministic finite automata (cf. De Giacomo and Vardi [23] and De Giacomo et al. [25]).
- Event processing language (EPL) is the query language of the open-source complex event processing engine Esper. EPL is well suited as a representative for CEP query languages as it supports common CEP query language concepts, such as leads-to (sequence, followed-by) and every (each) operators, that are present in many CEP query languages and engines (e.g., Siddhi and TESLA [15]). Several existing studies on compliance monitoring make use of EPL (cf. Awad et al. [2], Holmes et al. [41] and Tran et al. [82]).
- Property specification patterns (PSP) are a collection of recurring temporal patterns proposed by Dwyer et al. [27,28]. This pattern-based approach abstracts underlying technical and formal languages, most notably LTL and CTL (Computation Tree Logic; cf. Clarke et al. [13]). Numerous existing approaches are based on PSP. Among them are the Compliance Request Language proposed by Elgammal et al. [29] and the declarative business process approach Declare proposed by Pešić et al. [61].
In previous controlled experiments carried out by Czepa and Zdun [17], the understandability of already existing formal specifications in those languages was studied. Those experiments can be seen as the first step toward studying the understandability of those languages. To further study the understandability of these languages, it is crucial to consider the modeling itself as well.

Research objectives
This empirical study has the research objective to investigate the understandability construct of representative languages with regard to the modeling of compliance specifications. The understandability construct focuses on the degree of correctness achieved and on the time spent on modeling compliance specifications.
The experimental goal using the goal template of the Goal Question Metric proposed by Basili et al. [5] is stated as follows: Analyze LTL, PSP and EPL for the purpose of their evaluation with respect to their understandability related to modeling compliance specifications from the viewpoint of the novice and moderately advanced software engineer, designer or developer in the context/environment of the Software Engineering 2 Lab and the Advanced Software Engineering Lab courses at the Faculty of Computer Science of the University of Vienna.
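Based upon the stated goal, questions concerning understandability were generated as shown in Table 1.
Q1 How understandable are the tested approaches for participants at the bachelor level (attending the Software Engineering 2 Lab course)?
Q2 Are there differences in understandability between the tested approaches for participants at the bachelor level (attending the Software Engineering 2 Lab course)?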
Q3 How understandable are the tested approaches for participants at the master level (attending the Advanced Software Engineering Lab course)?
Q4 Are there differences in understandability between the tested approaches for participants at the master level (attending the Advanced Software Engineering Lab course)?
Q5 How understandable are the tested approaches for participants with industrial working experience?
Q6 Are there differences in understandability between the tested approaches for participants with industrial working experience?
The understandability is measured by three dependent variables, namely the syntactic correctness and semantic correctness achieved in trying to formally model compliance specifications as well as the response time. Correctness and response time are commonly used to measure the construct understandability, for example, in empirical studies by Feigenspan et al. [31] and Hoisl et al. [40]. The study design enables a more fine-grained analysis of the correctness by differentiating between syntactic and semantic correctness as suggested by numerous existing studies, such as Ferri et al. [32], Hindawi et al. [39] and Harel and Rumpe [37].
Besides the main research goal, which focuses on understandability, this work addresses subjective aspects, namely the perceived ease of application and the perceived correctness, which are measures of self-assessment and not directly related to the understandability construct.

Guidelines
This work follows the guidelines for reporting experiments in empirical software engineering by Jedlitschka et al. [45]. These guidelines integrate, among others, the "Preliminary guidelines for empirical research in software engineering" by Kitchenham et al. [50] and standard books on empirical software engineering by Wohlin et al. [86] and Juristo and Moreno [47]. The "Robust Statistical Methods for Empirical Software Engineering" article by Kitchenham et al. [49] had a strong impact on the statistical evaluation of the data in this article.

Table 2 Informal meanings of LTL operators
Text notation | Symbol notation | Meaning
Gψ | □ψ | ψ must be true in every point in time
Fψ | ♦ψ | ψ must be true at some future point in time
ψ U φ | | ψ must remain true at least until the point in time when φ becomes true
ψ R φ | | ψ must remain true at least until and including the point in time when φ becomes true
Xψ | ○ψ | ψ must be true at the next point in time

Background
This section provides a brief introduction to the specification languages used in this study. Readers already familiar with one or more of the discussed approaches may consider skipping parts of this section. Examples of compliance specifications formalized in all three representations are available in "Appendix A." These examples are based on the experimental tasks (cf. Sect. 3.3) of this experiment.

Linear Temporal Logic (LTL)
Propositional logic is not expressive enough to describe temporal properties, so a logic called linear temporal logic (LTL) for reasoning over linear traces with the temporal operators G (or □) for "globally" and F (or ♦) for "finally" was proposed by Pnueli [65]. Additional temporal operators are U for "until," W for "weak until," R for "release" and X (or ○) for "next." The meaning of these operators is described in Table 2. LTL formulas are composed of the aforementioned temporal operators, atomic propositions (the set AP) and the Boolean operators ∧ for "and," ∨ for "or," ¬ for "not" and → for "implies" (cf. Baier and Katoen [3]).
An LTL formula is inductively defined as follows: For every a ∈ AP, a is an LTL formula. If ψ and φ are LTL formulas, then so are Gψ (or □ψ), Fψ (or ♦ψ), ψ U φ, ψ W φ, ψ R φ, Xψ (or ○ψ), ¬ψ, ψ ∧ φ, ψ ∨ φ and ψ → φ.
The semantics of LTL over infinite traces is defined as follows: LTL formulas are interpreted over infinite words over the alphabet 2^AP. The alphabet consists of all possible propositional interpretations of the propositional symbols in AP. π(i) denotes the state of the trace π at time instant i. π, i ⊨ ψ means that a trace π at time instant i satisfies the LTL formula ψ. For atomic propositions, π, i ⊨ a, for a ∈ AP, iff a ∈ π(i).
For the definition of the semantics of LTL over finite traces, we refer the interested reader to the work of De Giacomo and Vardi [23] and De Giacomo et al. [25].
In model checking, LTL formulas commonly have two possible truth value states, namely true and false. When monitoring a compliance specification in a running system, it might be of interest not only whether the specification is satisfied or violated but also whether further state changes are possible that could resolve or cause a violation of it. That is, the runtime state of a specification is either temporary or permanent. Consequently, an LTL specification at runtime is either temporarily satisfied, temporarily violated, permanently satisfied or permanently violated (cf. Bauer et al. [6,7]). Several existing studies make use of the concept of four LTL truth value states (cf. Pešić et al. [62], De Giacomo et al. [24] and Maggi et al. [54]).

Event Processing Language (EPL)
This section discusses the event processing language (EPL) [30] and how it can be applied for runtime monitoring of compliance specifications. An EPL-based specification consists of an initial truth value, which is either temporarily satisfied or temporarily violated, and one or more query-listener pairs. A query-listener pair causes a truth value change in the specification as soon as a matching event pattern is observed in the event stream. Consequently, an EPL-based compliance specification always consists of EPL queries that are composed of EPL operators and listeners that cause truth value changes to temporarily satisfied, temporarily violated, permanently satisfied or permanently violated, as already discussed for LTL in Sect. 2.1. The truth value state of the specification is updated by a positive match of the related expression in the event stream. Based on the notation suggested by Czepa et al. [18,19], the short notation <EPL query> ==> <truth value> is used for an EPL query-listener pair responsible for changing the truth value of a compliance rule. Obviously, further truth value changes are not possible once a permanent state, namely either permanently violated or permanently satisfied, has been reached. According to the EPL reference [30], the semantics is given as shown in Table 3.

Table 3 (excerpt)
… | … | The first e1 must be observed and only then is e2 matched. Intuitively, the whole expression is matched once e1 is followed by e2 at the occurrence of e2
until | e1 until e2 | Matches the expression e1 until e2 occurs. In practice, this operator is commonly used in the expression not e1 until e2 that demands the absence of e1 before the occurrence of e2
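The interplay of an initial truth value and query-listener pairs can be illustrated with a small sketch. It does not use Esper or real EPL syntax: the names `Specification`, `occurs` and `not_until` are hypothetical stand-ins, queries are modeled as plain predicates over the events seen so far, and the rule that the first matching pair wins is an assumption made for simplicity. The example rule is "a2 must not happen unless a1 has already happened."

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

TS, TV = "temporarily satisfied", "temporarily violated"
PS, PV = "permanently satisfied", "permanently violated"

# Hypothetical stand-in for an EPL query: a predicate over the events seen so far.
Query = Callable[[List[str]], bool]


def occurs(e: str) -> Query:
    """Matches when the most recent event is e."""
    return lambda trace: trace[-1] == e


def not_until(e1: str, e2: str) -> Query:
    """Matches at the occurrence of e2 if e1 was absent before it."""
    return lambda trace: trace[-1] == e2 and e1 not in trace[:-1]


@dataclass
class Specification:
    state: str                       # initial truth value (TS or TV)
    pairs: List[Tuple[Query, str]]   # query-listener pairs: query ==> truth value
    trace: List[str] = field(default_factory=list)

    def on_event(self, event: str) -> str:
        self.trace.append(event)
        if self.state in (PS, PV):   # permanent states cannot change anymore
            return self.state
        for query, truth_value in self.pairs:
            if query(self.trace):    # assumption: the first matching pair wins
                self.state = truth_value
                break
        return self.state


def run(events: List[str]) -> str:
    """Monitor 'a2 must not happen unless a1 has already happened'."""
    spec = Specification(TS, [(occurs("a1"), PS), (not_until("a1", "a2"), PV)])
    for e in events:
        spec.on_event(e)
    return spec.state
```

With this sketch, `run(["a1", "a2"])` ends in the permanently satisfied state, whereas `run(["a2", "a1"])` ends in the permanently violated state because a2 was observed before any a1.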

Property specification patterns (PSP)
Dwyer et al. proposed the property specification patterns (PSP) [27,28], a collection of recurring specification patterns. For each pattern, there exist transformation rules to underlying formal representations, including LTL and CTL. The patterns are categorized into Occurrence Patterns and Order Patterns as shown in Tables 4 and 5, respectively. Figure 1 shows the area of effect of available scopes, whereas Table 6 discusses their meaning. The available runtime states of PSP specifications are no different from those of LTL and EPL specifications (cf. Sects. 2.1 and 2.2), namely temporarily satisfied, temporarily violated, permanently satisfied and permanently violated.
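To make the idea of transformation rules concrete, the following sketch expands a few patterns with global scope into LTL formula strings. The mappings are the commonly cited global-scope mappings from the pattern catalog of Dwyer et al. and are not taken from this article; the dictionary `GLOBAL_SCOPE_LTL`, the function `to_ltl` and the ASCII operator spelling (`!`, `->`) are assumptions chosen for illustration.

```python
# Global-scope LTL templates for a few property specification patterns.
# Assumption: ASCII notation with G, F, W as in the text and ! / -> for
# negation and implication.
GLOBAL_SCOPE_LTL = {
    "absence": "G (!{a})",                # a never occurs
    "existence": "F ({a})",               # a occurs
    "universality": "G ({a})",            # a always holds
    "precedence": "(!{b}) W ({a})",       # a precedes b
    "response": "G (({a}) -> F ({b}))",   # a leads-to b
}


def to_ltl(pattern: str, **events: str) -> str:
    """Expand a pattern name and its event placeholders into an LTL string."""
    return GLOBAL_SCOPE_LTL[pattern].format(**events)
```

For example, `to_ltl("response", a="p", b="w")` yields `G ((p) -> F (w))`, which hides the nesting of temporal operators behind the single pattern phrase "p leads-to w."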

Experiment planning
This section describes the outcome of the experiment planning phase, and it provides all information that is required for a replication of the study.

Tables 4 and 5 (excerpt)
… | … | To describe a portion of a system's execution that contains an instance of certain events or states
Bounded existence | a occurs at most n times | To describe a portion of a system's execution that contains at most a specified number of instances of a designated state transition or event
… | … | To describe a relationship between an event/state a and a sequence of events/states (b, c) in which the occurrence of b followed by c within the scope must be preceded by an occurrence of a within the same scope
2 Stimulus-1 Response Chain | (a, b) leads-to c | To describe a relationship between a stimulus sequence (a, b) and a response event c in which the occurrence of the stimulus events must be followed by an occurrence of the response event within the scope
1 Stimulus-2 Response Chain | a leads-to (b, c) | To describe a relationship between a stimulus event a and a sequence of two response events (b, c) in which the occurrence of the stimulus event must be followed by an occurrence of the sequence of response events within the scope

Goals
The primary goal of the experiment is measuring the construct understandability of representative languages that are suitable for modeling compliance specifications. This construct is defined by the syntactic correctness, semantic correctness and response time of the answers given by the participants. This study differentiates between syntactic and semantic correctness as it enables a more fine-grained analysis. This is in line with Chomsky [11], who stressed that the study of syntax must be independent from the study of semantics. Numerous existing studies differentiate between syntactic and semantic correctness (cf. Ferri et al. [32], Hindawi et al. [39] and Harel and Rumpe [37]). Notably, an LTL formula can be syntactically entirely correct without capturing the desired meaning. For example, the specification "activity 2 must not happen unless activity 1 has already happened" is not covered at all in a semantic way by the syntactically correct formula "F activity1 ∧ F activity2." In contrast, the formula "¬activity2 U activity1" is both syntactically and semantically correct. In addition to the understandability construct, the experiment aims at studying the perceived ease of application of the languages and the perceived correctness of the formalized compliance specifications.

Table 6 (excerpt)
… | … | p must hold between every s1 (i.e., starting the scope) that is followed by s2 (i.e., closing the scope)
after-until | after s1 until | p must hold after every s1 (i.e., starting the scope) by no later than s2 (i.e., closing the scope)
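The difference between the two example formulas can be checked mechanically. The sketch below evaluates formulas over a finite trace, which is an assumption made to keep the example small (the semantics given in this article are over infinite traces). The tuple encoding of formulas and the function `holds` are hypothetical helpers introduced for illustration.

```python
def holds(f, trace, i=0):
    """Evaluate formula f at position i of a finite trace (a list of sets)."""
    op = f[0]
    if op == "ap":    # atomic proposition
        return f[1] in trace[i]
    if op == "not":
        return not holds(f[1], trace, i)
    if op == "and":
        return holds(f[1], trace, i) and holds(f[2], trace, i)
    if op == "F":     # at some point from position i onward
        return any(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "G":     # at every point from position i onward
        return all(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "U":     # f[2] eventually holds and f[1] holds at all points before
        return any(
            holds(f[2], trace, j)
            and all(holds(f[1], trace, k) for k in range(i, j))
            for j in range(i, len(trace))
        )
    raise ValueError(f"unknown operator: {op}")


a1, a2 = ("ap", "activity1"), ("ap", "activity2")
both_eventually = ("and", ("F", a1), ("F", a2))   # F activity1 ∧ F activity2
not_a2_until_a1 = ("U", ("not", a2), a1)          # ¬activity2 U activity1

violating = [{"activity2"}, {"activity1"}]   # activity 2 happens first
compliant = [{"activity1"}, {"activity2"}]   # activity 1 happens first
```

On the violating trace, `holds(both_eventually, violating)` is true although the specification is breached, while `holds(not_a2_until_a1, violating)` is false. On the compliant trace, both formulas evaluate to true.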

Experimental units
All 215 participants of the experiment are students who enrolled in the courses "Software Engineering Lab (SE2)" and "Advanced Software Engineering Lab (ASE)" at the Faculty of Computer Science, University of Vienna, Austria. Two kinds of participants can be differentiated:
- 149 participants of the bachelor-level course SE2 are used as proxies for novice software engineers, designers or developers.
- 66 participants of the master-level course ASE are used as proxies for moderately advanced software engineers, designers or developers.
Using students as proxies for non-expert users is not an issue according to Kitchenham et al. [50]. Other studies even suggest that students can be used as proxies for experts under certain circumstances (cf. Höst et al. [43], Runeson [72], Svahnberg et al. [78] and Salman et al. [73]). As an incentive for participation and proper preparation, up to 10 bonus points (10% of total course points) were awarded based on the participant's performance in the experiment. All participants were randomly allocated to experiment groups.

Experimental material and tasks
In total, the experiment comprised five distinct tasks stemming from three different domains, as shown in Table 7. Tasks 1 and 2 are related to compliance in the context of lending, Task 3 focuses on compliance regarding hospital processes, and Tasks 4 and 5 are based on compliance specifications in the construction industry. Each task was presented to the participants by stating first the context, then the specification and last the available elements that are to be used during formal modeling of the specification. For an example of how the experimental tasks were presented to the participants, see Fig. 2. The full experimental material is available online (cf. Czepa et al. [22]). For sample solutions of all experimental tasks, see "Appendix A." It is important to note that these sample solutions show just one way to model the compliance specifications. In the grading process, each proposed solution was carefully assessed under constant consideration that the sample solution might not be the only way to correctly formalize the specification.

Hypotheses, parameters and variables
PSP abstracts underlying formal representations, such as LTL formulas, by high-level patterns with the intention to facilitate reuse and to enable ease of use. That is, the pattern representations are assumed to provide a better understandability than their underlying LTL formulas. EPL-based constraints are composed of an initial truth value and one or more query-listener pairs that change the truth value state. In contrast to LTL where meaning is encoded in a formula, different concerns, namely defining the initial truth value and change criteria for the truth value, are separated from each other in EPL-based constraints. This separation of concerns is assumed to facilitate the understandability of EPL-based constraints as opposed to LTL formulas where this separation is not present.
Consequently, we hypothesized that PSP, as a highly abstract pattern language, is easier to understand than LTL and EPL and that EPL, due to separation of concerns, is easier to understand than LTL. Accordingly, the following hypotheses for the controlled experiment were formulated:
- H_{0,1}: There is no difference in terms of understandability between PSP and LTL.
The construct understandability is measured by three interval-scaled dependent variables, namely:
- the syntactic correctness achieved in trying to formally model the compliance specifications,
- the semantic correctness achieved in trying to formally model the compliance specifications,
- the response time, which is the time it took to complete the experimental tasks.
In addition, there are hypotheses that are concerned with the participants' opinion on the languages under investigation, namely:
- H_{0,4}: There is no difference in terms of perceived correctness between PSP and LTL.
- H_{A,4}: PSP has a higher level of perceived correctness than LTL.
- H_{0,5}: There is no difference in terms of perceived correctness between PSP and EPL.
- H_{A,5}: PSP has a higher level of perceived correctness than EPL.

Task 2
Use your constraint language to describe the requirement below. It might be necessary to use multiple constraints to represent the requirement. Just write "C1:" to start your first constraint, "C2:" for the second, and so on. Use the given letters (e.g., p for Check Customer Privilege) to refer to a task in your constraint(s). Please always keep records of the time when working on this task, and don't forget to answer the two questions below at the completion of this task.

Start Time
End Time

Context:
Request for a loan (Kreditantrag)

Requirement:
"The checking of the customer bank privilege is followed by checking of the credit worthiness. Both activities must take place before determining the risk level of the loan application."

Tasks:
p = Check Customer Privilege
w = Check Credit Worthiness
e = Evaluate Loan Risk

Please fill out the survey at the completion of this task:
1. I think that my transformation of the requirement to the constraint language is correct.
2. It has been easy for me to create the constraint(s) for the requirement.

The dependent variables associated with these hypotheses are ordinal scaled since the data were gathered by agree-disagree scales. In accordance with the results of a study by Revilla et al. [68], each scale had five categories.

Experiment design and execution
According to Wohlin et al. [86], "it is important to try to use a simple design and try to make the best possible use of the available subjects." For that reason, a completely randomized experiment design with one alternative per experimental unit was used. That is, each participant is randomly assigned to exactly one experiment group. This assignment was fully automated and unbiased.
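A completely randomized design with one alternative per experimental unit can be sketched in a few lines. The function `assign_groups`, the round-robin balancing after shuffling and the `seed` parameter are assumptions for illustration; the article does not describe the tool that performed the automated assignment.

```python
import random


def assign_groups(participants, groups=("LTL", "EPL", "PSP"), seed=None):
    """Randomly assign each participant to exactly one experiment group.

    Assumption: participants are shuffled and then dealt round-robin, so
    group sizes differ by at most one.
    """
    rng = random.Random(seed)   # a seed makes the allocation reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {p: groups[i % len(groups)] for i, p in enumerate(shuffled)}


# Hypothetical participant identifiers for a course with 149 students.
allocation = assign_groups([f"student-{n}" for n in range(149)], seed=1)
```

Every identifier appears exactly once in `allocation`, which mirrors the requirement that each participant works with only one of the three languages.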
Preparation documents were distributed to the participants one week before the experiment run. In these documents, the basics of the approaches were discussed, and the participants were encouraged to prepare for the experiment by applying the assigned behavioral constraint representation before the experiment session. To avoid bias, all three preparation documents are similar in length and depth. The approaches were presented in an approachable manner to the participants as suggested by existing research on teaching undergraduate students in theoretical computer science, formal methods and logic (cf. Habiballa and Kmeť [34], Knobelsdorf and Frede [51], Carew et al. [10] and Spichkova [77]). The training material used is available online (cf. Czepa et al. [22]).

Procedure
To ensure a smooth procedure and to avoid unnecessary stress, the preparation document informed the participants about the procedure on the experiment day in as much detail as possible. Seating arrangements were made to limit chances for misbehavior, and the participants were instructed how to find a suitable seat. The participants were allowed to use printouts of the preparation material and notes at their own discretion. After a brief discussion of the contents and structure of the experiment document by the experimenters, the participants started trying to solve the experimental tasks. The duration of the experiment was limited to 90 min. Due to organizational reasons, the experiment was done on paper, and time record keeping was the responsibility of each participant (please see Sect. 5.2 for a discussion of this potential threat to validity). After experiment execution, the answers given were evaluated. For that purpose, a method proposed by Lytra et al. [53] was applied, which comprises the independent evaluation of the answers by three experts, and a discussion of large differences in grading until a consensus is achieved. The attempted formalization in each experimental task was graded independently by the first, second and third author, who are experts in the investigated languages. To mitigate the risk of grading bias, each expert graded the participants' answers in random order. Figures 3 and 4 depict the grading process schematically from the individual and overall perspective, respectively. This evaluation of more than a thousand distinct answers comprising approximately 17,000 constraints took about half a year alongside the authors' normal responsibilities such as teaching and other research. 
All other given answers, which are related to previous knowledge, time records and agree-disagree scale responses, were digitized and double-checked subsequently.
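The grading step, in which three independent grades are compared and large differences trigger a discussion, can be sketched as follows. The threshold `max_spread`, the percentage scale and the use of the mean as the combined grade are assumptions for illustration; the article does not state how a "large difference" was quantified or how agreeing grades were combined.

```python
from statistics import mean


def consolidate(grades, max_spread=20.0):
    """Combine independent expert grades (in percent) for one answer.

    Returns (grade, needs_discussion). Assumption: grades that differ by
    more than max_spread percentage points are flagged for discussion and
    receive no combined grade until a consensus has been reached.
    """
    spread = max(grades) - min(grades)
    if spread > max_spread:
        return None, True
    return mean(grades), False
```

For instance, `consolidate([60, 65, 70])` returns a combined grade of 65 without flagging the answer, whereas `consolidate([20, 60, 70])` flags the answer for discussion among the graders.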

Analysis
This section is concerned with the treatment and statistics of the data.

Data set preparation
To preserve the integrity of the acquired data, it was necessary to drop potentially unreliable items. In total, the data of eight participants were not considered in the statistical evaluations. Table 8 summarizes all dropped participants including the reasons for non-consideration.

Descriptive statistics
In this section, the acquired data (cf. Czepa et al. [22]) are analyzed with the help of descriptive statistics. Table 9 shows the number of observations, central tendency and dispersion of the dependent variables syntactic correctness, semantic correctness and response time per group. In the bachelor-level course Software Engineering 2, the sample size is relatively large and evenly distributed (49 : 47 : 49). In the master-level course Advanced Software Engineering, there are fewer than half as many observations. Unfortunately, the number of participants of the group with the smallest number of observations, namely PSP, was further diminished by the exclusion of three participants (cf. Sect. 4.1). In consequence, the distribution in the ASE course is 21 : 17 : 24. The median and mean correctness values of the LTL groups in both SE2 and ASE are smaller than those of the other two groups. In SE2, the mean syntactic correctness of the LTL group is 56.52%, thus about 5 percentage points less than in the EPL group (61.82%) and about 12 percentage points less than in the PSP group (68.64%), and the mean semantic correctness of the LTL group is at 28.49%, so about 10 percentage points below the EPL group (38.20%) and 22 percentage points below the PSP group (50.19%). In ASE, the mean syntactic correctness of the LTL group is 57.01%, thus about 8 percentage points less than in the PSP group (65.13%) and about 15 percentage points less than in the EPL group (71.91%). While the PSP group overall achieved a higher syntactic and semantic correctness than the EPL group in SE2, this ranking is reversed in the ASE course where EPL participants overall achieved a higher syntactic and semantic correctness than their colleagues of the PSP group. Skew is a measure of the shape of a distribution. 
A positive skew value indicates a right-tailed distribution (e.g., more cases of low correctness than high correctness), a negative skew value indicates a left-tailed distribution (e.g., more cases of high correctness than low correctness), and a skew value close to zero indicates a symmetric distribution. Differences in skew are, for example, present
- between the semantic correctness distributions of LTL (0.75 indicating that the mass of the distribution is concentrated at lower levels of correctness) and PSP (−0.08 indicating a rather symmetric distribution) in SE2,
- between the syntactic correctness distributions of LTL (−0.15 indicating a curve that leans slightly to the right) and EPL (−0.9 indicating a distribution with only few measurements in lower correctness ranges) in ASE,
- between the semantic correctness distributions of LTL (0.6 indicating higher densities in lower correctness ranges) and EPL (−0.37 indicating higher densities in higher correctness ranges) in ASE, and
- between the response time distributions of LTL (0.42 indicating a left-leaning curve) and PSP (−0.61 indicating a right-leaning curve) in ASE.
Kurtosis is another measure for the shape of a distribution, which focuses on the general tailedness. Positive kurtosis values indicate heavier tails than those of a normal distribution, whereas negative kurtosis values indicate lighter tails. The most severe difference in kurtosis is present between the syntactic correctness distributions of the LTL group (1.22) and PSP group (−1.02).
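The two shape measures can be computed from the central moments of a sample. The sketch below uses the simple moment-based estimators without bias correction, which is an assumption; the article does not state which estimator produced the reported values, so results from a statistics package may differ slightly. The function name `shape_statistics` and the sample data are hypothetical.

```python
def shape_statistics(xs):
    """Return (skewness, excess kurtosis) using moment-based estimators.

    skewness        = m3 / m2**1.5
    excess kurtosis = m4 / m2**2 - 3   (zero for a normal distribution)
    where mk is the k-th central moment of the sample.
    """
    n = len(xs)
    avg = sum(xs) / n
    m2 = sum((x - avg) ** 2 for x in xs) / n
    m3 = sum((x - avg) ** 3 for x in xs) / n
    m4 = sum((x - avg) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


# Hypothetical correctness scores: most values low, one high value on the right.
right_tailed = [0, 0, 0, 0, 10]
symmetric = [1, 2, 3, 4, 5]
```

The right-tailed sample yields a positive skewness, in line with the reading of positive skew given above, while the symmetric sample yields a skewness of zero.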
So far, the dependent variables were analyzed separately per course group, which reflects the participants' academic level of progression. Next, the dependent variables are investigated focusing on participants with industrial working experience. Table 10 summarizes the descriptive statistics of the dependent variables when focusing on participants with industrial working experience of one year and above. Based on the demographic data collected (cf. "Appendix D"), we consider this subset of participants to be close to the population of industrial practitioners with basic to modest experience. The mean syntactic correctness in the LTL group (58.65%) is about 7 to 8 percentage points lower than in the PSP (66.79%) and EPL (66.01%) groups. The PSP group achieved the highest degree of semantic correctness (48.58%), closely followed by the EPL group (44.46%). The LTL group achieved 30.51% semantic correctness, which is noticeably lower than in the two other groups. Present differences in skew and kurtosis are indications of differences in central location and distribution shape.
For additional descriptive statistics of the dependent variables syntactic correctness, semantic correctness and response time, we refer the interested reader to "Appendix B." With regard to the stacked bar chart (cf. Bryer and Speerschneider [44]) in Fig. 5a showing the perceived correctness in SE2, the share of (strongly) agree responses to the statement "I think that my transformation of the requirement to the constraint language is correct" is 2% higher in PSP (37%) than in the other two groups, and the share of (strongly) disagree answers is 22% in PSP while it is higher in LTL (37%) and EPL (33%). With 41%, the share of neutral answers is largest in PSP. In ASE (cf. Fig. 5b), the participants appear to be overall slightly more confident regarding the correctness of their formalizations. The largest share of (strongly) agree responses is again present in the PSP group (51%), followed by LTL (46%) and EPL (44%). According to the stacked bar charts in Fig. 5, the perceived correctness of PSP appears to be slightly higher than in the other experiment groups in SE2, while EPL has a slightly lower perceived correctness than the other languages in ASE. According to Fig. 5c, a large share (44%) of participants with industry experience in the PSP group is undecided whether the given answer is correct. The percentage of neutral answers of participants with industry experience is lowest in the EPL group (30%) and only slightly higher in the LTL group. The largest share of (strongly) agree responses of participants with industry experience is present in the EPL group (42%), followed by LTL (38%) and PSP (34%). Figure 6 contains stacked bar charts of the participants' perceived ease of application of the tested languages. Interestingly, there appears to be a strong similarity between the perceived correctness and perceived ease of application responses in SE2 regarding the ranking of the approaches (cf. Figs. 6a, 5a). 
PSP with 25% (strongly) agreeing and 42% (strongly) disagreeing appears to be slightly easier to apply than EPL with 21% (strongly) agreeing and 48% (strongly) disagreeing, and LTL with 17% (strongly) agreeing and 48% (strongly) disagreeing is perceived slightly more difficult to apply than EPL. In ASE (cf. Fig. 6b), the application of PSP is perceived to be even easier than in SE2. Interestingly, EPL is perceived to be similarly easy as PSP with regard to application. Like in SE2, LTL is ranked last in perceived ease of application. Figure 6c focuses on industry participants and reveals striking differences between the groups. The perceived ease of application is highest rated in the EPL group with 33% (strongly) agreeing and 38% (strongly) disagreeing, which means that there is still a shift toward a negative rating. The strongest shift toward low ease of application is present in the LTL group with only 7% (strongly) agreeing and 52% (strongly) disagreeing. In between are the results of the PSP group with 22% (strongly) agreeing and 49% (strongly) disagreeing.

Statistical inference
Before applying any statistical test, its model assumptions must be tested and met. For a discussion of whether the normality assumption is violated by the acquired data, see "Appendix C." Since there is uncertainty regarding normality, a core assumption of parametric testing, nonparametric testing is the preferable approach. Consequently, a robust nonparametric test (cf. [14] and Rogmann [70]) is applied. Table 11 summarizes the test results for the bachelor-level course SE2. To take multiple testing into account, the p-values are adjusted based on the method proposed by Benjamini and Hochberg [8]. There is a highly significant result with a medium effect size magnitude, indicating that PSP provides a higher syntactic correctness than LTL. After p-value adjustments, no such result is present in the remaining syntactic correctness tests. All semantic correctness test results are highly significant with medium- to large-sized effects. There is no significant difference between the response times. Consequently, H 0,1 is rejected on the basis of syntactic and semantic correctness whereas H 0,2 and H 0,3 can only be rejected based on semantic correctness.
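To make the adjustment for multiple testing more concrete, the following is a minimal sketch of the Benjamini-Hochberg step-up adjustment. The function name and the raw p-values are hypothetical and serve illustration only; they are not taken from the tables of this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest and enforce
    # that adjusted values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values of three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
print(benjamini_hochberg(raw))
```

Each raw p-value is scaled by the number of tests divided by its rank, so the smallest p-value is penalized most, while the running minimum keeps the adjusted values monotone.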
In the master-level course ASE (cf. Table 12), there is a large-sized difference in syntactic correctness between EPL and LTL. Regarding semantic correctness, there are large-sized effects between PSP/LTL and EPL/LTL, indicating that the former approach outperforms the latter in each pair. As in SE2, there are no significant differences regarding the response times. Consequently, H 0,1 can only be rejected on the basis of semantic correctness, whereas H 0,3 is rejected based on both types of correctness. Table 13 contains the test results for participants with industry experience. There is no significant difference in terms of syntactic correctness and response time. Similarly to ASE, there is no significant difference in semantic correctness between PSP and EPL, while there are significant differences with large-sized effects when comparing PSP against LTL and EPL against LTL. Tables 14 and 15 summarize the test results regarding perceived correctness and perceived ease of application. Almost all test results are not significant, with two exceptions: (1) A significant test result (p = 0.0316) with a medium-sized effect is present in SE2 between PSP and LTL with regard to perceived correctness. Consequently, H 0,4 can be rejected in SE2. That is, PSP participants are significantly more confident that the formalization is correct than LTL participants at the bachelor level while such an effect is not measurable at the master level or within the sample of industry participants. (2) Participants with industry experience rate the ease of application of EPL significantly higher than that of LTL (p = 0.0023). Consequently, H 0,9 can be rejected for participants with industry experience.
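The effect size magnitudes reported above stem from a nonparametric comparison of groups. As an illustration of how an ordinal effect size of this kind can be computed, the following sketch implements Cliff's delta. That this is the statistic behind the reported magnitudes is an assumption based on the cited reference [14]; the function name and the sample values are hypothetical.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    # Count how often a value of the first sample dominates a value
    # of the second sample, and vice versa. Ties count for neither.
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical semantic correctness scores (in %) of two groups.
group_a = [60, 55, 70, 45]
group_b = [30, 50, 20, 40]
print(cliffs_delta(group_a, group_b))  # 0.875
```

The statistic ranges from -1 to 1 and only relies on the ordering of observations, which is why it does not require normally distributed data.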

Discussion
This section discusses the results and threats to validity of the study.

Evaluation of results and implications
Due to the large number of participants with industry experience, it became possible to consider a third population, namely participants with industry experience, who function as proxies for industrial practitioners with basic to modest industry experience. Based upon the stated goal, questions concerning understandability were generated. The understandability construct focuses on the degree of syntactic and semantic correctness achieved and on the time spent on modeling compliance specifications. The results per question are summarized in Table 16. By differentiating between syntactic and semantic correctness, it became possible to reveal that differences in understandability in formal modeling of compliance specifications predominantly lie in semantic correctness. Almost all test results regarding semantic correctness are highly significant with large-sized effects. Interestingly, no significant difference in semantic correctness is present between the pattern-based PSP approach and the CEP-based EPL language in the master-level course ASE and in the subset of participants with industry experience. That might imply that more experienced users are able to cope equally well with both approaches. Aside from that, the results suggest that the pattern-based PSP approach is more understandable than EPL and LTL and that EPL provides a higher level of understandability than LTL. In terms of syntactic correctness, PSP seems to be more understandable than LTL for less experienced users, while EPL seems to be more understandable than LTL for more experienced users. This study did not reveal any significant differences in response time. Regarding perceived correctness and perceived ease of application, there are two significant test results, which imply that transformations to PSP are perceived to be more correct than LTL transformations by less experienced users, and that more experienced users with industry experience find EPL easier to apply than LTL.
Overall, the results imply that the pattern-based PSP approach has advantages with regard to understandability. Therefore, the pattern-based approach seems to be particularly well suited for modeling compliance specifications. Moreover, the results indicate that EPL is more understandable than LTL. This could be important in cases where the set of available PSP patterns is not sufficient to model a compliance specification. In such cases, the compliance specification could be encoded in EPL for runtime verification or an extension of the pattern catalog could take place. In this regard, EPL specifications could be used to aid the creation of new patterns with underlying LTL formalizations by checking the plausibility of the LTL formula (cf. Czepa et al. [18,19]).
Moreover, the results are overall in line with two controlled experiments on the understandability of already existing formal specifications in LTL, EPL and PSP carried out by Czepa and Zdun [17]. The results of these controlled experiments with 216 participants in total suggested that existing specifications in PSP are significantly easier to understand than existing specifications in EPL and LTL. Moreover, the results implied that existing specifications in EPL are significantly easier to understand than existing specifications in LTL. The correctness of understanding was evaluated by letting the participant decide whether a truth value is the correct truth value of a specification, given a specific trace. In contrast to the current study, which focuses on the formal modeling of compliance specifications, no major differences between novice and moderately advanced users were found in understandability of existing specifications. Interestingly, the response times between the experimental groups were significantly different in most cases, an effect which appears to be absent during modeling (cf. Sect. 4.3).
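To illustrate what deciding the truth value of a specification for a given trace involves, the following sketch checks the PSP precedence pattern with global scope ("S precedes P") on a finite trace. It follows the common LTL reading of this pattern, (not P) W S. The trace encoding as a list of event sets, the function name and the event names are hypothetical and not taken from the experiment material.

```python
def precedes(trace, s, p):
    """Check 'S precedes P' (global scope) on a finite trace.

    Mirrors the LTL reading (not P) W S: P must not be observed
    before the first observation of S.
    """
    for step in trace:
        if s(step):
            return True   # S observed first, so the pattern holds
        if p(step):
            return False  # P observed before any S: violation
    return True  # P never occurred, weak until is satisfied


# Hypothetical events: precautions must start before renovation starts.
def is_precaution(step):
    return "precautions.started" in step


def is_renovation(step):
    return "renovation.started" in step


compliant = [{"inspection"}, {"precautions.started"}, {"renovation.started"}]
violating = [{"renovation.started"}, {"precautions.started"}]
print(precedes(compliant, is_precaution, is_renovation))  # True
print(precedes(violating, is_precaution, is_renovation))  # False
```

Checking S before P within a step means that a step containing both events counts as compliant, which matches the weak until reading used here.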

Threats to validity
In the following, all known threats that might have an impact on the validity of the results of this study are discussed.

Threats to internal validity
Threats to internal validity are unobserved variables that might have an undesired impact on the result of the experiment by disturbing the causal relationship of independent and dependent variables. There exist several threats to internal validity, which must be discussed:
- History effects refer to events that happen in the environment resulting in changes in the conditions of a study. The short duration of the study limits the possibility of changes in environmental conditions, and none were observed. The occurrence of such effects prior to the study cannot be entirely ruled out. However, in such a case, it would be extremely unlikely that the scores of one experiment group are more affected than another, because of the random allocation of participants to groups.
- Maturation effects refer to the impact the passage of time has on an individual. Like history effects, maturation effects are rather problematic in long-term studies. Since the duration of the experiment was short, maturation effects are considered to be of minor importance, and none were observed.
- Testing effects comprise learning effects and experimental fatigue. Learning effects were avoided by testing each person only once. Experimental fatigue is concerned with happenings during the experiment that exhaust the participant either physically or mentally. The short time frame of the experiment session limits chances of fatigue. No signs of fatigue were observed, nor were there any reports by participants indicating fatigue.
- Instrumental bias occurs if the measuring instrument (i.e., a physical measuring device or the actions/assessment of the researcher) changes over time during the experiment. Since the answers given in the experiment tasks were evaluated manually, this is a serious threat to validity. It might be the case that the experience gained in scoring some answers had an influence on subsequent evaluations. This threat was mitigated by evaluating the results in no specific prescribed order, and in case of substantial differences in grading, a discussion took place until consensus was achieved.
- Selection bias is present if the experimental groups are unequal before the start of the experiment (e.g., severe differences in previous experience). Selection bias is likely to be more threatening in quasi-experimental research. By using an experimental design with the fundamental requirement to randomly assign participants to the different groups of the experiment, it became possible to avoid selection bias to a large extent. In addition, the investigation of the composition of the groups did not reveal any major differences between them (cf. "Appendix D").
- Experimental mortality more likely occurs in long-lasting experiments since the chances for dropouts increase (e.g., participants leaving the town). Due to the short time frame of this study, experimental mortality did not occur.
- Diffusion of treatments is present if at least one group is contaminated by the treatments of at least one other group. Since the participants share the same social group, and they are interacting outside the research process as well, a cross-contamination between the groups cannot be entirely ruled out.
- Compensatory rivalry is present if participants of a group put in extra effort when the impression arises that the treatment of another group might lead to better results than their own treatment. This threat was mitigated by clarifying that different degrees of difficulty will be considered and compensated in the calculation of bonus points.
- Demoralization could occur if a participant is assigned to a specific group that she/he does not want to be part of. No indications of demoralization such as increased dropout rates or complaints regarding group allocation were observed.
- Experimenter bias refers to undesired effects on the dependent variables that are unintentionally introduced by the researcher. All participants received a similar training and worked on the same set of tasks. A manual evaluation of the given answers regarding their correctness was performed. To mitigate the threat of experimenter bias in that regard, the first, second and third author performed the evaluation of all tasks individually. Differentiating between semantic and syntactic correctness overall simplified the evaluation process by enabling a separation of concerns. A potential threat in that regard could be falsely classifying defects. Therefore, after the completion of all individual evaluations, in case of substantial differences in grading, a discussion took place until consensus was achieved.

Table 16, Q6: Are there differences in understandability between the tested approaches for participants with industrial working experience? There are significant differences in terms of semantic correctness between PSP and LTL as well as between EPL and LTL.

Threats to external validity
The external validity of a study focuses on its generalizability. In the following, potential threats that hinder a generalization are discussed. Different types of generalizations must be considered:
- Generalizations across populations: By statistical inference, generalizations from the sample to the immediate population are made. The initial study design considered two populations, namely computer science students that enrolled in the course SE2 as proxies for novice software engineers, designers or developers, as well as computer science students that enrolled in the course ASE as proxies for moderately advanced software engineers, designers or developers. Due to the large number of participants with industry experience, it became possible to consider a third population, namely participants with industry experience, who function as proxies for industrial practitioners with basic to modest industry experience. The results of this study show interesting discrepancies between these populations. In particular, there are no significant differences in understandability between PSP and EPL for more advanced users while a significant difference is measurable when testing less experienced users. In general, this study does not intend to claim generalizability to other populations without further empirical evidence. For example, it might be plausible that leading experts working in the software industry or as business administrators perform similarly to ASE participants or the subset of participants with industry experience, but this study can neither support nor reject such claims.
- Generalizations across treatments: The treatments are equivalent to specific tested languages. Treatment variations would likely be related to changing the contents, amount or difficulty of experiment tasks or the amount of training provided. The experiment design attempts to be as general as possible by using compliance specifications stemming from different domains and applying a moderate amount of training.
- Generalizations across settings/contexts: The participants of this study are students who enrolled in computer science courses at the University of Vienna, Austria. The majority of the students are Austrian citizens, but there is a large presence of foreign students as well. It would be interesting to repeat the experiment in different settings/contexts to evaluate the generalizability in that regard. For example, repeating the experiment with native English speakers might lead to different and presumably better results.
- Generalizations across time: It is hard to foresee whether the results of this study will hold over time. For example, if teaching of a specific tested language is intensified in the computer science curricula at the University of Vienna, then the students would bring in more expertise, which likely would have an impact on the results.

Threats to construct validity
There are potential threats to the validity of the construct that must be discussed:
• Inexact definition and construct confounding: This study has a primary focus on the construct understandability, which is measured by the dependent variables syntactic correctness, semantic correctness and response time. This construct is exact and adequate, and the dependent variables syntactic correctness and semantic correctness enable an even more fine-grained analysis than in existing studies that measure correctness by a single variable (cf. Feigenspan et al. [31] and Hoisl et al. [40]).
• Mono-method bias: Due to organizational reasons, keeping time records was the personal responsibility of each participant. The participants were carefully instructed how to record start and end times, and we did not detect any irregularities (e.g., overlapping time frames or long pauses) in those records. Nonetheless, this measuring method leaves room for measuring errors, and an additional or alternative measuring method (e.g., direct observation by experimenters or performing the experiment with an online tool that handles record keeping) would reduce this threat. However, these methods would have influenced the overall study design and potentially could have introduced other threats to validity (e.g., prolonged experiment execution potentially leading to an exposure of the experiment task contents or technical problems during experiment execution). To avoid mono-method bias in evaluating the syntactic and semantic correctness, the grading was not performed by a single experimenter but by three experimenters individually.
• Reducing levels of measurements: Both correctness variables and the response time are continuous variables. That is, the levels of measurements are not reduced. The Likert scales used in this study offer 5 answer categories rather than 7 or 11, because the latter would produce data of lower quality according to Revilla et al. [68].
• Treatment-sensitive factorial structure: In some empirical studies, a treatment might sensitize participants to develop a different view on a construct. The actual level of understandability based on the task solutions provided was measured, so the participants' view on this construct appears to be irrelevant.

Threats to content validity
Content validity is concerned with the relevance and representativeness of the elements of a study for the measured construct:
- Relevance: The tasks of this study are based on realistic scenarios stemming from three different domains in which compliance is highly relevant (cf. Elgammal et al. [29], Rovani et al. [71], and United States Environmental Protection Agency [83]).
- Representativeness: In the formal modeling of the compliance specifications, the use of all core temporal LTL operators and EPL operators was required, which means that the construct understandability was measured comprehensively. The use of each PSP pattern was required two or more times (cf. sample solutions of experimental tasks in "Appendix A"). Unfortunately, it was not possible to test all available pattern-scope combinations. However, the majority of specifications are based on the global scope (cf. Dwyer et al. [27,28]), which is also reflected in the realistic specifications used in the tasks of this experiment (cf. experimental tasks in Table 7 and sample solutions in "Appendix A"). That is, a representative subset of PSP was tested.

Threats to conclusion validity
Thorough statistical investigations of model assumptions were performed before applying the most suitable statistical test with the greatest statistical power, given the properties of the acquired data. That course of action is considered to be highly beneficial to the conclusion validity of this study. The decision to retain outliers might be a threat to conclusion validity, but all outliers appear to be valid measurements, so deleting them would pose a threat to conclusion validity as well.

Related work
We are not aware of any empirical studies evaluating the understandability related to the formal modeling of compliance specifications in particular. There exists, however, related work focusing on similar issues. Related studies in the field of business process management are concerned with declarative workflows (cf. van der Aalst [1]), which use graphical patterns with underlying formal representations in LTL (cf. Montali [56]) or event calculus (cf. Montali et al. [57]). Haisjackl and Zugal [35] investigated differences between textual and graphical declarative workflows in an empirical study with 9 participants. The descriptive statistics of this study indicate that the graphical representation is advantageous in terms of perceived understandability, error rate, duration and mental effort. The lack of hypothesis testing and the small number of participants are severe threats to the validity of this study. Zugal et al. [87] investigated the understandability of hierarchies on the basis of the same data set. The results of their research indicate that hierarchies must be handled with care. While information hiding and improved pattern recognition are considered to be positive aspects of hierarchies since the mental effort for understanding a process model is lowered, the fragmentation of processes by hierarchies might lower overall understandability of the process model. Another important finding of their study is that users appear to approach declarative process models in a sequential manner even if the user is definitely not biased by previous experiences with sequential/imperative business process models. They conclude that the abstract nature of declarative process models does not seem to fit the human way of thinking.
Moreover, they observed that the participants of their study tried to reduce the number of constraints to consider by putting away sheets that describe irrelevant subprocesses or by using the hand to hide parts of the process model that are irrelevant. As in the previously discussed study, it must be assumed that the validity of this study is strongly limited by the extremely small sample size. Haisjackl et al. [36] investigate the users' understanding of declarative business process models, again on the same data set. As in the previously mentioned study, they point out that users tend to read such models sequentially despite the declarative nature of the approach. The larger a model, the more often hidden dependencies are overlooked, which indicates that an increasing number of constraints lowers understanding. Moreover, they report that single constraints are overall well understood, but there seem to be problems with understanding the precedence constraint. As the authors point out, this kind of confusion could be related to the graphical arrow-based representation of the constraints where subtle differences decide on the actual meaning. That is, the arrow could be confused with a sequence flow as present in flow-driven, sequential business processes. As previously stated for the other two studies that are based on the same data set, the validity of this study is possibly strongly affected by the small sample size. De Smedt et al. [26] tried to improve the understandability of declarative business process models by explicitly revealing hidden dependencies. They conducted an experiment with 95 students. The result suggests that explicitly showing hidden dependencies enables a better understandability of declarative business process models. Pichler et al. [64] compared the understandability of imperative and declarative business process modeling notations. The results of this study are in line with Zugal et al. [87] and suggest that imperative process models are significantly more understandable than declarative models, but the authors also state that the participants had more previous experience with imperative process modeling than with declarative process modeling. The small sample size (28 participants) is a threat to validity of this study. Rodrigues et al. [69] compared the understandability of textual and graphical BPMN [59] business process descriptions with 32 students and 41 practitioners. They conclude that experienced users understand a process better if it is presented by a graphical BPMN process model whereas for inexperienced users there is no difference in understandability between the textual and graphical process descriptions. Jost et al. [46] compared the intuitive understanding of process diagrams with 103 students. They conclude that UML activity diagrams provide a higher level of understandability than BPMN diagrams and EPCs.
Software architecture compliance, which focuses on the alignment of software architecture and implementation, and requirements engineering are also related to this study. Czepa et al. [21] compared the understandability of three languages for behavioral software architecture compliance checking, namely the natural language constraint (NLC) language, the cause-effect constraint (CEC) language and the temporal logic pattern-based constraint (TLC) language, in a controlled experiment with 190 participants. The NLC language simply refers to using the English language for documenting software architectures. CEC is a high-level structured architectural description language that abstracts EPL. It supports the nesting of cause parts, which observe an event stream for a specific event pattern, and effect parts, which can contain further cause-effect structures and truth value change commands. TLC is a high-level structured architectural description language based on PSP. Interestingly, the statistical inference of this study suggests that there is no difference in understandability of the tested languages. This could indicate that the high-level abstractions employed bring those structured languages closer to the understandability of unstructured natural language architecture descriptions. Moreover, it might also suggest that natural language leaves more room for ambiguity, which is detrimental for its understanding. Potential limitations of that study are that its tasks are based on common architectural patterns/styles (i.e., a participant possibly recognizes the meaning of a constraint more easily by having knowledge of the related architectural pattern) and the rather small set of involved patterns (i.e., only very few patterns of PSP were necessary to represent the architecture descriptions). A controlled experiment carried out by Heijstek et al. [38] with 47 participants focused on finding differences in understanding of textual and graphical software architecture descriptions. Interestingly, participants who predominantly used textual architecture descriptions performed significantly better, which suggests that textual architectural descriptions could be superior to their graphical counterparts. An eye-tracking experiment with 28 participants by Sharafi et al. [74] on the understandability of graphical and textual software requirement models did not reveal any statistically significant difference in terms of correctness of the approaches. The study also reports that the response times of participants working with the graphical representations were slower. Interestingly though, the participants preferred the graphical notation. Hoisl et al. [40] conducted a controlled experiment on three notations for scenario-based model tests with 20 participants. In particular, they evaluated the understandability of a semi-structured natural language scenario notation, a diagrammatic scenario notation and a fully structured textual scenario notation. According to the authors, the purely textual semi-structured natural language scenario notation is recommended for scenario-based model tests, because the participants of this group were able to solve the given tasks faster and more correctly. That is, the study might indicate that a textual approach outperforms a graphical one for scenario-based model tests, but the validity of the experiment is limited by the small sample size and the absence of statistical hypothesis testing.

Conclusion and future work
The main goal of this empirical study was testing and comparing the understandability of representative approaches for the formal modeling of compliance specifications. The experiment was conducted with 215 participants in total. Major differences were found especially in semantic correctness of the approaches. Since formalizations in the property specification patterns (PSP) were overall more correct than in linear temporal logic (LTL) and event processing language (EPL), there is evidence that the pattern-based PSP approach provides a higher level of understandability. More advanced users, however, seemingly are able to cope equally well with PSP and EPL. That is, for more advanced users, these approaches can be used interchangeably, depending on which fits a concrete domain or task best. Moreover, EPL provides a higher level of understandability than LTL. Therefore, EPL is well suited for situations that demand runtime verification in which the set of available patterns in PSP is not sufficient to model a compliance specification, or for aiding the creation of new patterns with underlying LTL formalizations (cf. Czepa et al. [18,19]).
Moreover, the results are overall in line with two controlled experiments with 216 participants in total on the understandability of already existing formal specifications in LTL, EPL and PSP (cf. Czepa and Zdun [16]). In contrast to the current study, which focuses on the formal modeling of compliance specifications, no major differences between novice and moderately advanced users were found in understandability of existing specifications. Interestingly, the response times between the experimental groups were significantly different in most cases, an effect which appears to be absent during modeling.
Opportunities for further empirical research are the consideration of an extended set of representations including, for example, event calculus (cf. Kowalski and Sergot [52]) or Declare (cf. Pešić and van der Aalst [61]) and studying the understandability construct in different settings with other user groups (e.g., business administrators or professional software engineers). Moreover, besides the understandability construct, additional metrics such as changeability (i.e., "Is one representation easier to change when taking new/amended compliance specifications into account?") and verifiability (i.e., "Are there differences between the representations when it comes to assessing whether a given compliance specification is fully covered?") could be investigated.

p.started precedes (y < 1978 and (t = 'residential house' or t = 'apartment' or t = 'child-occupied facility') and renovation.started)

Figure 7 shows kernel density plots and box plots of the dependent variables syntactic correctness, semantic correctness and response time in the SE2 course. As the kernel density plot in Fig. 7a clearly indicates, there are differences in central location and shape between the semantic correctness distributions of the groups. While the LTL group has a very low density in the range of 50-100 % semantic correctness, the PSP group has a high density in the range of 40-70 % semantic correctness. The central location of the EPL group (about 35% semantic correctness) is located between the peaks of the two other distributions. Figure 7b shows two outliers in the LTL group, which represent participants who were able to achieve a higher level of correctness than most of their colleagues in the same experiment group. In Fig. 7c, a kernel density plot of the syntactic correctness is shown. All distributions have their central location at 70-75 %, but their shapes are different.
The PSP distribution has a particularly high density directly at the central location whereas the remaining distributions show higher densities in the lower correctness ranges. There is a single outlier in the PSP group indicating a participant who has achieved a slightly lower level of syntactic correctness (cf. Fig. 7d). Both the kernel density plot in Fig. 7c and the box plot in Fig. 7d indicate a clear difference in distribution. The assumption of equal variance seems violated. The same applies to the response time distributions shown in Fig. 7e, f. Figure 8 visualizes the data of the dependent variables syntactic correctness, semantic correctness and response time of ASE participants by kernel density plots and box plots. In Fig. 8a, the PSP semantic correctness distribution is rather flat with its peak at about 45%. While the LTL semantic correctness distribution has a high density in the lower correctness range (10-45 %) with its peak density at 20-25 %, the EPL distribution has a high density in the range of 45-65 %. Thus, all semantic correctness distributions appear to be different in shape and central location. Regarding syntactic correctness (cf. Fig. 8c), the LTL distribution appears to be bimodal with peaks at 50% and 70%. The EPL syntactic correctness distribution is steeper than the others with its peak at 70-75 %. In contrast, the PSP syntactic correctness distribution is strikingly flat with a slightly higher density in the higher syntactic correctness ranges. There is a single outlier in the EPL group showing a low level of syntactic correctness. The PSP group has its peak response time density at 65 min, and there are indications of bimodality with a second small peak at about 35 min. Both remaining response time distributions are rather similar in shape, but their central locations differ. LTL has its central location at 45 min whereas EPL has it at 55 min.
Figure 9 shows kernel density plots and box plots of the dependent variables syntactic correctness, semantic correctness and response time for the subset of participants with industry experience. The peak density in semantic correctness in the LTL group can be found at about 20 % while the other groups have their peaks at 50-60 %. The syntactic correctness distribution of the LTL group is less steep than those of the other two groups, with higher densities in the lower syntactic correctness ranges. While there are only minor differences in distribution shape of the response time variable between the LTL and EPL groups with their peak density in the range of 45-50 min, the PSP group has its peak density in the range of 60-65 min. Overall, the distributions differ in central location and shape in several cases.
According to the scatter plots in Fig. 10, there is a positive linear correlation between the dependent variables syntactic correctness and semantic correctness. That is, syntactic and semantic correctness are not isolated metrics, which is not surprising, because correct application of syntax is a prerequisite for enabling meaning. There is no correlation between the correctness variables and the dependent variable response time (cf. Figs. 11, 12). Consequently, a larger amount of time spent working on the experiment tasks did not necessarily result in higher correctness values.
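The strength of a linear relationship such as the one visible in the scatter plots can be quantified with a correlation coefficient. The sketch below computes Pearson's r; this is an assumption for illustration, since the text reads the correlation off the plots and does not name a coefficient. The function name and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-participant scores in percent (not data from the study).
syntactic = [55.0, 60.0, 70.0, 75.0, 80.0]
semantic = [20.0, 30.0, 35.0, 50.0, 55.0]
r = pearson_r(syntactic, semantic)  # close to +1 for a positive linear trend
```

A value near +1 corresponds to the positive linear pattern described for Fig. 10, and a value near 0 to the absence of a linear relationship described for Figs. 11 and 12.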

Appendix C: Evaluation of normality assumption and parametric testing by Welch's t test
Since the dependent variables syntactic correctness, semantic correctness and response time are interval-scaled, parametric methods would be the first choice. However, the multivariate normality assumption appears to be violated according to the Shapiro-Wilk tests of multivariate normality in Table 22, so multivariate parametric testing (MANOVA) is ruled out. According to the Shapiro-Wilk tests of univariate normality in Table 23, there are no indications of non-normality, but there are signs of non-normality in the descriptive statistics in Sect. B. Normal QQ plots of the data also show signs of non-normality (cf. Fig. 13). Due to the large sample sizes (n > 30) in SE2, it might be valid to assume that the Central Limit Theorem holds. In ASE, the sample size is not large enough to claim that. Since there is uncertainty regarding normality, the application of nonparametric testing should be preferred (cf. Sect. 4.3). Nonetheless, in case of assumed normality, parametric testing yields similar results (cf. Tables 24, 25). This additional testing was performed since the violation of normality is based on the interpretation of plots only, which leaves room for subjectivity.
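Welch's t test does not require equal variances in the compared groups, which fits the unequal variances noted in the descriptive statistics. The following minimal sketch computes the test statistic and the Welch-Satterthwaite degrees of freedom for two independent groups. It is an illustration under stated assumptions and not the analysis script of the study; the function name and the sample values are hypothetical, and the p value, which requires the t distribution, is omitted.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) of Welch's unequal-variances t test for samples a and b."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard errors of the two group means (sample variance / n).
    va = variance(a) / na
    vb = variance(b) / nb
    if va + vb == 0:
        raise ValueError("both groups are constant; t is undefined")
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical semantic correctness scores in percent (not data from the study).
group_psp = [45.0, 60.0, 55.0, 70.0, 50.0]
group_ltl = [20.0, 25.0, 15.0, 40.0, 30.0]
t_stat, dof = welch_t(group_psp, group_ltl)
```

The resulting statistic is compared against a t distribution with `dof` degrees of freedom; when both groups have equal size and variance, `dof` reduces to the value of the pooled-variance t test.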

Appendix D: Descriptive statistics of previous knowledge, experience and other features of participants
For the validity of the study, it is crucial to find out whether the randomized distribution to experiment groups resulted in well-balanced groups. For that purpose, this section provides descriptive statistics of the age, gender, programming experience, complex event processing experience, logical formalisms experience and industry experience of the participants per experiment group. Both the kernel density plot in Fig. 14a and the box plot in Fig. 14b show a nearly identical age distribution in all experiment groups of the SE2 course with a central tendency at 24 years. There are few outliers (i.e., two each in LTL and PSP, and three in EPL), which represent students who are older than the majority of their colleagues. In contrast to the nearly identical age distribution in SE2, there seem to be minor differences in age distribution between the experiment groups of the ASE course. The kernel density plot in Fig. 14c and the box plot in Fig. 14d indicate that the share of younger participants is larger in the EPL group than in the two remaining experiment groups. Overall, LTL participants are slightly older than participants of the other groups. Moreover, there is a single age outlier in the LTL group representing an older student.
Figure 15 shows the gender distribution. With 111 men and 34 women, there are about three times as many male as female participants in SE2. The share of women is larger in [...] In both courses, the share of female participants is smallest in PSP. There are about twice as many women in the LTL group in SE2 and in the EPL group in ASE as in the corresponding PSP groups, which indicates an imbalance in the distribution of female participants. Since, however, the share of women is overall low, the magnitude of potential disturbing effects is assumed to be low as well.
Figure 16 shows the participants' programming experience. According to the kernel density plot in Fig. 16a and the box plot in Fig. 16b, the central tendency is balanced at 2-3 years in SE2. There are three outliers each in the LTL and PSP groups, and a single one in EPL, which result from participants with long-term programming experience relative to their colleagues in the same experiment group. In line with expectations, the participants of the master-level course ASE have more years of programming experience than their colleagues in the bachelor course SE2 (cf. Fig. 16c, d). The peak density is at 5 years of programming experience in LTL and PSP. The participants of the EPL group appear to be slightly less experienced with a peak density at 4 years and a higher density in the range of 0 to 1 years. Each group has a single outlier with more years of programming experience than the other participants of the same experiment group.
Overall, the degree of industry experience is low in SE2, as is to be expected in a bachelor-level course (cf. Fig. 17a, b). In ASE, a substantial number of the students have already started to work in the industry (cf. Fig. 17c, d). Interestingly, EPL participants in ASE seem to have slightly more industry experience (cf. Fig. 17c, d) than their colleagues in the same course despite having less programming experience (cf. Fig. 16c, d). Figure 18 shows the participants' prior experience with Complex Event Processing (CEP). Almost none of the participants have any experience with CEP. Just two participants, one of the PSP group in SE2 and one of the EPL group in ASE, have prior experience with CEP. In contrast, the level of experience with logical formalisms is high (cf. Fig. 19). Overall, the share of experienced participants is larger in the master-level course ASE than in the bachelor-level course SE2. Interestingly, the share of participants with prior experience with logical formalisms is higher in the LTL group than in the other groups. A potential reason could be that some of the LTL participants misunderstood this question by falsely assuming that studying LTL for this experiment qualifies as "prior knowledge." Apart from minor differences between the experiment groups, which are to be expected in a completely randomized experiment design, the groups appear to be overall well-balanced. That is, there are no clear indications of disturbing effects on the measurement of the dependent variables resulting from unbalanced groups.