An experimental scrutiny of visual design modelling: VCL up against UML+OCL

The graphical nature of prominent modelling notations, such as the standards UML and SysML, enables them to tap into the cognitive benefits of diagrams. However, these notations hardly exploit the cognitive potential of diagrams and are only partially graphical, with invariants and operations being expressed textually. The Visual Contract Language (VCL) aims to improve visual modelling; it tries to (a) maximise diagrammatic cognitive effectiveness, (b) increase visual expressivity, and (c) raise the level of rigour and formality. It is an alternative to UML that does largely pictorially what is traditionally done textually. The paper presents the results of a controlled experiment, carried out four times in different academic settings and involving 43 participants, which compares VCL against UML and OCL and whose goal is to provide insight into the benefits and limitations of visual modelling. The paper's hypotheses are evaluated using a crossover design with the following tasks: (i) modelling of state space, invariants and operations, (ii) comprehension of the modelled problem, (iii) detection of model defects and (iv) comprehension of a given model. Although visual approaches have been used and advocated for decades, this is the first empirical investigation looking into the effects of graphical expression of invariants and operations on modelling and model-usage tasks. Results suggest VCL benefits in defect detection, model comprehension and modelling of operations, providing some empirical evidence on the benefits of graphical software design.


Introduction
There is a crying need for modelling in software and systems engineering. Modelling no longer needs to be advocated as good engineering practice; instead, its necessity emerges from the practice of software and systems engineering. Modelling and design are simply the natural response to a need for abstraction and better means for tackling complexity.
Visual modelling has always been part of software engineering (Chen 1976; Ross 1977; Ross and Schoman 1977), with graphical design being advocated as a way forward for nearly three decades (Harel 1988). Popular notations of software and systems engineering, such as the standards UML and SysML (OMG 2012), are graphical, which enables them to tap into the cognitive benefits that diagrams are known to provide (Larkin and Simon 1987); their graphical nature is seen as a major factor behind their popularity. However, they fail to fully exploit the visual spectrum and are criticised for having low cognitive effectiveness (Moody 2009; Moody and van Hillegersberg 2009).
Mainstream visual notations are criticised for their semantics problems (Wieringa 1998; Evans et al. 1998; Stevens 2002; Henderson-Sellers and Barbier 1999; Henderson-Sellers 2005; France et al. 2006; Broy 2011; Rumpe and France 2011; Micskei and Waeselynck 2011). Their semantics is seen as flimsy and ill-defined, resulting in imprecision due to the many ways in which phrases of a language may be interpreted and misinterpreted. Models expressed in these notations may contain inconsistencies (Lange and Chaudron 2006; Farias et al. 2012), are prone to ambiguities and misunderstandings, and are often not mechanisable because they lack means of semantic inference. This, in turn, hampers exhaustive verifiability using theorem proving or model checking, which requires a formally defined semantics. Nevertheless, rigorous-minded research communities have learnt to overcome this issue. The standards UML and SysML, seen by many as families of modelling languages (Cook et al. 1999; Clark et al. 2000), are often regarded as bloated language definitions that hinder rigorous usage. However, by defining well-founded profiles (a specialisation or subset of the standard notation; a family member), it is possible to tackle the semantics problem; it is in this way that many formalisations of UML (Varró 2002; Störrle 2003; Amálio et al. 2004; Amálio et al. 2005; Amálio 2007), OCL (Richters and Gogolla 1998; Richters 2001) and SysML (Amálio et al. 2016) have been achieved.
The graphics of mainstream languages can be disappointing. They cannot express all that is required diagrammatically; UML provides the textual OCL notation to express the constraints of operations and invariants. UML's graphical description of behaviour has been criticised (Henderson-Sellers 2005; France et al. 2006; Broy 2011; Dobing and Parsons 2006); collaboration and sequence diagrams are but partial behavioural descriptions, as they describe scenarios; state and activity diagrams provide total descriptions, but their emphasis on explicitly defined states and state transitions may result in descriptions that are cumbersome, especially for systems other than those traditionally classified as reactive.
The Visual Contract Language (VCL) (Amálio and Glodt 2015; Amálio and Kelsen 2010a) expresses software designs formally and graphically. It embodies a critique of UML that aims to improve the following: (a) diagrammatic expressivity, by describing pictorially what is described textually in the UML realm; (b) visual effectiveness, by designing VCL according to theories and guidelines for the design of visual notations; and (c) the semantics issue, by equipping VCL with a formal semantics. To achieve this, VCL comes with two novel visual notations, assertion and contract diagrams, key ingredients in ensuring that VCL does visually what is done textually with UML.
VCL's foundational premise is that graphical software design is a good idea. The paper investigates this premise to gauge the effectiveness of visual software design and VCL's novel diagram types. The empirical investigation involves a controlled experiment, carried out four times in university settings with 43 participants, which compares VCL against the UML and OCL standards (our baseline) to provide insight into benefits and limitations of visual modelling. The paper examines the following: (i) modelling of state space, invariants and operations, (ii) comprehension of modelled problem, (iii) detection of model defects and (iv) end-user comprehension of a given model, (v) usefulness, (vi) ease of use, (vii) usability and (viii) overall appraisal of examined notations. Results suggest benefits of VCL's graphical approach in defect detection and model comprehension, and more general benefits of visual design.

Contributions
The paper's contributions are as follows:
- This is the first empirical study that compares a design language that expresses predicates graphically against UML and its OCL satellite notation.
- This is the first empirical demonstration suggesting benefits of a diagrammatic approach to the modelling of operations using design-by-contract (Meyer 1992).
- This is the first empirical demonstration suggesting that a diagrammatic modelling approach benefits tasks associated with the usage of models. The paper shows that VCL performed significantly better than UML+OCL in tasks related to end-user model comprehension and defect detection.

Outline
The remainder of this paper starts with some background (Section 2) focussed on VCL and diagrammatic description. This is followed by the experiment's scope (Section 3), design (Section 4) and materials (Section 5). The paper then talks about the pursued statistical analysis (Section 6), the experiment's participants (Section 7) and the results of the controlled experiment (Section 8). Finally, it discusses threats to the validity of the paper's results (Section 9), presents related work (Section 10) and draws its conclusions (Section 11).

Background
The sequel provides background on the cognitive benefits of diagrams and VCL, discusses the principles of graphical notations applied in VCL's design, and presents VCL's trajectory and founding ideas explaining why VCL is a suitable representative of graphical modelling.

Diagrams and their Cognitive Effectiveness
When solving problems, humans use both internal representations, stored in their brains, and external representations, recorded on paper or some other medium (Larkin and Simon 1987). For cognitive scientists, the form of external representations matters to an agent's performance in various cognitive tasks, such as problem-solving or decision-making (Larkin and Simon 1987; Zhang 1997); form determines what information can be perceived, what cognitive processes can be activated and what can be discovered. According to Zhang (1997), "external representations are not simply inputs and stimuli to the internal mind; rather, they are so intrinsic to many cognitive tasks that they guide, constrain and even determine cognitive behaviour". Humans favour pictorial representations (Pinker and Freedle 1990; Larkin and Simon 1987; Goolkasian 1996; Goolkasian 2000; Harel 1988; Chen 2004). Larkin and Simon's study (1987) shows that text and diagrams containing the same information are not necessarily equivalent with respect to the processing required to extract information, highlighting that diagrams facilitate perceptual inferences; some are called free rides because they are easily perceptible (Shimojima 1996). Subsequent studies (Goolkasian 1996; Goolkasian 2000) replicated the findings of Larkin and Simon (1987), highlighting a picture advantage. However, the superiority of pictures should not be taken for granted (Larkin and Simon 1987; Petre 1995).
The practical relevance of diagrams is endorsed by their ubiquity in engineering (Ferguson 1977). Visual thinking is seen as intrinsic to engineering; thinking, designing and communicating with pictures are recognised as essential engineering activities. As in traditional engineering, diagrams became an integral part of software engineering (Moody 2009). Diagrams in all forms and shapes, from formal representations to informal and ephemeral sketches, constitute a prominent means of software engineering expression. Unlike traditional engineering, however, software diagrams are not tied to the physical shape of any designed artefact, which entails greater freedom regarding diagrammatic shapes and forms.
Unsurprisingly, software engineering visual languages have been advocated for decades as a means to facilitate human communication and problem solving (Harel 1988). Visual languages, such as Harel's statecharts (1987), UML and SysML, are widely taught. The development of UML is seen as pivotal in software engineering, providing a common unifying language that the field had never had before. Although diagrams favour perceptual inferences, this does not entail cognitive effectiveness (Larkin and Simon 1987; Petre 1995), as diagrams need to be designed to exploit the benefits of visual representations (Larkin and Simon 1987; Moody 2009). This is where mainstream visual languages lag behind; UML, for instance, is criticised for breaching many rules of visual language design (Moody 2009; Moody and van Hillegersberg 2009).

A Primer on VCL
VCL follows the approach to modelling of UML and many of UML's predecessors (Rumbaugh et al. 1991; Booch 1994; Wieringa 1998), based on an object-oriented (OO) style of description, which sees a software system as a collection of data with associated behaviour operable from the environment. Another VCL pillar is discrete mathematics and set theory, the foundation of languages such as Z (Spivey 1992; Woodcock and Davies 1996; ISO 2002) and B (Abrial 1996).
VCL is introduced here with an example of a secure bank whose complete VCL model is given in Amálio (2011). A banking system manages customers, accounts and transactions, and needs to be made secure. Figures 1 and 2 present a few model excerpts that illustrate VCL.
VCL organises models around inter-dependent packages, using ideas from aspect orientation. Figures 1 and 2 present diagrams of two different packages. Package diagrams (PDs) define packages (represented as clouds) and their dependencies on other packages. The PD of Fig. 1a says that package Bank imports sets from CommonTypes. Package CommonTypes provides definitions common across the model; Bank focuses on those concerns specific to banking; other packages in Amálio (2011) address security concerns and the modular weaving of security and banking.
Structural diagrams (SDs), VCL's equivalent of UML class diagrams, express domains of discourse in state space (data or conceptual) models by representing entities of interest as sets (round contours). The SD of Bank (Fig. 1b) has class sets for banking entities, namely customers, accounts and transactions, and uses value sets from the SD of CommonTypes (Fig. 1c), e.g. CT::Name. Objects and values are depicted as rectangles. The sets for dates and times of Fig. 1c use VCL's predicate language to define derived sets. Set Month, for instance, is defined as the natural numbers (set Nat) from 1 to 12 in a graphical depiction of a set comprehension: Month = {n : Nat | n ≥ 1 ∧ n ≤ 12}. The arrows emanating from Nat (predicate edges) refer to the source set and are combined through conjunction.
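The set comprehension underlying Month can be mirrored in a short executable sketch. This is purely illustrative (Python rather than VCL's Z semantics), using a finite stand-in for the natural numbers:

```python
# Derived set Month = {n : Nat | n >= 1 and n <= 12}.
# The two conjoined conditions mirror VCL's predicate edges from Nat.
Nat = range(0, 100)  # finite stand-in for the natural numbers

Month = {n for n in Nat if n >= 1 and n <= 12}

assert Month == set(range(1, 13))
```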
State spaces are subject to constraints (or invariants) that must be respected in all states of the system. Unlike UML, VCL identifies invariants in the data model (a SD defines a state space made up of state structures and their constraints) as assertions (named elongated hexagons) defined in assertion diagrams (ADs). In Fig. 1b, assertions HasCurrentBefSavings, SavingsArePositive and CorporateHaveNoSavings, defined in the ADs of Fig. 1d, e and f, embody relevant invariants: customers must hold a current account prior to opening a savings account (HasCurrentBefSavings), savings accounts must be positive (SavingsArePositive) and corporate customers must not hold savings accounts (CorporateHaveNoSavings).
Whilst SDs focus on static aspects, behaviour diagrams (BDs) concentrate on dynamics (or behaviour). BDs provide a map over the constituent units of a package's behaviour, which are defined in separate diagrams. The units of Bank's BD (Fig. 2a) include the operations available to the environment, namely: create customer, open account, deposit, withdraw, delete account, view an account's balance, get accounts in debt, and get accounts of a customer.
In BDs, operations that modify state are represented as contracts (double-lined elongated hexagons); those that observe (or enquire) state, as assertions (single-lined hexagons). Global operations (visible to the environment) stand alone; local operations (invisible to the environment) are placed inside the contours of their class sets. Local modifier operations can either create new class objects (constructors, symbol N), update existing objects (symbol U) or delete class objects (symbol D). For instance, the global OpenAccount does all that is involved in opening actual bank accounts, whereas the New operation inside Account, a sort of sub-operation, creates new account objects only. Modifier operations are defined in contract diagrams (CDs). ADs and CDs (e.g. those of operations CreateCustomer, Customer.New, AccWithdraw and Account.Withdraw in Fig. 2) comprise a box with an identifier, followed by compartments for declarations (top) and predicates (bottom), which may be further split in two. ADs have one predicate compartment, as only one set of states is being referred to; CDs have two predicate compartments, corresponding to the states of pre- and post-condition (to the left and right, respectively). Declaration compartments include relevant variables (either internal or of inputs and outputs), and imported assertions and contracts. Predicate compartments are made up of graphical formulas read from top to bottom and combined using conjunction. Two kinds of formulas are supported: logic- and set-based. Logic formulas, read from left to right, resemble their textual counterparts. Set formulas start from some inner graphical expression and grow outwards.
AD SavingsArePositive (Fig. 1e) expresses a local invariant of Account. It has no declarations; the predicate contains a pictorial propositional logic formula made up of atomic statements, involving predicate edges, that are joined by an implication to say that if the account's type is savings then its balance must not be negative: aType = savings ⇒ balance ≥ 0.
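The implication can be checked executably. The sketch below is hypothetical (a dictionary stands in for an Account object; field names follow the figure), intended only to make the logical reading concrete:

```python
# Local invariant SavingsArePositive: aType = savings ==> balance >= 0.
# The implication p ==> q is encoded as (not p) or q.
def savings_are_positive(account):
    return (not (account["aType"] == "savings")) or account["balance"] >= 0

assert savings_are_positive({"aType": "savings", "balance": 10})
assert not savings_are_positive({"aType": "savings", "balance": -5})
assert savings_are_positive({"aType": "current", "balance": -5})  # vacuously true
```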
The predicate of AD CorporateHaveNoSavings (Fig. 1f) expresses a set formula. Relation Holds, defined in the SD of Fig. 1b, which denotes a set of pairs, is restricted to those pairs with corporate customers and savings accounts (a sort of filtering); the restricted relation is then required to be empty; hence, corporate customers must not hold savings accounts. In Fig. 1f the restrictions are performed using domain restriction (symbol ◁) and range restriction (symbol ▷) via edge modifiers (represented as double arrows and denoting functions), and the outer shading indicates that the restricted set or relation must be empty. The resulting formula is: {o : Customer | o.cType = corporate} ◁ Holds ▷ {o : Account | o.aType = savings} = ∅.
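The domain/range restrictions and the emptiness check have a direct set-theoretic reading, sketched below on hypothetical data (customer and account identifiers are ours, not from the model):

```python
# CorporateHaveNoSavings: restrict Holds to (corporate customer, savings
# account) pairs and require the restricted relation to be empty.
def dom_restrict(S, R):   # S <| R: keep pairs whose first element is in S
    return {(x, y) for (x, y) in R if x in S}

def ran_restrict(R, S):   # R |> S: keep pairs whose second element is in S
    return {(x, y) for (x, y) in R if y in S}

customers = {"c1": "corporate", "c2": "personal"}
accounts  = {"a1": "savings", "a2": "current"}
holds     = {("c2", "a1"), ("c1", "a2")}  # who holds which account

corporate = {c for c, t in customers.items() if t == "corporate"}
savings   = {a for a, t in accounts.items() if t == "savings"}

# invariant satisfied: no corporate customer holds a savings account
assert ran_restrict(dom_restrict(corporate, holds), savings) == set()
```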
AD HasCurrentBefSavings (Fig. 1d) says that the set of customers with savings accounts is a subset of the set of customers with current accounts; hence, a customer must hold a current account prior to holding a savings account. This involves two internal set variables (included in the declarations compartment) to represent customers with current accounts (custsCurr) and customers with savings accounts (custsSav), which are defined in the predicate in a similar way: relation Holds is range-restricted using edge modifiers (symbol ▷) to accounts that are either current or savings, and the actual sets are obtained from these restricted relations using the relational domain operator (dom); finally, the bottommost formula says, using enclosure (or insideness), that custsSav must be a subset of custsCurr. This results in the following formulas: custsCurr = dom(Holds ▷ {o : Account | o.aType = current}), custsSav = dom(Holds ▷ {o : Account | o.aType = savings}) and custsSav ⊆ custsCurr.
AD GetAccountGivenAccNo (Fig. 2f) fetches the Account object corresponding to aNo? into output a!: a! ∈ {o : Account | o.accNo = aNo?}. AD Account.GetBalance (Fig. 2h), which stores the account's balance in the output bal!, is imported in the global AccGetBalance (Fig. 2g), which fetches the account object corresponding to the aNo? input (via imported GetAccountGivenAccNo) and calls Account.GetBalance on this object. CD Customer.New (Fig. 2c), a constructor, says how a new Customer object (c!) should be initialised in the post-condition; using predicate edges, the after-state (represented in bold) of custNo is set non-deterministically, and those of name, addr and cType are set to the corresponding inputs. CD Account.Withdraw (Fig. 2e) declares a natural number input amount and says in the post-condition that the new account balance (bold line around rectangle) is the old balance minus the requested amount.
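The range-restriction/domain reading of HasCurrentBefSavings can likewise be sketched executably. Data and helper names below are illustrative only:

```python
# HasCurrentBefSavings: custsSav must be a subset of custsCurr, where each
# set is the domain of Holds range-restricted to one kind of account.
def ran_restrict(R, S):   # R |> S
    return {(x, y) for (x, y) in R if y in S}

def dom(R):               # relational domain operator
    return {x for (x, _) in R}

holds   = {("c1", "a1"), ("c1", "a2"), ("c2", "a3")}
current = {"a1", "a3"}
savings = {"a2"}

custsCurr = dom(ran_restrict(holds, current))
custsSav  = dom(ran_restrict(holds, savings))

# every savings-account holder also holds a current account
assert custsSav <= custsCurr
```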
These two local operations (Customer.New and Account.Withdraw) are brought into the global context (of a system or package) through CreateCustomer (CD in Fig. 2b) and AccWithdraw (CD in Fig. 2d). The CD of CreateCustomer declares the inputs required from the environment and imports Customer.New; the CD of AccWithdraw declares the aNo? input identifying the account from which money is to be withdrawn, and imports assertion GetAccountGivenAccNo (Fig. 2f).
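The design-by-contract reading of Account.Withdraw (before- and after-state related by balance' = balance − amount) can be rendered as runtime checks. This is a hypothetical illustration, not VCL-generated code; the only pre-condition assumed is that amount is a natural number, matching the declared input:

```python
# Account.Withdraw as a contract: the post-condition relates the after-state
# balance to the before-state balance minus the requested amount.
def withdraw(account, amount):
    assert amount >= 0                                   # amount : Nat
    before = account["balance"]
    account = {**account, "balance": before - amount}    # after-state
    assert account["balance"] == before - amount         # post-condition
    return account

acc = withdraw({"accNo": "a1", "balance": 100}, 30)
assert acc["balance"] == 70
```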
VCL's ADs and CDs, illustrated in the VCL model excerpts of Figs. 1 and 2, are major novelties of VCL, giving VCL a strongly graphical character.

VCL and its Visual Effectiveness
VCL's design follows theories of visual notation design, namely the physics of notations (PoN) (Moody 2009) and the cognitive dimensions of notations (CDN) (Green 1989; Green and Petre 1996). The following exemplifies how these theories are applied in VCL, using Figs. 1 and 2:
- VCL tries to be well-matched to meaning, following CDN's closeness of mapping and PoN's semantic transparency, by conveying the underlying mathematics. For example: VCL's round contour set construct (e.g. Customer, Account, CustId and CustType in Fig. 1b, and Holds, Account and Customer in Fig. 1f) taps into similar shapes of mathematics (Venn or Euler circles); the shading of Venn diagrams is used in Fig. 1f to indicate that the set must be empty; the single and double lines of assertions and contracts (see Fig. 2a), respectively, refer to the fact that assertions involve a single set of states, whereas contracts involve the state-sets of pre- and post-conditions.
- VCL's graphical primitives follow PoN's principle of semiotic clarity. In the different diagrams of Figs. 1 and 2, sets are consistently rendered as rounded shapes, values and objects (members of a set) as rectangles, and constraints upon single states as hexagons (independently of whether they denote invariants or observe operations). Furthermore, VCL's primitives have a core meaning that varies slightly with the context, enabling users to infer the meaning of graphical expressions in different contexts, following CDN's consistency and PoN's graphical economy. The round contours of Fig. 1b do not mean exactly the same as the several round shapes of the ADs and CDs of Figs. 1 and 2, but they all have set-like meanings.
- PoN's principles of perceptual discriminability and visual expressiveness can be observed in the panoply of shapes and colours used across the different VCL diagram elements illustrated in Fig. 1: packages are green clouds, sets are rounded blue contours, assertions are red hexagons and contracts are brown hexagons, objects are yellow rectangles, shading distinguishes empty from non-empty sets, and size and brightness differentiate types of sets and edges.
- PoN's cognitive integration (integration of different pieces of information) and CDN's role-expressiveness (how pieces contribute to the whole) are intrinsic to VCL. In SDs, as illustrated in Fig. 1b, the assertions of Bank's SD, defined separately in ADs, explicitly say that the AD-defined invariants constrain the SD's state space. BDs, on the other hand, identify all the different pieces of behaviour defined separately in ADs and CDs, as illustrated in Fig. 2a. The importing mechanism of ADs and CDs (illustrated in Fig. 2b, d and g) integrates pieces defined elsewhere to make compound definitions. Furthermore, the '+' attached to packages (clouds) and to the assertions and contracts (elongated hexagons) of SDs, BDs and PDs provides a navigation clue that a separate diagram opens upon double-clicking. CDN's role-expressiveness is also manifested in SDs when line size and brightness are used to distinguish class from value sets (class contours are thicker), giving relevance to classes, which are the major abstractions of a domain and act as beacons.
- PoN's dual coding (text complements graphics to strengthen communication) and CDN's secondary notation are applied in the SDs of Fig. 1b and c to distinguish the different kinds of sets and to reinforce the multiplicity constraints of relations: sets include a word indicating their kind, class sets are bold-lined and the remaining sets have lines of normal thickness; relation multiplicity is conveyed both visually and textually. Figure 1f reinforces the empty-set meaning through both shading and the symbol ∅.
- PoN's complexity management (representing information without overloading the human mind) and CDN's abstraction gradient are intrinsic to VCL. Statics and dynamics are clearly separated through VCL's structural and behaviour diagrams; UML represents data and operations in class diagrams, which tend to become severely cluttered even for small to medium models. VCL's package construct (represented as clouds, Fig. 1a) defines large modules to keep the contents of each package manageable; reference sets (symbol ↑) enable references to sets from other packages. VCL operations are constructed modularly; operations and assertions may be composed of other modules (operations or assertions); for example, in Fig. 2d, operation AccWithdraw is made up of observe operation GetAccountGivenAccNo and local contract Account.Withdraw.
- VCL addresses CDN's hard mental operations dimension as part of its raison d'être, as it tries to improve the usability of formal software design, with its inherently hard underlying mathematics, through visualisation. However, this per se does not totally solve the problem, and two principles discussed above, CDN's abstraction gradient and PoN's complexity management, are key in giving VCL an abstract and modular ethos, which helps in dealing with hard mental operations. The notion of separation, inherent to modularity, is manifested in the different compartments of ADs and CDs (Figs. 1 and 2), which ease the hardness of the task through order and focus, as each compartment expresses something specific that is relevant to the whole (also an application of the cognitive integration principle). Furthermore, modellers are encouraged to come up with abstractions and express modular designs, as VCL provides the means to break down potentially overwhelming problems into manageable and meaningful chunks. BDs, for instance, encourage abstraction and modularity by letting the modeller focus on the different pieces that make up an overall behaviour. Despite this, there are still many ways in which VCL could be improved to ease hard mental operations.

VCL's Trajectory, Founding Ideas and Suitability as Visual Notation
VCL embodies ideas of both graphical and formal modelling. A major influence is Amálio's PhD thesis (2007), which proposes UML+Z (Amálio et al. 2006; Amálio et al. 2004), a modelling approach combining UML with the formal language Z. VCL's Z semantics (Amálio 2007; Amálio et al. 2005; Amálio 2019) was borrowed from this work. Another inspiration is the work led by Kelsen on the diagrammatic expression of behaviour in the language EP (Kelsen 2006; Kelsen and Ma 2008). VCL emphasises rigour, formality and modularity. Early works developed the concept with diagrams built using drawing tools (Amálio and Kelsen 2010a; Amálio and Kelsen 2010b). A major influence was the development of VCL's aspect-oriented modelling approach. Once VCL had been developed as a concept, we embarked upon the construction of VCL's tool, the Visual Contract Builder (VCB), which made VCL more tangible and firmly defined (Amálio and Glodt 2015). The tool paved the way to the use of VCL in coursework and student projects (Leemans and Amálio 2012a; Tobias et al. 2012), brought the language to life, and made the experiment presented here possible. Empirical results on modelling tools, emerging from a spin-off survey of the experiment presented here, highlighted that both VCL and its tool were being positively received by users (Amálio and Glodt 2015).
VCL is an experimental language embodying ideas on visual and formal modelling. It was designed not to deviate drastically from UML; it keeps the same OO foundation, which has proved suitable for software design. We wanted to improve the graphics, precision and connection to mathematical modelling. VCL SDs, for instance, introduce just a few novelties with respect to class diagrams, exploiting the idea of more self-contained and closely integrated diagrams focussed on data modelling while avoiding drastic divergences. The statecharts notation (Harel 1987) was inspirational; although developed independently from UML and its main predecessors, it ended up being incorporated into the standard. Likewise, we hope that VCL ideas can be incorporated into such standards if they prove useful.
VCL's major novelty, its capacity to express predicates visually, makes it a prominent representative of graphical design languages. There are other languages with a formal basis that are capable of expressing predicates visually, such as augmented constraint diagrams (ACDs) (Fish et al. 2005) and Visual OCL (VOCL) (Bottoni et al. 2001; Ehrig and Winkelmann 2006); however, VCL's most salient difference lies in its superior tool support. In a comparative study focussed on practical application (Tobias et al. 2012), VCL outperformed ACD and VOCL. VCL's VCB tool outperformed the UML tool Papyrus in the usability-focussed comparison of the tools used in the controlled experiment presented here. VCL is the only visual design language expressing visual predicates that tackles modelling in the large, through modelling primitives inspired by aspect orientation. In terms of usability, VCL appears to outperform both ACD and VOCL with respect to the use of colour for visual expressivity. VCL has been applied to case studies proposed as challenges by research communities, such as the large car-crash crisis management system and its variant, the Barbados car-crash management system (Amálio 2012; Mussbacher et al. 2012), and a cardiac pacemaker (Leemans and Amálio 2012a, b).
VCL is being used in student projects at undergraduate and masters level, with modest success in education. It has been used by many students to design systems and applications. Further developments of VCL could focus on: (a) VCL and its tool, aiming towards code generation; (b) empirical experimentation, following on from the results presented here; and (c) VCL's formal foundations.

The Experiment's Scope
The study presented here examines VCL as a visual design notation. It seeks to know whether VCL and its capacity to express predicates visually provide any advantage over existing standard notations, in particular UML and its OCL satellite textual notation, from the perspective of modellers and end-users. The sequel distils the aims of the study into the experiment's objective and research questions, which are then translated into tasks and hypotheses.

Objective
The experiment's objective is as follows: Evaluate the effectiveness of VCL on user performance using a set of tasks associated with constructing and using design models by comparing VCL against UML and its satellite textual language OCL.
To accomplish this, two perspectives are considered: modellers and end-users. As said above, the experiment gauges effectiveness, which is meant here as the degree to which something is successful or adequate in producing a result or accomplishing a purpose. This involves evaluating performance, that is, how effectively modellers or end-users accomplish tasks.
To fulfil its objective the experiment investigates the following: (i) modelling, which assesses performance in building design models and perceptions emerging from doing so; (ii) problem comprehension, which evaluates the problem comprehension gained from modelling; (iii) model usage, which assesses the performance of end-users in tasks related to the usage of models, namely defect detection and model comprehension; (iv) usefulness, ease of use, usability and overall appraisal, which assess perceptions emerging from experiment experiences.

Research Questions
The experiment seeks answers to the following research questions (RQs):
- RQ1: Is the performance of modellers in building software designs better with VCL than with UML+OCL?
- RQ2: Is the comprehension of the problem accrued from modelling better with VCL than with UML+OCL?
- RQ3: Is end-users' performance in tasks related to the usage of software designs, namely defect detection and model comprehension, better with VCL than with UML+OCL?
- RQ4: Is VCL perceived as being more useful and easier to use than UML+OCL?
- RQ5: Is VCL's usability better than UML+OCL's?
- RQ6: How is VCL perceived overall in comparison to UML+OCL?

Dependent Variables
The dependent variables hold measures to assess the different RQs; they are sampled according to independent variables, the most determinant of which is the notation (either VCL or UML). RQ1 to RQ3 are measured both objectively and subjectively; RQ4 to RQ6 are measured subjectively. All RQs are analysed quantitatively; RQ6 is also analysed qualitatively. Variable names (Table 1) combine an abbreviation of the category (e.g. Co = completeness; Ac = accuracy) with an abbreviation of the measured modelling aspect: state space (S), invariants (I) or operations (O). For example, CoS is completeness of state space. We use two types of variables: (i) proportion (the obtained quantity is divided by the maximum quantity, yielding a continuous number between 0 and 1); and (ii) nominal or categorical (possible values drawn from a bounded discrete set).
The sequel gives further details on how the different RQs are assessed.

RQ1, Modelling
RQ1's tasks involve building a design from a given case study narrative. The design is partitioned into: state space, constraints over the state space (invariants) and behaviour as operations made up of pre- and post-conditions (contracts). We compare the following: (a) VCL structural diagrams against UML class diagrams, (b) VCL assertion diagrams of invariants against OCL invariant constraints, and (c) VCL assertion and contract diagrams of operations against OCL operation constraints. The comparison is based on the following criteria:
- Completeness (Co) measures how much is modelled. This is based on a breakdown of requirements and corresponding modelling pieces, whose satisfaction is marked manually on an ordinal scale from 4 (fully satisfied) to 0 (unsatisfied).
- Accuracy (Ac) measures the quality of what is modelled, based on a partitioning of requirements into aspects of interest exercised by test cases, which are evaluated manually on an ordinal scale from 4 (fully satisfied) to 0 (unsatisfied).
This measurement apparatus, detailed in Amálio et al. (2013), aims to make the grading objective, repeatable and unbiased. Completeness and accuracy variables of Table 1 hold objective measures of modelling performance in state space (CoS and AcS), invariants (CoI and AcI) and operations (CoO and AcO) in the form of aggregate proportions (obtained score divided by maximum score). Categorical variables PMS, PMI and PMO of Table 1 hold subjective measures of modelling performance in state space (S), invariants (I) and operations (O).

RQ2, Problem Comprehension
RQ2 is evaluated through a set of multiple-choice questions framed from a modeller's perspective, objectively in variable PC (a proportion of correct answers) and subjectively in variable PPC of Table 1.

RQ3, Model Usage
This is evaluated with the following tasks:
- Defect detection (DD), related to model inspection, consists of identifying defects in a model with seeded errors.
- Model comprehension (MC), or how well end-users understand given models, is assessed through multiple-choice questions about a given design.
Variable DD of Table 1 provides an objective measure as the proportion of defects found. Variable MC, focussed on the end-user perspective, measures the proportion of correct answers in the MC questionnaire.

RQ4, Usefulness and Ease of Use
The examined notations are assessed in light of perceived usefulness (PU) and perceived ease of use (PEoU) (Davis 1989) in the debriefing survey. PU is the degree to which someone believes that a particular system would enhance their performance. PEoU is the degree to which someone believes that using a particular system would be free of effort. Both PU and PEoU are seen as important determinants of user acceptance of a technology (Davis 1989). The corresponding variables, U and EoU (Table 1), hold an aggregate proportion calculated from Likert-scaled statements.

RQ5, Usability
The debriefing survey assesses RQ5 based on relevant criteria, namely: readability, navigation, maps and overviews, live error checking, look and feel, learnability and comfort/satisfaction. Usability measures are held in the Us variables of Table 1.

RQ6, Overall Perception
The debriefing survey enquires about overall perceptions of the examined languages based on experiment experiences. The relevant RQ6 dependent variables of Table 1 are: Appr (an appraisal of positive, negative and neutral aspects of VCL based on the open-ended questions of the debriefing survey) and the variables that gauge the preferred notation with respect to state space (PNS), invariants (PNI), operations (PNO), overall preference (PN) and future usage (FN).

Hypotheses
There is one major independent variable, the notation used, with two treatments: VCL and UML. The hypotheses are formulated from the independent and dependent variables (Table 1). Each dependent variable has a corresponding null hypothesis, H0i (no difference between the notations), and alternative hypothesis, Hai (there is a difference between the notations).

Recruitment
Participants were recruited from the following institutions:

Experiment Design
The following explains the major elements of the experiment's design.

Case Studies
The experiment's case studies, detailed in Amálio et al. (2013), are as follows: -University Library (UL). A university library system enables members to borrow and return books, renew borrowings and recall books unavailable for loan. -Flight Booking (FB). A system to manage flight bookings of different airlines.

Experimental Tasks and Time Allocation
The experiment's tasks feed the dependent variables of Table 1. Each session required participants to work as both modellers and end-users, using either VCL or UML+OCL on one of the two case studies. A session lasted two hours and comprised a fixed sequence of tasks. Participants worked on each of the two systems, using VCL or UML+OCL. They were prevented from collaborating; the work was monitored as tasks were performed in a classroom. Although aware of the experiment's overall goal, participants were unaware of the experimental hypotheses and dependent variables.

Design Type and Scheduling
The experiment follows a crossover or within-subjects design: all participants are subject to the different treatments. It involves one main factor (or treatment), the design notation (either VCL or UML+OCL), and a secondary case study factor (either UL or FB), yielding the four treatment combinations of a 2×2 factorial design. The rationale is to (a) maximise the number of data points to increase statistical power and (b) remove or mitigate bias emerging from differences in case-study complexity or individual ability, at the cost of (c) possible carry-over effects (Greenwald 1976). Participants were split into two groups. The group assignment was mainly based on availability due to the experiment's voluntary nature. However, an effort was made to scatter the high-ability individuals evenly among groups based on the results of an ability questionnaire; this is elaborated further in Section 7. Table 3 outlines the experiment's four rounds. In each round, each group is given a different treatment. All participants were subject to the four treatment combinations.

Modelling Tools
The experiment relied on modelling tools. This enhances language usability, contributes to the quality of the resulting models and ensures a connection with modern-day reality, which relies heavily on software tools. The experiment uses two Eclipse-based tools: Visual Contract Builder (VCB) (Amálio and Glodt 2015; Amálio et al. 2011) and Papyrus. VCB is the only VCL tool. Papyrus was chosen because it is Eclipse-based, which ensures a certain degree of similarity between the tools.

Training
The training covered the examined notations, focussed on the more challenging tasks of modelling state space, invariants and operations, and relied on the participants' university training in UML-based design. It consisted of live, assisted modelling of a system from a requirements description using the notations under study and their supporting tools, mimicking the experiment's modelling tasks. No training was given on the experiment's model usage tasks (problem comprehension, defect detection and model comprehension) for time reasons: if participants were able to model, then they would be able to undertake the easier model usage tasks.
The training lasted 4 to 6 hours, with two mandatory hours per notation. Table 4 summarises the training provided at the different experiment replications. Each participant had at least 4 hours of training (2 VCL + 2 UML), with variations across the replications.

Instrumentation
The experiment's artefacts comprise: (a) case study narratives; (b) sample case study models for the tasks of modelling, defect detection and model comprehension; (c) problem and model comprehension questionnaires; (d) an ability questionnaire; and (e) a debriefing survey. All materials are given in full in Amálio et al. (2013). Each seeded defect and comprehension question was selected according to a number of criteria: it had to cover different aspects or parts of the system to the largest extent possible, and it should be neither trivial nor overly difficult to answer or find. The questions had to be relevant and genuine, ideally questions that a software engineer could ask about the system. Standard techniques for phrasing subjective questions and designing surveys were followed (Oppenheim 1996).

Case Study Narratives and Sample Models
All tasks of an experiment session use either one of the two case studies, UL or FB (Section 4.1), exemplars of systems used in the teaching of software design.
At the start of a session, participants were given a requirements narrative (given in (Amálio et al. 2013)). For the task on modelling state space and invariants, participants had to complete a given incomplete model (called initial model) - Fig. 3a gives UL's initial VCL model. For the modelling of operations, participants had to model two operations on another given model (intermediate model with a complete state space and incomplete dynamics).
The given models aimed to get the most out of the short modelling tasks. They acted as ice-breakers, countering the blank page effect and contributing to the tasks' fluidity and engagement, so as to provide meaningful experiment data.
For defect detection (DD), participants were given models with seeded defects, which they had to identify by completing an online form. Figure 3b gives the structural diagram of UL's DD VCL model and Fig. 3c gives the UML counterpart; all faulty models are given in full in Amálio et al. (2013). Defects were seeded in the solution models of both case studies; for example, one error in Fig. 3b is that authors do not have names, whereas in Fig. 3c class Author is missing altogether. Table 5 gives the frequency distribution of seeded defects across the different modelling aspects being analysed (state space, invariants and operations), considering both case studies. A χ2 goodness of fit test confirmed that the distributions across notations and case studies are even (see the goodness-of-fit p-values of .51 and .9 in Table 5), avoiding bias; no significant differences were found. For model comprehension, subjects were given a complete case study model. Two operations of these models for the UL case study are given in Fig. 4.
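The evenness check can be illustrated with a small χ² goodness-of-fit sketch. This is Python rather than the R the paper used, and the counts below are hypothetical, not the experiment's:

```python
def chi2_gof_stat(observed):
    """Chi-squared goodness-of-fit statistic against a uniform
    expected distribution: sum((O - E)^2 / E)."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Hypothetical defect counts across state space, invariants and operations.
counts = [10, 12, 14]
stat = chi2_gof_stat(counts)
# With df = 2, the .05 critical value is 5.991; a statistic below it
# indicates no significant departure from an even distribution.
```

A statistic of about 0.67 for these counts is well below the critical value, so such a seeding would be considered evenly distributed.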

Comprehension Questionnaires
After modelling, participants had to complete a questionnaire with 12 multiple-choice questions, each having one correct answer, to assess comprehension of the modelled problem. A sample question is given in Fig. 5.
Model comprehension questionnaires were accompanied by a complete model of the system. They contained 12 multiple-choice questions with a unique correct answer. A sample question of a model comprehension questionnaire for the university library case study is given in Fig. 6. Whilst the problem comprehension questionnaire assesses someone's understanding of the modelled problem, the model comprehension questionnaire assesses understanding of a given model. The operations of Fig. 4 give the answer to the question of Fig. 6: when a Copy is recalled, its status changes from onloan to recalled.

Ability Questionnaire
The ability questionnaire assesses the capabilities of participants prior to the experiment. It questioned participants on whether they had completed or were in the process of completing a higher education degree in computer science, and what was their exposure to computer programming, discrete mathematics, visual modelling (using either object-oriented or structured methods), and formal modelling. Figure 7 gives two sample questions of this questionnaire.

Debriefing Survey
The debriefing survey gauges perceptions resulting from individual experiment experiences. This includes perceived performance in tasks of modelling, problem comprehension, defect detection and model comprehension, as well as perceived usefulness, ease of use, usability, preferred notation, and positives and negatives of the two notations under study. In addition, the debriefing survey provides supplementary information meant to support and explain the quantitative results by providing qualitative insight. A sample question is given in Fig. 8.

Statistical Analysis
The quantitative analysis relies on null-hypothesis significance testing (NHST), effect sizes (ESs) and confidence intervals (CIs). ESs are quantitative estimates of the magnitude of some effect of interest, often the size of a difference. CIs give the precision of point estimates or measurements, providing a range of plausible values within which we can have a degree of confidence that the estimate is not due to chance. The analysis was conducted using the R statistical software (R Core Team 2015).

Means, Proportions and their CIs
The analysis emphasises measures of central tendency. For continuous variables, we calculate means (point estimates) whose precision is assessed through 95% CIs calculated as follows (Cumming 2012):

CI = M ± t.95(N−1) × SE, where SE = SD/√N

Above, M is the sample's mean, SE the standard error, SD the standard deviation, and N the size of the sample; t.95(N−1) is the critical value of the t distribution with N−1 degrees of freedom corresponding to the 95% range.[7] Categorical variables are studied with proportions, derived from frequency distributions, which, like means, are estimates whose precision can be assessed with CIs. For proportion CIs, we use the more robust approach of Newcombe et al. (Newcombe 1998; Newcombe and Altman 2000), as recommended by Cumming (2012), as it provides good approximations even when N is small and P is close or equal to 0 or 1; this gives the Wilson score interval:[8]

CI = (2NP + z² ± z√(z² + 4NP(1−P))) / (2(N + z²)), with z = 1.96 for 95% CIs

where P is the estimated proportion, x is the observed frequency for the property of interest, and N is the total number of observations (P = x/N).
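As a concrete illustration, the two interval computations can be sketched in Python (the paper's analysis used R; function names here are illustrative):

```python
import math
from statistics import NormalDist, mean, stdev

def mean_ci(sample, t_crit):
    """95% CI for a mean: M +/- t.95(N-1) * SD/sqrt(N).
    t_crit is the t critical value for N-1 degrees of freedom,
    taken from t tables or statistical software."""
    m = mean(sample)
    se = stdev(sample) / math.sqrt(len(sample))
    return m - t_crit * se, m + t_crit * se

def wilson_ci(x, n, conf=0.95):
    """Wilson score interval for a proportion P = x/n, the method
    recommended by Newcombe (1998) for single proportions."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # ~1.96 for 95%
    p = x / n
    centre = 2 * n * p + z * z
    spread = z * math.sqrt(z * z + 4 * n * p * (1 - p))
    denom = 2 * (n + z * z)
    return (centre - spread) / denom, (centre + spread) / denom
```

For example, wilson_ci(50, 100) gives roughly (.404, .596), matching the published Wilson interval for 50 successes out of 100.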

Hypothesis Testing
For a dependent variable V (Table 1), null and alternative hypotheses are formulated using either continuous or categorical templates. Hypotheses are tested by estimating probabilities called p-values. Given data D and some null hypothesis H0, a p-value is a conditional probability estimate, P(D|H0) (Cohen 1994): the likelihood of the given observations assuming that the null hypothesis is true. Rejecting H0 means that it is unlikely because our observations deem P(D|H0) to be unlikely; hence, we accept the alternative hypothesis. Null hypotheses are rejected at three levels of statistical significance, depending on whether the p-value is below α = .05 (*), α = .01 (**) or α = .001 (***).
Continuous hypotheses are tested using the non-parametric Wilcoxon test (hereafter referred to as W), a robust test as it does not assume that the sample population is normally distributed. Categorical variables are tested using the χ 2 test. Calculated p-values are two-tailed.
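A minimal sketch of the Wilcoxon signed-rank test with the normal approximation follows. This is a simplification of what R's wilcox.test (the reference implementation) computes; it drops zero differences and omits tie and continuity corrections:

```python
from statistics import NormalDist

def wilcoxon_signed_rank(a, b):
    """Two-tailed Wilcoxon signed-rank test for paired samples using
    the normal approximation. Returns (W+, p-value)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    # Rank absolute differences (1-based), averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4
    sigma = (n * (n + 1) * (2 * n + 1) / 24) ** 0.5
    z = (w_plus - mu) / sigma
    return w_plus, 2 * (1 - NormalDist().cdf(abs(z)))
```

For instance, five paired observations that all favour the first condition, such as wilcoxon_signed_rank([6, 7, 8, 9, 10], [5, 5, 5, 5, 5]), yield W+ = 15 and p ≈ .04.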
To counter the problems of multiple testing and to increase the reliability of NHST, the analysis uses the false discovery rate (Benjamini and Hochberg 1995) and the method of Benjamini and Yekutieli (BY) (Benjamini and Yekutieli 2001), a more relaxed alternative to the family-wise error rate and the conservative procedure of Bonferroni (Holm 1979), to calculate adjusted p-values. Null hypotheses are rejected based on adjusted p-values only.
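The BY adjustment can be reproduced with a short step-up routine, equivalent to R's p.adjust(p, method = "BY") (the sketch below is Python, not the paper's R):

```python
def by_adjusted(pvals):
    """Benjamini-Yekutieli adjusted p-values (step-up procedure).
    Each raw p-value is scaled by m * c(m) / rank, where
    c(m) = sum_{k=1..m} 1/k, then monotonicity is enforced from
    the largest p-value downwards."""
    m = len(pvals)
    c_m = sum(1.0 / k for k in range(1, m + 1))  # BY correction factor
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from largest p-value down
        i = order[rank - 1]
        q = min(running_min, min(1.0, pvals[i] * m * c_m / rank))
        adjusted[i] = q
        running_min = q
    return adjusted
```

For example, by_adjusted([0.01, 0.02, 0.03, 0.04]) returns four adjusted values all equal to 1/12 ≈ .083: individually significant raw p-values can become non-significant after adjustment, which is exactly the conservatism the paper relies on when rejecting null hypotheses.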

Effect Sizes (ESs)
We use as ES measurement the raw (or unstandardised) mean difference, considering two cases: paired and unpaired data. Given the experiment's within-subjects design, most of our mean difference calculations fall under the paired case. For paired data, the mean difference M_D is the mean of the pairwise differences, with CI (Grissom and Kim 2005; Pfister and Janczyk 2013; Franz and Loftus 2012; Cumming 2012):

CI = M_D ± t.95(N−1) × SD_D/√N

where SD_D is the standard deviation of the differences. For unpaired groups A and B, with M_UD = M_B − M_A, the CI is (Cumming 2012):

CI = M_UD ± t.95(N_A + N_B − 2) × SD_p × √(1/N_A + 1/N_B)

where SD_p is the pooled standard deviation. Raw mean differences provide an intuitive measurement of an effect, but they lack uniformity, making it difficult to compare effects when the scales differ. To cater for uniformity, we use the ES Cohen's d (Cohen 1988), which is appropriate for continuous variables and means. To estimate it, we use the standard deviation average (or pooled standard deviation) as the standardiser (denominator) (Cumming 2012):

d = M_D / SD_av, where SD_av = √((SD_A² + SD_B²)/2)

There are slightly different ways of calculating Cohen's d, which vary in the formula used for the denominator; the formula above, based on the standard deviation average, fits the paired design being followed (Cumming 2012). CIs for this ES are calculated using approaches based on non-central t distributions (Cumming and Finch 2001; Kelley 2007).
For categorical data, we use the ES measurement Cohen's h (Cohen 1988), based on the arcsine transformation, which is appropriate for differences between proportions:

h = 2·arcsin(√P1) − 2·arcsin(√P2), with SE = √(1/N1 + 1/N2) for CIs (Cohen 1988)

[7] The classical CI formula uses the value 1.96, the Z score of the 95% range in the normal distribution. However, the t approach is more robust as it handles small sample sizes (less than 30), the case of many CIs calculated here.
[8] The traditional formula for the SE of a proportion CI is SE = √(P(1−P)/N).
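Both standardised ESs can be sketched as follows (Python rather than the paper's R; function names are illustrative):

```python
import math
from statistics import mean, stdev

def cohens_d_av(a, b):
    """Cohen's d using the standard deviation average as the
    standardiser (Cumming 2012), suited to paired designs."""
    sd_av = math.sqrt((stdev(a) ** 2 + stdev(b) ** 2) / 2)
    return (mean(b) - mean(a)) / sd_av

def cohens_h(p1, p2):
    """Cohen's h: the difference between two proportions on the
    arcsine-transformed scale (Cohen 1988)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
```

For example, cohens_h(0.5, 0.25) ≈ .52, a medium effect under Cohen's conventions; the arcsine transformation makes a given difference in proportions comparable regardless of where on the 0-1 scale it occurs.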

Statistical Graphs
The paper depicts its results using the following graph types: -Histograms (example in Fig. 9c), typically used to depict frequency distributions, use the area of the bars to represent proportions.

Symbols
The sequel adopts the following symbols to convey statistical significance and ES:
- For statistical significance, given a p-value p: ns = not significant (p ≥ .05); * = p < .05; ** = p < .01; *** = p < .001.
- For ESs, symbols denote the attained magnitude level. Given a value es: ø = null (|es| ≤ .05); • = small (.05 < |es| ≤ .2); •+ = small to medium (.2 < |es| < .4); •• = medium (.4 ≤ |es| ≤ .6); ••+ = medium to large (.6 < |es| < .8); ••• = large (|es| ≥ .8).

Participants
Figure 9 characterises the experiment's participants using data gathered from the ability questionnaire. Participants had diverse levels of education and training, ranging from bachelor students taking a course on object-oriented design for the first time to students undertaking their PhD studies in topics related to software engineering. The proficiency scores cover the subjects' higher education in computer science, ability in computer programming, and exposure to discrete mathematics, OO modelling and formal modelling. Each criterion was evaluated on a scale from 0 (no competence or exposure) to 4 (high degree of competence or exposure to a subject); the final score is a proportion (obtained score divided by maximum score). From this score, participants were classified as unfit (between 0 and .4), novice (between .4 and .6), competent (between .6 and .8) or skilled (between .8 and 1). Table 9a provides the mean proficiency scores and 95% CIs of each venue and in total, and the frequency distributions across proficiency categories; both are pictured in Fig. 9d and e, respectively. In general, participants had the required level of training and education. All participants were, or were in the course of being, university-trained in UML-based software design; 23 under- and 19 post-graduate students took part in the experiment (Fig. 9c). Table 9b and the plot of Fig. 9f characterise the proficiencies of the two experimental groups.
We can see that the proficiency of group B is non-significantly higher than A's (Wilcoxon p-value = .06).
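The proficiency scoring and classification described above can be sketched as follows (Python; the thresholds come from the text, but whether the boundary values fall into the lower or upper category is an assumption, as the paper only states the ranges):

```python
def proficiency(scores, max_per_item=4):
    """Aggregate proficiency: obtained score divided by maximum score.
    Each criterion is rated from 0 (no exposure) to 4 (high exposure)."""
    return sum(scores) / (max_per_item * len(scores))

def classify(p):
    """Map a proficiency proportion to the paper's categories.
    Boundary handling (e.g. whether .4 is unfit or novice) is assumed."""
    if p < 0.4:
        return "unfit"
    if p < 0.6:
        return "novice"
    if p < 0.8:
        return "competent"
    return "skilled"
```

For example, a participant rated 3 on all five criteria scores 15/20 = .75 and is classified as competent.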

Results
We start with an overview of the experiment's results (Section 8.1), followed by an account of the results for: modelling (Section 8.2), comprehension and model usage (Section 8.3), usefulness and ease of use (Section 8.4), usability (Section 8.5) and overall perception (Section 8.6). Since the experiment uses a crossover design, we also look into learning effects (Section 8.7). Table 6 summarises the outcomes of hypothesis testing. The first column (Hyp) gives the dichotomous outcome: either the null (no difference) or the alternative (a significant difference in favour of either approach) hypothesis, highlighted using shading (green in the colour version). Columns VCL M/P and UML M/P give either a mean (M) or a proportion (P) for VCL and UML, respectively. Column VCL − UML gives the difference between the means or proportions. Column p/q gives NHST probabilities (p-values) with appraisals of significance (as per Section 6.5); raw p-values (first value), denoted p, are calculated using either Wilcoxon (W) or χ2 tests; if p is significant, we obtain the adjusted p-value, denoted q, using the false discovery rate (Benjamini and Hochberg 1995) and the method of Benjamini and Yekutieli (BY) (Benjamini and Yekutieli 2001) to counter multiple testing issues. Finally, column es (effect size) provides measures of magnitude, using either Cohen's d or h, and levels of magnitude (as per Section 6.5). Null hypotheses are rejected based on the adjusted p-value q.

Results: a Bird's Eye View
The experiment's results are given in full in Amálio et al. (2013) with an accompanying detailed analysis. The next sections present an abridged analysis with relevant results.

Modelling
Modelling is evaluated objectively based on completeness (how much is modelled) and accuracy (quality of what is modelled), supplemented with subjective measures. We present the data using plots of means and CIs, focusing on the overall results and on the results per case study and experiment venue.

Completeness
The completeness results, corresponding to hypotheses H1..3 of Table 6, are portrayed in Fig. 10. They are as follows:

Accuracy
The accuracy results, corresponding to hypotheses H4..6 of Table 6, are portrayed in Fig. 10. They are as follows: -In state space (Fig. 10b)

Perceived Performance
The debriefing questionnaire (Amálio et al. 2013) asked about any perceived notation advantages in modelling. The results, corresponding to hypotheses H7..9 of Table 6, are given in Fig. 11, which contains a table of frequencies (Fig. 11a), a histogram (Fig. 11b) and a plot of proportions and their CIs (Fig. 11c). The results are as follows: -In state-space (PMS), a higher but non-significant proportion of subjects perceived a better performance with VCL (

Comprehension and Model Usage
The tasks of problem comprehension, defect detection and model comprehension resulted in objective and subjective performance measures.

Objective Performance
The results, corresponding to hypotheses H10..12 of Table 6 and portrayed in Fig. 13, are as follows: -In problem comprehension (PC, Fig. 13a

Perceived Performance
The results, corresponding to hypotheses H13..15 of Table 6, are described in Fig. 14 with a table of frequencies (Fig. 14a), a histogram (Fig. 14b), and a plot of proportions and CIs (Fig. 14c). The results are as follows:

Usefulness and Ease of Use
Variables U and EoU aggregate perception scores (proportions) emerging from statements evaluated on a Likert scale from 1 (strongly agree) to 5 (strongly disagree). The results, corresponding to hypotheses H16,17 of Table 6 and portrayed in the plots of Fig. 16, are as follows: -In U (Fig. 16a)

Usability
The usability results, corresponding to hypotheses H18..25 of Table 6, are portrayed in Fig. 17, which contains a table of frequencies (Fig. 17a), a histogram (Fig. 17b), a plot of proportions and CIs (Fig. 17c), and a forest plot of ESs (Fig. 17d). The usability analysis, together with Fig. 17d, is consistent with the EoU results of the previous section. The spin-off study that compared the experiment's tools (Amálio and Glodt 2015) highlighted that VCL underperformed UML+OCL in the writing criteria, albeit non-significantly; this suggests that, in certain circumstances, graphical editors are less convenient than their textual counterparts (Amálio and Glodt 2015).

Overall Perception
Participants' overall perception was appraised with respect to preferred notation, and positive and negative aspects.

Preferred Notation
The results, corresponding to hypotheses H26..30 of Table 6, are depicted in Fig. 18 with a table of frequencies with levels of significance and ES (Fig. 18a), a histogram (Fig. 18b), a plot of means and CIs (Fig. 18c) and a forest plot of ESs (Fig. 18d). These results endorse VCL's positive perceptions, with its approach to invariants and operations being appraised favourably (Fig. 18d).

Positive and negative aspects
Individual comments to the several open questions of the debriefing survey were classified as positive, negative or neutral towards VCL in comparison to UML+OCL. Figure 19 gives the results; it contains a table of frequencies (Fig. 19a), a histogram (Fig. 19b), a plot of proportions and CIs (Fig. 19c), and two histograms detailing positive and negative comments (Fig. 19d and e). The results of hypothesis H31 are as follows:
[Fig. 18 caption: Perceived preferred notation. NP = no preference; PNS = preferred state space notation; PNI = preferred invariants notation; PNO = preferred notation for operations; PN = preferred notation overall; FN = notation for future usage]
- In terms of VCL's positives (Fig. 19d), understanding (U, 21), ease of use (oU, 20) and ease of finding errors (EFE, 19) were the most remarked. Participants also appraised positively VCL's modelling of behaviour (MB, 18) and invariants (MI, 14), VCL's visualisations (V, 14), and its usability with respect to ease of access to information (EAI, 13) and navigability (N, 12), while appreciating VCL as an overall language (OL, 10). UML's Papyrus tool was perceived by many as difficult (TD, 10). Some participants appreciated VCL's overall modelling (M, 8), organisation (Or, 8), user interface (UI, 6), appeal (A, 5) and state modelling approach (MS, 5), while remarking on UML+OCL's bad usability (BU, 6). A few participants found it comfortable to work with VCL (Ct, 4), and appreciated VCL's cohesion (Cn, 4) and capacity to provide overviews (Ov, 4), while remarking that OCL is difficult (D, 4).
- In terms of VCL's negatives (Fig. 19e), the most remarked aspects were familiarisation with UML (F, 16), the fact that UML is more known (MK, 9) and VCL's bad usability (BU, 9). Some participants remarked on UML's ease of use, VCL's cumbersome modelling of behaviour (CB, 8) and cumbersome editing (Ed, 7). Some appraised positively UML's Papyrus tool (T, 6) and UML's understandability (U, 5), while recognising that VCL's tool is difficult (TD, 6). A few praised UML as an overall language (OL, 4), its capacity to provide overviews (Ov, 3) and its expressivity (Ex, 3), noted that UML is okay and does the job (Ok, 3), and emphasised that they felt comfortable using UML (Ct, 3) and that it is easy to find errors in UML models (EFE, 3).

Learning effects
Figure 20 depicts an analysis of the learning effects typical of crossover designs (Greenwald 1976). Figure 20a contrasts first and second attempts at the different experiment tasks for the orders V-U (VCL followed by UML) and U-V (UML followed by VCL). For four (out of nine) measures, namely completeness of state space (CoS), accuracy of state space (AcS), accuracy of invariants (AcI) and problem comprehension (PC), there is a performance improvement on the second attempt (a learning effect) independent of the notation used. For completeness of invariants (CoI), there are improvements only when UML is used in the second attempt. For four measures, completeness of operations (CoO), accuracy of operations (AcO), defect detection (DD) and model comprehension (MC), there are improvements only when VCL is used in the second attempt. Figure 20b pictures the means and CIs of the paired differences of both V-U and U-V for each task measure: PD = V_VCL − V_UML. Figure 20c gives the calculated p-values and ESs of PD(V-U) and PD(U-V). We observe a significant effect for completeness of state space (CoS), accuracy of state space (AcS) and accuracy of invariants (AcI). The remaining differences are not significant; ESs tend to be small, being medium only for model comprehension (MC). Overall, there is a general tendency for the paired difference to be higher when VCL is used on the second attempt. This suggests that, due to VCL's novelty, participants' VCL proficiency grows as they learn by doing the experiment's tasks.

Threats to Validity
This section discusses threats to the validity of the experiment reported here, including conclusion, construct, internal and external validity.

Conclusion Validity
Conclusion validity is concerned with the relation between treatment and outcome, and with the statistical results. Following the debate on null-hypothesis significance testing (NHST) (Cohen 1994; Cumming 2013; Nuzzo 2014) and the recommendations of high-ranked social science journals, the analysis supplemented NHST with levels of uncertainty, plausibility and magnitude; measures of central tendency (means and proportions) were accompanied by confidence intervals (CIs) and effect sizes (ESs). As highlighted in Section 6.2, null hypotheses are rejected based on adjusted p-values only, calculated using the false discovery rate (Benjamini and Hochberg 1995) and the method of Benjamini and Yekutieli (Benjamini and Yekutieli 2001).
The statistical analysis strived for robustness. We used non-parametric tests and avoided breaching tests' assumptions. Continuous variables were analysed using the non-parametric Wilcoxon test, as it does not depend on normality assumptions; the Wilcoxon p-values are consistent with parametric t-tests and with the robust trimmed-means test (Yuen 1974) suggested by Wilcox and Keselman (2003). The χ2 test, used to assess categorical variables, is robust with respect to the distribution of the data; all categorical variables were found to be normal by Pearson's test. All statistical testing results are given in Amálio et al. (2013); here we provide Wilcoxon, χ2 and BY p-values only.
For ESs, we complemented the classical Cohen's d with the more robust alternatives of Algina et al. (2005) and Wilcox and Tian (2011). All calculated ESs, given in full in Amálio et al. (2013), were found to be consistent; here we focus on Cohen's d and h.

Construct Validity
Construct validity concerns how measurements are affected by factors related to the experimental settings. It relates to the training, case studies and measurement instruments, including the seeded defects and the different questionnaires. Table 7 presents statements drawn from the debriefing survey, accompanied by levels of significance and effect, that support the analysis that follows. Likewise for Fig. 21, which depicts performance-hindering factors.
The training concentrated on the challenging modelling tasks and relied on participants' prior exposure to software modelling (Section 4.5). It tried to refresh or enhance design skills and familiarise participants with the modelling tools within the time available. Many participants felt that (Table 7, training): the training was insufficient (T1); more training would have improved performance (T5); training was a major performance-hindering factor (Fig. 21, LTr); they lacked knowledge and experience (Fig. 21, LKE); and they felt prepared with UML (T2), but not with VCL or OCL (T3 and T4, respectively). This suggests the following: (a) the experiment's tasks were challenging, because many participants were unhappy with their performance; (b) participants' prior exposure to UML (confirmed by the results of T2 in comparison to T3 and T4) could have biased the results in favour of UML, but the results are positive towards VCL, hence (i) although seen as insufficient, the training seems to have worked, and (ii) VCL accommodates UML-trained practitioners; (c) training was insufficient for the modelling of invariants and operations, as most participants lacked prior training in these.
The experiment's case studies (CSs), exemplars of real-world software systems similar to CSs used in the teaching of software design, do not favour one notation over the other, and are from familiar application domains (a university library and a flight booking system). They laid the foundation for tasks that are both feasible and challenging, but not overwhelmingly complex. Participants recognised many of these characteristics in the CSs, as evidenced by Table 7 (case studies); most subjects felt that: (a) they had enough time to read the use case narratives (CS1), which (b) they understood (CS2), and (c) the case studies were fairly easy to understand (CS4 and CS5). Many subjects felt that the tasks were carried out with difficulty (CS3 in Table 7 and DTCS in Fig. 21). A few subjects lacked a clear understanding of the tasks (Fig. 21, LUT). York participants (N = 7) found the CSs to be realistic (CS7, CS8). Overall, this suggests that: (i) the tasks were challenging; (ii) many participants lacked prior knowledge for such a challenging experiment; and (iii) although the training worked to some extent, it was not able to fill the knowledge gaps of most subjects.
[Table 7 columns: mean of response with 95% CIs, standard deviation (SD). Statements are compared against neutrality to derive significance and effect.]
Lack of time (Fig. 21, LTi, 24 out of 43) was a performance obstacle. Limited time and the CSs' reasonable complexity were countered with starting models to be completed within the allotted times. Despite this, the tasks on the modelling of operations and invariants, and on problem comprehension, were not entirely satisfactory. Modelling of invariants had to be done together with the state space within 30 minutes, which proved short; modelling of operations was too difficult for a 35-minute task. The tasks' short durations, the lack of training and their inherent difficulty explain the low results obtained in these tasks.
One of the authors marked the models constructed by the participants. To minimise bias and maximise objectivity, a systematic, repeatable, divide-and-conquer scoring approach, detailed in Amálio et al. (2013), was pursued. Both completeness and accuracy relied on a requirements-based marking breakdown made up of individual items that are fairly objective and leave little room for subjectivity. The completeness scoring scheme focused on the modelling required by each requirement; the accuracy scoring scheme consisted of test cases, expressed as snapshots or object diagrams, based on an approach outlined in Amálio et al. (2004) and Amálio (2007). Marking both completeness and accuracy involved going through each item of the marking scheme.
The seeded defects, chosen to avoid both favouring any of the notations and being obvious, were scattered evenly across state space, invariants and operations (see Section 5.1). Defect detection (DD) was designed to be intuitive yet challenging, which was acknowledged by the participants (DD results in Table 7); an overwhelming majority of them understood the task (DD1, Table 7) and many found it challenging because the errors were not obvious (DD2), which is borne out by DD's modest final scores: .27 for VCL and .19 for UML+OCL.
The model comprehension questions were selected to be of a certain level of complexity and to ensure a reasonable level of coverage. They were articulated to avoid bias in favour of one treatment over the other; the multiple-choice format eased marking. The debriefing survey was designed to capture the perceptions of subjects with respect to the experiment experience. It contained questions that targeted the experiment's hypotheses related to subjective assessments. To avoid bias, and to increase the reliability of the collected data, we followed guidelines of questionnaire design (Oppenheim 1996).
The modelling tools could bias or confound the results. We used tools built atop the same platform, Eclipse, to ensure a certain degree of similarity. Papyrus is a major UML Eclipse-based tool with a significantly larger user base than VCL's tool, VCB. Despite some criticisms, both tools were seen as stable (Table 7, modelling tools). A spin-off survey, tied to the experiment presented here (Amálio and Glodt 2015), compared these two tools with a focus on usability; the results, favourable to VCB, concluded that participants could not always discern between tool and language. Nevertheless, VCB is the implementation of VCL and its underlying ideas. The fact that VCB was competitive with Papyrus, a major UML tool, testifies to the quality of both VCL and VCB. Certain VCB features may be superior to those of Papyrus, but that per se does not account for VCL's positive results presented here and in Amálio and Glodt (2015). Both tools, built on top of Eclipse's modelling infrastructure, have considerable room for improvement.

Internal Validity
Internal validity threats relate to external factors that affect the outcomes of the experiment but are not a consequence of the studied treatment. Table 8 presents statements drawn from the debriefing survey, accompanied by levels of significance and effect, that aid the analysis presented here.
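The paper does not spell out the statistical procedure behind the significance levels in Table 8. One common way to compare Likert responses against neutrality is a two-sided exact sign test on responses above versus below the scale midpoint; the sketch below illustrates that approach on hypothetical response data (the function name and the data are illustrative, not the paper's actual procedure or dataset), using only the Python standard library:

```python
from math import comb

def sign_test_vs_neutral(responses, midpoint=3):
    """Two-sided exact sign test of Likert responses against the neutral
    midpoint. Ties (responses equal to the midpoint) are discarded, as is
    standard for the sign test."""
    above = sum(1 for r in responses if r > midpoint)
    below = sum(1 for r in responses if r < midpoint)
    n = above + below
    if n == 0:
        return 1.0  # all responses neutral: no evidence either way
    k = min(above, below)
    # Two-sided p-value: probability of a split at least as extreme
    # under the null of a fair coin (binomial with p = 0.5).
    p = sum(comb(n, i) for i in range(0, k + 1)) * 2 / 2 ** n
    return min(1.0, p)

# Hypothetical 5-point Likert responses (1 = strongly disagree ... 5 = strongly agree)
responses = [4, 5, 4, 3, 5, 4, 2, 5, 4, 4, 5, 3]
print(round(sign_test_vs_neutral(responses), 4))  # → 0.0215
```

With 9 of 10 non-neutral responses above the midpoint, the test rejects neutrality at the .05 level, matching the intuition that the sample leans towards agreement.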
The experiment's volunteer basis is a possible internal threat. Usually, experiments resort to blocking to counter variability in the capability of subjects. In our setting, this was, by and large, infeasible; assignment to a group was decided mainly on the basis of a subject's availability. Section 7 highlights a slight and non-significant imbalance in the proficiency scores of the two experimental groups.
Information exchange did not affect the experiment: (i) all participants worked in parallel on the same case study, (ii) group sessions ran separately but on the same day (morning or afternoon), and (iii) participants were busy. There was hardly any opportunity or motivation for exchanging information.
The experiment's crossover design suffers from known carry-over effects (Greenwald 1976). Here, the most relevant carry-over effect is the learning accrued from carrying out tasks on the same system using the two languages; modelling using one language may give rise to a learning effect leading to improved performance when modelling with the next language. To balance the experiment and counter such learning effects, the order of the two notations with respect to the two case studies was permuted for each group: a group starting with VCL on the university library (UL) would start with UML+OCL on flight booking (FB), and vice-versa. We followed a 2×2 factorial design (case study is a factor), with the case study tasks being undertaken in parallel, to: (a) avoid case study difficulty as a confounding effect (the different levels of difficulty were acknowledged by the participants in the debriefing survey: 27 out of 43 participants said that FB was more complex (p = .63 [.48, .76]), 11 found that UL was more complex (p = .26 [.15, .40]) and 5 found the two case studies to be similar in complexity (p = .12 [.05, .24])); (b) counter exchanges of information; and (c) have more experiment practice (a total of 4 rounds), because software design is challenging, the training was somewhat insufficient (see the discussion on training in Section 9.2 above) and any practice improvements would only lead to more interesting results and better-founded opinions for the debriefing survey. To counter further carry-over effects, participants were not given feedback on their performance.
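The paper does not name the interval method behind proportions such as p = .63 [.48, .76], but Wilson score intervals at 95% confidence reproduce the reported figures exactly; a stdlib-only sketch (the function name is ours):

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion
    (z = 1.96 gives the usual 95% level)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# 27 of 43 participants found FB more complex
lo, hi = wilson_ci(27, 43)
print(round(27 / 43, 2), round(lo, 2), round(hi, 2))  # 0.63 0.48 0.76
```

The same function also recovers the other two reported intervals: wilson_ci(11, 43) gives [.15, .40] and wilson_ci(5, 43) gives [.05, .24].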
The analysis of order of notation (Section 8.7) detected a few significant learning effects with large or medium effect sizes; however, this does not affect the paper's main results: modelling of operations, defect detection and model comprehension.
Lack of engagement was refuted by the debriefing survey (Table 7): participants liked the experiment (E1), finding it interesting (E2), somewhat challenging (E3) and positive from a learning perspective (E4). One of the options to the question "please indicate the reasons for your lack of motivation if you felt somehow demotivated" was "I do not really see the need for software design modelling in software engineering", which was selected by only one participant (out of 43).

External Validity
External validity, concerned with the generalisability of the results, is a recurrent issue in software engineering research (Glass 1994). Given that we want to extrapolate our results to software engineering practice, a question that emerges concerns the experiment's degree of realism (Sjøberg et al. 2002). In software engineering controlled experiments there is often an issue of scalability: due to time constraints, the size of case studies and tasks is reduced, and this may break the connection to industrial realities, which usually deal with larger problems. We alleviated this issue with case studies that are neither too small nor trivial, and that are realistic; this was acknowledged by many participants (see Section 9.2, above). To ensure the feasibility of the modelling tasks within the allotted times, participants were given partial models that they were required to complete.
Another problem is the degree of representativeness with respect to software engineering practitioners (Falessi et al. 2017). The participant cohort was culturally diverse: we ran the experiment in universities of three European countries, involving participants from across the globe. Participants were students, who may be perceived as unrepresentative of industrial professionals; however, many postgraduate participants had previous industrial experience, and many were about to embark on professional careers in industry. The experiment results attest to the cohort's degree of representativeness: a few individuals performed remarkably well, others noticeably poorly. Collectively, the average results on state space modelling and the other model usage and comprehension tasks show that participants were able to undertake the required tasks, even though they were challenging. The low results on the modelling of invariants and operations are due more to the experiment's time constraints, the inherent difficulty of these tasks and the limited time available for training than to the cohort's overall ability. We see our cohort as a reasonably competent workforce, reflective of industrial settings applying design modelling. Empirical studies have failed to find significant differences in performance between students and professionals (Höst et al. 2000; Arisholm and Sjøberg 2004; Salman et al. 2015); a recent study found that professionals appear to perform better in tasks they are accustomed to, but that there is no difference when it comes to using new approaches or technologies (Salman et al. 2015).
A criticism of software engineering controlled experiments concerns pen-and-paper tasks which are not seen as reflecting modern-day realities. The experiment presented here used Eclipse tools, a popular platform for modern-day software development that we see as reflective of current practice.
We see the results presented here as generalisable to graphical modelling in general, even though only two languages are examined, because VCL is a suitable representative of a largely visual software design language (Section 2.4) and UML+OCL is a suitable baseline.

Related Work
This is the first empirical study to investigate the benefits of a modelling language capable of expressing predicates graphically, a pre-requisite to the diagrammatic expression of (i) system invariants and (ii) operations as contracts made up of pre- and post-conditions. Other languages with this capability, Visual OCL (Bottoni et al. 2001; Ehrig and Winkelmann 2006) and Augmented Constraint Diagrams (Fish et al. 2005), lack empirical scrutiny. Table 9 summarises related empirical studies. This paper covers several aspects of previous work:
- It covers comprehension, a focus of many studies (Purchase et al. 2001; Purchase et al. 2002; Torchiano 2004; Staron et al. 2006; Ricca et al. 2007, 2010), to investigate whether either notation, or modelling per se, fulfils its aims. This paper insists on end-user comprehension (through both model comprehension and defect detection tasks), but goes beyond it to explore the problem comprehension gained from modelling.
- It goes beyond data modelling, sharing with Kim and March (1995) the emphasis on the modeller perspective, with Otero and Dolado (2002) the focus on dynamic modelling, and with Briand et al. (2005, 2011) the focus on constraints and design by contract, but exploring novel graphical notations for the modelling of invariants and operations not covered by any other study.
Certain interesting aspects of related work are unexplored here:
- The impact of UML design on maintenance involving either code or design changes (Tilley and Huang 2003; Arisholm et al. 2006), which also delves into model comprehension, as understanding is often a precondition to accomplishing the required changes.
- Coarse-grained modelling (Moody 2002; Farias et al. 2012; Ali et al. 2014), which is related to VCL's coarse-grained modularity approach inspired by aspect-oriented modelling.
No other work in the literature has looked into the graphical expression of invariants and operations. This paper explores this problem through a comparison of VCL against OCL, which expresses invariants and operations textually. It examines VCL's novel graphical approach with respect to end-user and modeller understanding, modelling effectiveness, and usability.

Conclusions
Visual modelling has always been part of software engineering (Chen 1976; Ross 1977; Ross and Schoman 1977). Graphical design approaches have been advocated for nearly three decades (Harel 1988, 1992). Nevertheless, our knowledge of visual modelling is sketchy. We know that issues of syntax are vital to the usability of diagrammatic notations, seen as de-facto languages of software engineering practice (Moody 2009). However, the alleged benefits of visual modelling constitute a patchy region made up of many gray areas lacking empirical scrutiny. Is visual modelling a good idea? This article sheds some light on this general question. While data modelling is well studied and the evidence reasonably convincing, the same is not true for other, more intricate aspects of software design. This paper pursues an answer to the following question: can we effectively model the more complicated aspects of a software design, such as constraints and operations, graphically? While research on the visual expression of predicates and system dynamics has shown that a largely visual approach is possible, we are still largely unsure whether graphics are any better than text.
This paper delves into this question through a controlled experiment carried out four times, which, by studying VCL's effectiveness as a visual language, tries to draw more general conclusions about graphical modelling. VCL is largely graphical: different modelling aspects, including invariants and operations, are expressed largely diagrammatically, making it a suitable representative of largely diagrammatic languages. The experiment compares VCL against the standard UML and its satellite notation OCL, which together champion an approach to design modelling that is partially graphical, with invariants and operations expressed textually in OCL. The experiment involved 43 students from four universities in three countries who received training in UML, OCL and VCL. The comparison focussed on: (i) modelling of state space, invariants and operations (RQ1); (ii) problem comprehension (RQ2); (iii) model defect detection (RQ3); (iv) end-user model comprehension (RQ3); (v) usefulness and ease of use (RQ4); (vi) usability (RQ5); and (vii) overall appraisal (RQ6). Aspects (i) and (ii) take the perspective of the modeller or designer, (iii) and (iv) that of a design end-user, and (v)-(vii) that of a general software engineer or computer scientist. Careful attention was paid to ensuring that the observed trends were due to the notations and not to other extraneous factors.
A relevant result of this paper is that VCL modelling of operations achieved better results than OCL: individuals performed significantly better using VCL with respect to completeness, though not accuracy. A significant proportion of subjects perceived a better VCL performance, VCL was the preferred notation for modelling operations by a significant difference, and behavioural modelling was perceived as a major positive aspect of VCL (Fig. 19d). The results therefore suggest benefits of diagrammatic modelling of operations in a VCL style.
Results are unclear for the remaining modelling aspects. In state-space modelling, VCL's graphical improvements were appreciated by an interesting minority, but most remained agnostic to them, possibly due to the familiarity of the widespread UML class diagrams (Fig. 19e). In the objective measures, VCL failed to provide an improvement. Nevertheless, VCL's conservative approach with respect to UML modelling paid off: with some training, participants transposed their prior UML knowledge into VCL. Hence, on the one hand, the familiarity of UML class diagrams is hard to beat; on the other, participants transposed knowledge across languages, becoming accustomed to the differences in syntax.
In the modelling of invariants, VCL failed to provide significant improvements. However, a non-significant proportion of participants perceived a VCL improvement, a significant proportion of participants chose VCL as the preferred notation for invariants, and VCL modelling of invariants was perceived as a major positive aspect (Fig. 19d). These positive results, together with the low scores in the modelling of invariants and operations, suggest the need for a better experimental apparatus, with improved training and more time devoted to these more complex tasks, as remarked in the participants' appraisal of the training (Section 9.2).
In model usage, the results signal a VCL improvement. In defect detection (DD), VCL's objective performance was significantly better, which is consistent with the way participants perceived their performance; the ease of finding errors in VCL models was a frequently occurring positive aspect of VCL (see Fig. 19d). In model comprehension (MC), VCL was significantly better, which is also consistent with participants' perceived performance; comments concerning understanding, ease of use and ease of access to information were among the most frequently occurring VCL positive aspects (see Fig. 19d).

Table 10 Summary of findings for each research question

RQ1
On modelling of state space, invariants and operations:
• VCL's structural diagrams (SDs) were quasi-equal to UML class diagrams (variables CoS, AcS and PMS).
• On invariants, no significant VCL benefits were encountered.
• On operations, significant VCL benefits were encountered in completeness (CoO), but not accuracy (AcO). A significant proportion of subjects perceived a better VCL performance (PMO).

RQ2
On the modeller's comprehension of the modelled problem, no significant differences were encountered in either objective (PC) or subjective measurements (PPC).

RQ3
On model usage, significant VCL benefits were encountered in objective measurements of defect detection (DD) and model comprehension (MC).

RQ4
In usefulness (U), no significant VCL benefits were encountered. In ease of use (EoU), VCL was perceived as significantly better than UML+OCL.

RQ5
Usability was analysed from different angles. Significant VCL benefits were found for navigation (UsN), live error checking (LEC) and look and feel (LF).

RQ6
On the overall perception:
• A largely significant proportion of participants preferred VCL as the notation to express invariants (PNI) and operations (PNO).
• In the appraisal of positive and negative aspects (Appr), VCL was appraised favourably, as positives significantly surpassed negatives.
In three usability criteria, navigation, live error checking, and look and feel, VCL outperformed UML+OCL significantly. It is interesting to relate these criteria to the theories of notation design considered here: the physics of notations (PoN) (Moody 2009) and the cognitive dimensions of notations (CDN) (Green 1989; Green and Petre 1996). Navigation is related to PoN's principle of cognitive integration (as are maps and overviews, rated well but without significance); live error checking is related to CDN's error proneness and hard mental operations, as VCL's tool warns users when they write something meaningless, which aids reasoning and avoids mistakes; and look and feel is related to PoN's principles of semiotic clarity, perceptual discriminability and complexity management, a result of VCL's syntactic clarity, visual expressivity and overall tidiness. This suggests that, with respect to these theories of visual notation design, VCL and its tool appear to be better than UML+OCL and Papyrus.
Both notations were perceived as equally useful, but VCL was largely perceived as easier to use. VCL was also the preferred notation for invariants and operations by a significant proportion of participants. Finally, VCL was highly appraised as positive in comparison to UML. This appears to endorse VCL's better model usage results and VCL's overall graphical approach.
The findings presented here are summarised in Table 10 for each research question (Section 3.2). Overall, results suggest usability benefits of graphical software design, which was clear in model usage and more modest in modelling. Participants responded well to VCL's novel graphical notations, providing empirical evidence to motivate further research on diagrammatic modelling.
The results presented here should not be regarded as a claim of absolute scientific truth, but rather as a contribution to a research question. The stronger results need to be confirmed through replication, and the weaker results, together with the new questions spurred by the paper's analysis, require further experimentation.

Nuno Amálio is a lecturer in Computer Science at Birmingham City University (UK). Nuno was awarded a PhD in Computer Science from the University of York, an MSc in Software Engineering from the same university, and a BSc in Computer Science from the University of Lisbon (Portugal). Nuno worked as a postdoctoral researcher at the University of York, the Autonomous University of Madrid, the University of Luxembourg and City University of London. He also spent one year in industry working on the development of large-scale critical systems. Nuno's research interests include: software modelling and design, formal methods, visual languages, formal verification, formal semantics, requirements, empirical software engineering, and security.
Lionel Briand is professor of software engineering and has shared appointments between (1) the University of Ottawa, Canada and (2) the SnT centre for Security, Reliability, and Trust, University of Luxembourg. Lionel was elevated to the grade of IEEE Fellow in 2010 for his work on testing object-oriented systems. He was granted the IEEE Computer Society Harlan Mills award and the IEEE Reliability Society engineer-of-the-year award for his work on model-based verification and testing, in 2012 and 2013 respectively. He received an ERC Advanced grant in 2016, on the topic of modelling and testing cyber-physical systems, which is the most prestigious individual research award in the European Union. Most recently, he was awarded a Canada Research Chair (Tier 1) on "Intelligent Software Dependability and Compliance". His research interests include: software testing and verification, model-driven software development, applications of AI in software engineering, and empirical software engineering.