1 Introduction

Business process modeling is a broad and important area for practice and for applied research. Process modeling refers to the graphical representation of organizational or business processes, usually as a flow of activities [10]. It is common across industries and is important for designing and improving business processes and for analyzing industry goals and outcomes, including organizational efficiency, revenues, and social impact [10]. For these purposes, the quality of the process model is of importance. Model quality has been classified into syntactic quality ("correctness" of a model), semantic quality (the extent to which the model captures the behavior of the domain), and pragmatic quality (usefulness) [14]. When focusing on pragmatic quality, i.e., the correspondence between a model and people's interpretation of it, important aspects are the understandability and the readability of the model by a human user. Several efforts have been made to study and to improve user comprehension of process models. For example, [20] lists and describes in detail seven process modeling guidelines aimed at improving the understandability of models designed according to them. Another work [29] proposes a framework that improves the readability of large process models by abstracting from elements that are not useful and by tailoring the visualization (both in appearance and in format) to the needs of the end users. However, a comprehensive set of features for quantifying users' comprehension of process models is still missing.

A large body of research has addressed factors that influence model understandability and readability, relating both to business process models and to other kinds of conceptual models. Much attention in this respect has been given to the semantic clarity of the modeling language [19, 27] and to its graphical elements [21, 28]. Additional factors relate to specific properties of an individual model (e.g., complexity metrics [7, 17, 34, 35]). Visual features of elements in a model [30, 32] have been studied, specifically the effect of what is sometimes called "secondary notation" on model understandability [15, 31]. In contrast, the specific layout of a model has received little attention. To the best of our knowledge, only a few studies have investigated how layout features of a process model affect its understandability, and they provide very limited coverage. One specific example is [6], which addresses the flow direction of a process model and its effect on model understanding. That investigation addressed only consistent flow directions (e.g., left to right, top to bottom) and did not consider changes of direction within a model.

Cognitive psychology research has shown that the appearance of a model in general has a significant effect on users' comprehension. Because of that, approaches to automatically lay out graphs while preserving the mental map have been proposed [21]. Moreover, results reported in [16] indicate that patterns involving graphical representations (see, for example, patterns P1 and P2, which received the highest scores) are effective in terms of both usefulness and ease of use. The visual layout of a process model is therefore central to achieving its aims: effectively communicating the intended process, ensuring comprehension by its users, and enabling revision and improvement of the process model. Yet, we currently lack an agreed-upon set of concepts for describing and characterizing layout properties as perceived by humans, and this lack hampers the ability to comprehensively study the effect of model layout on understandability and readability.

In the context of modeling and model-generating tools, model layout has received attention and has been characterized by precisely defined properties. These tools typically include functionality that determines the layout and rearranges the elements in the model in order to improve its readability (e.g., [8, 9, 11]). The algorithms that are employed relate to precisely defined properties, such as avoiding line crossings, aligning model elements, and using right angles, with the goal of producing a "neatly" appearing model. However, there is no indication of how comprehensively these properties correspond to and capture the human perception of the model. In fact, there is no cognitive anchoring of the specific selection of features that are currently addressed.

We believe that a set of concepts describing layout features should comprehensively correspond to how humans perceive the layout of a model. At the same time, it should be precisely defined and allow quantification and measurement of layout properties, serving several purposes. First, it would permit a more focused study of the effect of process model layout on model understanding; such studies may address specific properties or combinations of properties. Second, it can become a basis for guiding the creation of process models. Currently, although there have been some broad efforts to guide modeling from a visual perspective—such as 7PMG [18]—process modelers individually decide how to design the process model layout. Note that most of the available modeling guidance (e.g., in 7PMG) is semantic and structural, while the visual perspective is addressed only to a very limited extent. Third, systematically developed layout guidelines may support the training of modelers. When applied in an online setting, they can further serve as a basis for intelligent modeling environments that provide feedback to modelers on how to improve the model. Finally, they can inform the development of automatic layout features of modeling tools.

The aim of this paper is to take a step toward a set of human-meaningful and measurable layout features of process models. The paper extends our previous work [1], where key visual features were elicited in an exploratory study. This study, as reported here, resulted in a set of layout features reflecting human perception. Our main contribution in the current paper is the transformation of human-perceived layout features into quantitative metrics. We focus on one specific feature (namely, consistency of flow) to highlight the challenges of such a transformation. We then propose two different approaches, operationalized into three metrics, to address these challenges, each highlighting a different aspect of flow consistency. Finally, we report the results of an empirical evaluation, which indicate the extent to which these metrics are consistent with human perception, show the models on which the metrics differ, and assess computational efficiency.

Accordingly, the remainder of the paper is structured as follows: Sect. 2 presents the methodology and setting of the exploratory study; Sect. 3 reports the findings of this study as a list of identified layout features; Sect. 4 focuses on the consistency of flow feature and presents alternative approaches for its quantification. Section 5 deals with the empirical evaluation of the proposed approaches and discusses the findings, while Sect. 6 reviews related work. Section 7 provides the conclusions and highlights implications of this study for future research.

2 Exploratory study

The exploratory study was guided by the research question of which layout features of process models are perceived as meaningful by model users. Due to the nature of this question, which seeks to discover features rather than corroborate hypotheses, the study was exploratory in nature. Yet, to increase validity, we also conducted a validation study, intended to validate the findings of the exploratory study with a different subject population. To identify candidate visual features of a process model, an empirical qualitative study was conducted. Qualitative data gathering was needed in order to understand the users' point of view [4]. Acquiring knowledge from participants was essential to understand how they perceive the visual layout of business process models.

2.1 Setting

The exploratory study consisted of two steps that took place at the Department of Information Systems at the University of Haifa. It was later complemented by a validation study. Participants in the exploratory study were 15 undergraduate students (in the first step) and seven graduate students (in the second step). All participants had some knowledge of business process modeling and came from a similar educational background: all had taken a variety of information systems analysis courses and had studied modeling languages. Participation in the study was voluntary.

The study included questionnaires and interviews. In the first step, 15 undergraduate students were asked to fill out a questionnaire. Following an initial screening of the obtained answers, the second step included interviews with seven graduate students, intended to gain a deeper understanding than could be gained from the questionnaires alone. The interviews were based on the questionnaires but allowed for interaction and prompting deeper explanations. Thus, interviews were used to get a better understanding of the user perspective on the visual layout of business process models [22].

2.2 Data collection process

Questionnaires The goal of the questionnaire was to elicit participants' beliefs and perceptions regarding the visual layout features of process models. A pilot questionnaire was given ahead of time to three participants in order to simplify and improve the questions. The questionnaire contained five pairs of BPMN models whose visual similarities and differences were to be elicited. The models were made small enough to fit a single page, yet it was ensured that their structure was clearly visible. All activity and edge labels were blurred so that participants would address the visual aspect of the model exclusively rather than "read into it." Some of the pairs included two models that appeared visually different according to the judgment of one of the researchers, while others appeared visually similar; in other words, the pairs were selected to cover different levels of similarity according to the researcher's judgment. The models were all presented in black and white in order to have participants focus on layout features; color might have drawn much attention and blurred the effect of less dominant features [22]. An example pair from the questionnaire is shown in Fig. 1. The questionnaire consisted of the same set of questions for each of the pairs. This set included one question asking the participants to rate the visual similarity of the models on a 7-point Likert scale. The goal of using a Likert scale was to prompt participants to actually look at the figures, compare their layouts, and evaluate their similarity. Following the Likert scale evaluation, two open-ended questions asked participants to indicate differences and similarities between the models (at least three of each). Only the answers to the open-ended questions were considered in the data analysis; the Likert scale evaluation was intended solely to prompt the comparison of the models and was not analyzed afterward.

Fig. 1 Example pair of models from the questionnaire. Labels should be ignored: in the questionnaire they were blurred so that participants would focus only on the visual aspect

Interviews As the second step of the study, the interviews were intended to gain a more comprehensive view of the layout features that influence the perception of visual appearance of models. To this end, the interviews were semi-structured and based on the questionnaires. First, participants were asked to fill out the questionnaire. Next, they were asked specific questions about their answers, prompting additional explanations about specific differences or similarities between the models of each pair. In particular, participants were asked to explain and justify their answers by indicating the relevant parts of the models. They were asked to express their preferences regarding the models and the specific features. They were also asked to compare the models in terms of clarity and to indicate what improvements could be made to increase the clarity of their appearance. The interviews were recorded and transcribed.

3 Analysis and findings

The data collected from the questionnaires and interviews were qualitatively coded and classified into categories.

3.1 Analysis

We started by analyzing the interview transcriptions, since they were more informative than the written answers of the questionnaires. The text was broken into segments, each referring to a single layout feature. Using qualitative data analysis methods [33], textual segments were coded by the model(s) they referred to and classified into categories in a process of open coding. Saturation of the categories was reached by the fourth interview, after which no new categories emerged from additional data analysis. We nevertheless analyzed all the collected data in a similar manner, including the written answers of the questionnaires obtained in the first step of the study. Open coding was followed by axial coding, where categories were related to each other and grouped into more general categories to form a hierarchy. In our case, the grouping related to the model's elements (e.g., edges) and properties (e.g., structure).

3.2 Categories

Table 1 summarizes the categories found by qualitative analysis. It provides a definition for each category, and examples of supporting statements taken from the questionnaires and interviews. The lower-level categories are grouped under the identified higher-level ones (where applicable). All the categories in the table are supported by at least two statements.

Table 1 Identified feature categories

3.3 Study validation

Although saturation was reached during the data analysis of the exploratory study, the relatively uniform population participating in the study posed two main threats to the validity of its findings. First, all participants had a similar educational background and little practical experience. Second, they were all Hebrew speakers, trained in reading both left to right and right to left, which is not typical of other populations. To overcome these limitations, we performed a separate validation study, interviewing modeling experts and practitioners from a more diverse geographical distribution. The validation study included seven modeling experts and practitioners from Israel, Austria, Italy, and Estonia. Each was interviewed using the same questionnaire and procedure as in the second step of the exploratory study. When analyzing the data, rather than repeating the full qualitative analysis process, we attempted to map the textual segments to the categories already found in the exploratory study. Only if a statement could not be mapped to one of these categories was it marked for further analysis. Eventually, we analyzed and classified the (few) unmapped statements.

The results support all the categories identified in the exploratory study, each with at least two indications by different participants. In addition, two further categories emerged, each supported by at least two participants, and we considered one of them a relevant addition to our list:

  • The first category is fixed size of activity boxes, which refers to the possibility of having different sizes of the activity boxes for short and long textual descriptions of the activities. Interestingly, the activity boxes in the models were all of fixed size. Yet, the experienced modelers recognized the possibility of varying the box size. As one of the interviewees said: “in both models the sizes of the activity boxes are fixed. When I create models I sometimes change the box size to fit longer descriptions. It is strange they didn’t do so.”

  • The other category deals with implicit versus explicit gateways, which represents a known property associated with the pragmatic quality of BPMN models. However, since this feature is specific to the syntax of BPMN, we decided to leave it out of our list.

In summary, the results of the validation study fully support the results of the exploratory study and extend them with one additional feature.

3.4 Threats to validity

Our study is subject to several threats regarding external validity. According to methodological guidelines of qualitative text analysis, a clear indication that sufficient data have been collected is that the theoretical saturation point is reached (i.e., no new categories are revealed by additional data) [2]. According to [2], when saturation is reached, no additional data collection is needed. In our study, theoretical saturation was reached at the fourth interview. Yet, we continued analyzing data beyond the saturation point and performed an additional validation study with a different population to ensure the validity of the findings.

Another potential threat relates to the homogeneity of our subject group in terms of their interest in process models. In the context of our study, however, the fact that all our subjects are familiar with BPMN is a desirable feature: process models are not general graphs, since they have very specific semantics associated with their elements. Therefore, only people familiar with this formalism are able to grasp the peculiar characteristics of the representation. This homogeneity, however, also implies that no generalizations can be made beyond the formalism.

Another potential threat relates to the fact that the subjects of the exploratory study were students. We argue that for this particular study this does not represent a major issue, since our questionnaires do not rely on experience [13]. Moreover, this risk was mitigated by the use of experts and practitioners in the validation study, whose results supported those of the exploratory study. To further mitigate this risk, we cross-checked our results with the literature on BPM, graph theory, and conceptual modeling. As reported, this literature review showed that all the features addressed by existing studies were included in our set, while our set also contains additional features.

4 Quantification of flow consistency

Table 1 highlights the different visual features identified in the exploratory study. In order to enable a systematic investigation of how these features impact model creation and use, it is necessary to transform them into metrics that can be automatically computed. In our previous paper [1], we already reported details of some of these metrics. In the following, we focus on one of the identified features, namely consistency of flow. In general terms, the consistency of flow measures the extent to which the layout of a process model reflects the temporal logical ordering of the process. This metric, not analyzed in detail in our previous work, is particularly challenging since it involves "high-level concepts" and how such concepts are represented. Moreover, there are several different ways of computing it, and it is not obvious which approach would most closely reflect human perception.

For example, let us consider the models in Fig. 2. Figure 2a depicts a model that structures the flow in one horizontal, left-to-right direction. The model in Fig. 2b, in turn, contains three "horizontal lines," each of them with a clear direction. In this case, in order to read the complete process, it is necessary to change the reading direction between lines: the reader has to "go back" with the eyes to the left side before continuing to the right again. Therefore, the flow direction of this process is less consistent than that of Fig. 2a. For the model in Fig. 2c, it is even more difficult to identify a clear flow direction.

For most of the visual features described in Table 1, a clear mathematical quantification is provided in our previous work [1]. The quantification of the consistency of flow, however, was only sketched, and several options exist for how to approach it. Metrics for the automatic identification of changes in the flow direction can be based on global or local features. Global features look at the overall shape of the model and allow detecting the flow consistency based on the "global shape" of the process. Local features, in turn, consider how activities (i.e., vertices of the graph representation of the process) and sequences (i.e., edges of the graph representation of the process) are positioned in relation to each other. Concerning global features, human beings can easily detect that the model illustrated in Fig. 2a consists of one horizontal line, while the model shown in Fig. 2b is spread over several lines. For algorithms, however, the identification of such patterns is rather difficult, while local features can be dealt with much more easily. Since our goal is to provide algorithmic solutions, we decided to follow the second approach, based on local features. To this end, and since we also want to closely reflect human perception, we devised three different metrics. The first two metrics calculate the direction of each edge, determine the most frequent direction based on majority voting, and then compute the extent to which the edges of the model are consistent with this predominant direction. The third metric, instead, looks at pairs of activities and determines whether their positioning reflects their temporal logical ordering.

Fig. 2 Examples of models with different layouts, obtained starting from the same process description. a Process model with a consistent flow direction. b Model with some violations of the flow consistency. c Model without a strong flow consistency

Fig. 3 Three differently shaped edges \(e_1\), \(e_2\), and \(e_3\) that are represented identically in our abstraction framework, since it considers only their start and end coordinates. In this case, the representation is given by the pair of points \(((s_x, s_y), (e_x, e_y))\)

Fig. 4 Snippets of three common representations of edges going from a BPMN gateway to activities. The edges of each snippet are laid out according to the edge shapes reported in Fig. 3

4.1 Graphical representation of processes

In order to present our metrics, we define the graphical representation of a process model as a diagram \(G = (V, E, L_V, L_E)\). G is a tuple composed of the set of vertices V and the set of directed edges \(E \subseteq V \times V\). Each vertex and each edge has a corresponding graphical representation. Therefore, differently from typical graph characterizations available in the literature [3], we add two functions \(L_V\) and \(L_E\) carrying information about the positioning of the respective elements on the modeling canvas (i.e., coordinates on the Cartesian plane). For vertices, we consider the central point of the graphical representation: \(L_V : V \rightarrow (\mathbb {N} \times \mathbb {N})\). For edges, in turn, we consider two coordinates, representing the start and end points: \(L_E : E \rightarrow (\mathbb {N} \times \mathbb {N}) \times (\mathbb {N} \times \mathbb {N})\). Note that this edge representation abstracts from the actual path of the edge (cf. Fig. 3, where the three edges have exactly the same representation in our framework) and is therefore able to deal with edges whose layout is typical of business process models. For example, Fig. 4 depicts three possible and common ways of representing edges exiting from a gateway. The three different representations are based on the edge styles reported in Fig. 3 and can also be observed in the processes depicted in Figs. 1 and 2.
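To make this notation concrete, the following minimal Python sketch (ours; all names are hypothetical) shows one possible way to encode a diagram \(G = (V, E, L_V, L_E)\). It captures only what the metrics below need, namely vertex centers and edge start/end points.

```python
from dataclasses import dataclass

Point = tuple[int, int]  # (x, y) coordinates on the modeling canvas
Edge = tuple[str, str]   # directed edge as a (source, target) pair of vertex ids

@dataclass
class Diagram:
    """Minimal stand-in for G = (V, E, L_V, L_E)."""
    vertices: set[str]                          # V
    edges: set[Edge]                            # E, a subset of V x V
    vertex_pos: dict[str, Point]                # L_V: center point of each vertex
    edge_pos: dict[Edge, tuple[Point, Point]]   # L_E: (start, end) of each edge
```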

4.2 Metrics based on edges’ directions

In this section, we introduce a function, \(\textsf {Cons}(G)\), which will be operationalized into two different metrics. Both metrics quantify the consistency of flow, i.e., the extent to which the layout of a process model reflects the temporal logical ordering of the process. To determine the extent of consistency, these metrics primarily focus on the edges of the process model and quantify the consistency of all the edges E in graph G.

Fig. 5 Illustrating example for the metrics based on edges' directions

To outline the idea behind these metrics, consider the process model depicted in Fig. 5, which shows a typical layout for a business process model. The fundamental idea is to determine, in a first step, the predominant flow direction of the process graph G. When looking at Fig. 5, we can easily see that the diagram points east, i.e., the predominant flow direction is east. In a second step, the metrics check the consistency of all the edges of graph G with the predominant flow direction. Analyzing each edge, we can see that \(e_1\), \(e_2\), \(e_7\), and \(e_8\) clearly point east, i.e., they are consistent with the predominant flow direction. For edges \(e_3\), \(e_4\), \(e_5\), and \(e_6\), it is less obvious, since these edges cannot easily be assigned to one clear direction. Following a naïve approach, we could just consider the most predominant direction of each edge (as we did for the entire process graph): we may classify edges \(e_3\) and \(e_6\) as pointing north (the edges look slightly more north than east). Edges \(e_4\) and \(e_5\), in turn, would be classified as pointing south (the edges look slightly more south than east). The consistency of flow could then be calculated by dividing the number of edges that are consistent with the predominant flow direction (i.e., all edges pointing east) by the overall number of edges, resulting in a consistency score of \(4/8=0.5\). Assigning just one direction to an edge would thus result, against our intuition, in a relatively low consistency of flow. Assigning two directions to each edge, instead of one, better reflects our intuition of consistency of flow. Edges \(e_1\), \(e_2\), \(e_7\), and \(e_8\) would still be classified as east. Edges \(e_3\) and \(e_6\), however, would be classified as both north and east, and edges \(e_4\) and \(e_5\) as both south and east. To calculate the consistency of flow, we can now consider all edges that have at least one direction consistent with the predominant flow (i.e., all edges pointing toward north–east or south–east would be considered correct) and divide them by the overall number of edges. In this case, and in line with our intuition, we would obtain a consistency score of \(8/8=1\).

In the following, we describe more formally how the metric \(\textsf {Cons}(G)\) captures our intuition of how the consistency of flow should be calculated. The proposed metric assigns to each graph G a value between 0 and 1 quantifying the degree of consistency. For this, we assume that the graph G has a predominant flow direction, which we have to determine. The set of all possible directions is \(D = \{\textsf {North}, \textsf {East}, \textsf {South}, \textsf {West}\}\), and the selection of the predominant flow direction is based on majority voting (i.e., the direction most edges belong to is considered the predominant flow direction). A precondition for the majority voting is that we can identify the direction of each edge. The function \(\textsf {Direction}: L_E \rightarrow \mathcal {P}(D)\), given the layout information of an edge, returns the set of directions (i.e., potentially more than one) the edge belongs to. In Sect. 4.2.1, we present the naïve approach to computing \(\textsf {Direction}\), which considers only one direction and serves as a baseline. In Sect. 4.2.2, in turn, we describe the classification of edges into two directions. We can then calculate the overall consistency \(\textsf {Cons}(G)\) by dividing the number of edges belonging to the predominant flow direction by the total number of edges.

Algorithm 1 shows the calculation of the metric. The procedure requires as input a graph G (with the layout details) and a \(\textsf {Direction}\) function used to obtain the directions of an edge. The procedure then iterates through each edge (line 5) and adds its directions to the proper direction counters (line 9). The frequency of the predominant direction is computed (line 13), and the final result is returned (line 14).

Algorithm 1 Computation of the \(\textsf {Cons}(G)\) metric
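Since the original pseudocode listing is not reproduced here, the following Python sketch (ours; its lines do not correspond to the line numbers cited in the text) illustrates the same procedure, building on the `Diagram` structure from Sect. 4.1. The `direction` argument stands for the \(\textsf {Direction}\) function and is assumed to return a set of compass labels for an edge's layout.

```python
from collections import Counter

def cons(diagram, direction):
    """Cons(G): fraction of edges consistent with the predominant flow direction.

    `direction` maps an edge layout ((sx, sy), (ex, ey)) to a set of
    directions drawn from {"N", "E", "S", "W"} (one for M-E1, two for M-E2).
    """
    counters = Counter()
    for edge in diagram.edges:                       # iterate through each edge
        for d in direction(diagram.edge_pos[edge]):  # add its direction(s)
            counters[d] += 1                         # to the direction counters
    predominant = max(counters.values())             # frequency of the majority direction
    return predominant / len(diagram.edges)
```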

The two upcoming subsections explain in detail the two possible instantiations of the \(\textsf {Direction}\) function: the first assigns to each edge exactly one direction; the second assigns two directions to each edge. Since these two definitions of \(\textsf {Direction}\) affect the final result of the \(\textsf {Cons}\) function, we give the two possible combinations of \(\textsf {Cons}\) and \(\textsf {Direction}\) different names: M-E1 and M-E2.

Fig. 6 Division of the circle of possible directions according to the number of directions assigned per edge. Both cases show the \(\textsf {North}\) (gray filled area), \(\textsf {East}\) (dotted area), \(\textsf {South}\) (grid area), and \(\textsf {West}\) (lined area) directions. a Division to assign one direction to each edge (M-E1): the directions do not overlap. b Division to assign two directions to each edge (M-E2): each direction overlaps with the two adjacent ones

4.2.1 One direction per edge (M-E1)

The first instance of the \(\textsf {Direction}\) function we analyze, which might be considered a naïve approach, assigns one direction to each edge. For readability purposes, in the rest of this paper we refer to the \(\textsf {Cons}\) function using the \(\textsf {Direction}\) described in this section as M-E1.

In order to identify the direction of an edge, we consider the angle created by the edge. Starting from the coordinates of the start and end points of the edge and using the two-argument arctangent function (atan2), we obtain the actual angle of the edge. To determine the direction of the edge, we divide the circle into four equal sectors of \(90^\circ \) (one for each direction, i.e., \(\textsf {North}\), \(\textsf {East}\), \(\textsf {South}\), \(\textsf {West}\)). Figure 6a highlights the four directions: the filled area identifies the \(\textsf {North}\) direction; the dotted area identifies \(\textsf {East}\); the grid area represents \(\textsf {South}\); and the lined area identifies the \(\textsf {West}\) direction.

We then check within which direction's angle interval the angle obtained for a particular edge falls. Since the intervals corresponding to the directions do not overlap, \(\textsf {Direction}\) always assigns exactly one direction to each edge.
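A possible implementation of this one-direction classification is sketched below (ours; the exact half-open sector boundaries are an assumption, since the text does not state how boundary angles are resolved). Note the flipped y-coordinate: on a modeling canvas the y-axis typically grows downward, so we negate the vertical difference to keep the compass intuition (east = 0°, north = 90°).

```python
import math

def direction_m_e1(edge_layout):
    """M-E1: assign exactly one direction per edge via four 90-degree sectors."""
    (sx, sy), (ex, ey) = edge_layout
    # Flip the y-axis: canvas coordinates grow downward, compass angles upward.
    angle = math.degrees(math.atan2(sy - ey, ex - sx))  # in (-180, 180]
    if -45 < angle <= 45:
        return {"E"}
    if 45 < angle <= 135:
        return {"N"}
    if -135 < angle <= -45:
        return {"S"}
    return {"W"}  # remaining angles: (135, 180] and (-180, -135]
```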

4.2.2 Examples

In the following, we illustrate the calculation of the M-E1 metric using the examples shown in Fig. 2. Table 2 summarizes the obtained results. For example, if we consider the process of Fig. 2a, we can see that it contains 1 edge looking \(\textsf {North}\), 48 looking \(\textsf {East}\), 2 looking \(\textsf {West}\), and 0 looking \(\textsf {South}\). As reported in line 14 of Alg. 1, the final score is given by \({\textit{predominant}}/{|E|}\), where \(\textit{predominant}\) is the frequency of the prevalent direction (\(\textsf {East}\) in our case, with frequency 48) and |E| is the number of edges of the graph. Given these counts, the final consistency score for this model is \(48{/}51=0.941\).

In Fig. 2b, we have 1 edge looking \(\textsf {North}\), 50 looking \(\textsf {East}\), 4 looking \(\textsf {West}\), and 4 looking \(\textsf {South}\). Therefore, the final consistency score is \(50{/}59=0.847\).

Finally, in Fig. 2c we have 5 edges looking \(\textsf {North}\), 20 looking \(\textsf {East}\), 17 looking \(\textsf {West}\), and 9 looking \(\textsf {South}\). Therefore, the final consistency score is \(20{/}51=0.392\).

Table 2 Consistency scores computed using the M-E1 metric

4.2.3 Two directions per edge (M-E2)

In this subsection, we report the details of the second definition of the \(\textsf {Direction}\) function, which assigns two directions to each edge. For readability purposes, in the rest of this paper we refer to the \(\textsf {Cons}\) function using the \(\textsf {Direction}\) described in this section as M-E2.

The main difference with respect to the previous definition is that each direction now corresponds to a \(180^\circ \) sector of the circle. With this definition, each sector overlaps with two others (cf. Fig. 6b): the \(\textsf {East}\) direction (dotted area) overlaps with both the \(\textsf {North}\) (filled area) and the \(\textsf {South}\) (grid area) directions. As a result, each edge is always associated with exactly two directions.
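Analogously to M-E1, a sketch of the two-direction classification is given below (again ours, with assumed half-open sector boundaries chosen so that every angle falls into exactly two sectors).

```python
import math

def direction_m_e2(edge_layout):
    """M-E2: assign exactly two directions per edge via overlapping 180-degree sectors."""
    (sx, sy), (ex, ey) = edge_layout
    angle = math.degrees(math.atan2(sy - ey, ex - sx))  # y flipped, as in M-E1
    dirs = set()
    if 0 < angle <= 180:            # upper half-plane
        dirs.add("N")
    if -180 < angle <= 0:           # lower half-plane
        dirs.add("S")
    if -90 < angle <= 90:           # right half-plane
        dirs.add("E")
    if angle > 90 or angle <= -90:  # left half-plane
        dirs.add("W")
    return dirs
```

For instance, an edge rising at 30° above the horizontal is classified simply as east by `direction_m_e1`, but as both north and east by `direction_m_e2`, matching the intuition discussed for Fig. 5.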

4.2.4 Examples

Despite this apparently slight change, the overall metric can generate very different values. For example, if we consider again the process models seen so far, we observe the score values reported in Table 3.

Table 3 Consistency scores computed using the M-E2 metric

The process in Fig. 2a contains 28 edges looking \(\textsf {North}\), 23 pointing \(\textsf {South}\), 49 pointing \(\textsf {East}\), and 2 looking \(\textsf {West}\) (each edge now counts toward two directions), with \(\textsf {East}\) as the predominant flow direction. Therefore, applying Algorithm 1 as before, the final consistency score for this model is \(49{/}51=0.960\).

In Fig. 2b, we have 28 edges looking \(\textsf {North}\), 31 pointing \(\textsf {South}\), 54 pointing \(\textsf {East}\), and 5 looking \(\textsf {West}\), again with the predominant flow direction \(\textsf {East}\). Therefore, the final consistency score is \(54{/}59=0.915\).

Finally, in Fig. 2c we have 21 edges looking \(\textsf {North}\), 30 pointing \(\textsf {South}\), 27 pointing \(\textsf {East}\), and 24 looking \(\textsf {West}\). Therefore, the final consistency score is \(30{/}51=0.588\). Unlike in the previous examples, for this model the predominant flow direction is \(\textsf {South}\).

Please note that the time complexity of both \(\textsf {Direction}\) procedures reported in the last two subsections is constant per edge. Therefore, the overall complexity of the \(\textsf {Cons}\) function is linear in the number of edges of the graph. Another key characteristic of these metrics is their semantic independence, i.e., they can be applied to any directed graph.

4.3 Metric based on behavioral profiles (M-BP)

While the two metrics introduced in Sect. 4.2 consider the edges of a process model, the metric presented here focuses on activities to determine the extent to which the layout of the model is consistent with the temporal logical ordering of the business process. For this, the metric looks at the relations between pairs of activities (i.e., their behavioral profiles [37]) and evaluates, for each of them, whether the way the activities are placed in relation to each other is consistent with their temporal logical ordering. For readability purposes, in the rest of this paper we refer to the metric described in this section as M-BP.

The fundamental idea behind behavioral profiles is the characterization of a process using relations between pairs of activities, defined according to three fundamental possibilities: (i) strict order; (ii) exclusiveness; or (iii) interleaving order. Let us present these relations using the example process model depicted in Fig. 7. Between activities A and B, the strict order relation holds, written \(A \rightarrow B\) (i.e., A always occurs before B, and never the other way round). Activities C1 and C2 are in an exclusiveness relation, \(C1 + C2\) (i.e., C1 cannot appear before C2 and C2 cannot appear before C1). Finally, E1 and E2 (but also E3) are in interleaving order, \(E1 \Vert E2\) (i.e., E1 might appear before E2 and E2 might appear before E1 as well).

Fig. 7 Example process model illustrating behavioral profiles between activities

The main idea of the metric M-BP is to measure the extent to which the layout of a process model reflects the temporal logical ordering of the activities. The behavioral relation that imposes a restrictive order among activities is the strict order relation. Therefore, we need to analyze the positions of the nodes referring to activities involved in such relations. For each strict order relation, we check whether the source node (i.e., the node that must occur first) is "graphically before" the target node (i.e., the node that must occur later). The "graphically before" condition holds if the target node is placed east or south of the source node, i.e., if the positioning of the two activities reflects their temporal logical ordering. To calculate the consistency score, we divide the number of graphically before relations by the overall number of strict order relations.

More formally, given a process graph \(G = (V, E, L_V, L_E)\), we define a behavioral relationship b as the tuple \(b=\langle r, s, t \rangle \), with \(r \in \{ \rightarrow , +, \Vert \}\) indicating the relation type and \(s,t \in V\) indicating the nodes associated with the source and the target activities (therefore, s and t must refer to nodes representing activities, not just "general nodes" of G). For convenience, we define projection operators for a behavioral relation instance \(b=\langle r, s, t \rangle \) such that \(\#_{\text {relation}}(b) = r\), \(\#_{\text {source}}(b) = s\), and \(\#_{\text {target}}(b) = t\). We also assume the availability of a procedure \(\textsf {BehavioralProfiles}\) that extracts all behavioral relations out of a process. The complete pseudocode of the algorithm is reported in Algorithm 2.

Algorithm 2 Computation of the M-BP metric

The algorithm starts by initializing several counters and by extracting all the behavioral profiles available in the process (line 4). It then iterates through all relations and considers only the strict orders (line 6). For these relations, the coordinates of the source and target activities are extracted (lines 7–8), and the algorithm checks whether the graphically before condition holds (lines 9–14). The final score is computed as the ratio of the graphically before relations to the total number of strict order relations (line 18).
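A compact Python rendering of this procedure could look as follows (ours; the line numbers cited above refer to the original listing, not to this sketch). The extraction of the behavioral profiles themselves is assumed to be available as a separate procedure (cf. [37]) and is outside the scope of the sketch.

```python
def m_bp(diagram, behavioral_profiles):
    """M-BP: fraction of strict order relations laid out 'graphically before'.

    `behavioral_profiles` is assumed to yield tuples (relation, source, target),
    with relation in {"strict", "exclusive", "interleaving"}.
    """
    before = strict = 0
    for relation, source, target in behavioral_profiles(diagram):
        if relation != "strict":      # consider only strict order relations
            continue
        strict += 1
        sx, sy = diagram.vertex_pos[source]
        tx, ty = diagram.vertex_pos[target]
        # "Graphically before": the target is east or south of the source
        # (canvas coordinates assumed: x grows rightward, y grows downward).
        if tx > sx or ty > sy:
            before += 1
    return before / strict if strict else 1.0  # no strict relations: trivially consistent
```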

The complexity of the algorithm is linear in the number of behavioral relations that are extracted. Behavioral profiles themselves can be computed quite efficiently, in low polynomial time with respect to the size of the process model [36].

4.3.1 Examples

If we consider the example process models seen so far, we observe the score values reported in Table 4. For example, the process in Fig. 2a has 43 strict order relations (computed with look-ahead 1), and 40 of them point \(\textsf {South}\)–\(\textsf {East}\). Therefore, the final score is \(40{/}43=0.930\). In Fig. 2b, we have 38 strict relations (with look-ahead 1), and 33 of them point \(\textsf {South}\)–\(\textsf {East}\). Therefore, the final score is \(33{/}38=0.868\). Finally, in Fig. 2c we have 37 strict relations (with look-ahead 1), and 23 of them point \(\textsf {South}\)–\(\textsf {East}\). Therefore, the final score is \(23{/}37=0.622\).

Table 4 Consistency scores computed using the M-BP metric

Please note that, since we use a look-ahead equal to 1, we do not penalize violations of the "graphically before" condition for relations involving activities that are very far apart in the process. In particular, a look-ahead value of 1 indicates that only pairs of very close activities are considered. Although this is a parameter of our approach (i.e., it can be changed), we opted for this configuration since each local violation implies a change in direction; this way, we count the changes without penalizing more than once per change.

Fig. 8 Number of process models within different consistency score intervals

5 Evaluation of flow consistency

To gain a better understanding of our new flow consistency metrics, we conducted several empirical evaluations. In total, we performed three evaluations, which together provide a more comprehensive picture of the capabilities of our metrics.

Since the approaches we defined rely on different assumptions, and therefore quantify the flow consistency using different feature sets, the first evaluation aimed to establish to what extent the metrics "agree" on the same model. This analysis lets us verify that the features we take into account are not redundant and that our approaches measure the consistency of the flow from different perspectives. If this is the case, we can identify the conditions under which one metric performs better than another. The second evaluation focused on the time efficiency of our metrics, i.e., the time required to compute each metric on all the models of our dataset. The third evaluation is an experimental one, performed with the support of several people familiar with process modeling, and aims at comparing the human assessment of flow consistency with the outcomes provided by our metrics.

Both automated analyses are based on a process model dataset generated during a modeling session conducted in December 2012 at the Eindhoven University of Technology, with students following programs in operations management and logistics, business information systems, innovation management, and human–technology interaction [23]. In total, the dataset contains 125 models, all referring to the same process description. The experimental evaluation is based on a subset of this dataset containing 14 models.

5.1 Metrics agreement

The first analysis we performed aimed at establishing the extent to which the different metrics agree. In order to evaluate this aspect, we counted the number of models that each metric places within a given consistency score interval. Figure 8 depicts the distribution of the values, using intervals of width 0.1. It is interesting to note that the metrics tend to assign scores in slightly different intervals. For example, M-E1 distributes the scores rather uniformly, but there is a considerable set of models (above the average) with scores in the interval 0.6–0.9. Metric M-BP, in turn, assigns rather high values, with only very few exceptions below 0.6. Finally, metric M-E2 assigns even higher scores, with most models lying in the interval 0.8–1.

Fig. 9 Each dot represents a process model: the position on the x-axis indicates its average ranking; the position on the y-axis indicates the standard deviation of its rankings. Both averages and standard deviations are computed over all three metrics

Fig. 10 Each line represents the average standard deviation for the interval considered (whereas the chart in Fig. 9 reports the average every 10 positions). The light-blue area indicates how values are distributed within the interval under consideration (i.e., min and max values)

In order to compare the agreement of our metrics, we decided to use rankings. Specifically, using the scores generated by each metric, we ranked the models in the dataset from the most consistent to the least consistent. Therefore, for each model, we ended up with three ranking positions (one per metric). We computed the average and the standard deviation of the rankings for each model and plotted the latter value. Figure 9 contains the results of our analysis. The figure not only shows the evolution of the standard deviation as the average ranking increases, but also reports the average values, computed every 10 data points. Looking at the average values of the standard deviation, it is interesting to note that our metrics tend to agree on the rankings near the extremes of the chart (i.e., lower standard deviation averages for very high or very low rankings). For the middle ranking positions, in contrast, the values are more spread out: the peak of the average standard deviation is reached for ranking positions 52–58. This clearly indicates that all metrics basically agree on "very consistent" and "very inconsistent" models, whereas for models of average consistency there is less agreement. This behavior can also be identified in the chart reported in Fig. 10.
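The ranking analysis itself is straightforward to reproduce. The following sketch (ours, using NumPy; the array layout is an assumption) computes, for each model, the mean and the population standard deviation of its rank positions across the three metrics.

```python
import numpy as np

def rank_positions(scores):
    """Rank models by one metric, from most (rank 1) to least consistent."""
    order = np.argsort(-scores)                # indices sorted by descending score
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def ranking_spread(score_table):
    """score_table: one array of 125 scores per metric (M-E1, M-E2, M-BP).

    Returns the per-model mean and standard deviation of the three ranks.
    """
    ranks = np.vstack([rank_positions(s) for s in score_table])  # 3 x 125
    return ranks.mean(axis=0), ranks.std(axis=0)
```

For example, a model ranked 118, 122, and 120 by the three metrics yields a (population) standard deviation of about 1.63, matching the value reported below for the model in Fig. 2c.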

The observed behavior is in line with our expectations, since very consistent and very inconsistent models are "globally" recognized, no matter which features are considered. This is confirmed by our observation: our metrics (and the features they take into account) capture very consistent and very inconsistent scenarios similarly (i.e., with low standard deviations of the rankings). For average situations (i.e., average rankings), in contrast, the characteristics of each metric play an important role: the different feature sets provide different characterizations, which in turn results in values that are more spread out. These large standard deviations in average cases also indicate that the features of the metrics described in the previous sections (i.e., majority voting with edge angles of \(90^\circ \) and \(180^\circ \), and directions with respect to behavioral profiles) are not redundant.

Consider as an example the linear process model illustrated in Fig. 2a. It is ranked in positions 3, 5, and 24 (by the M-E1, M-E2, and M-BP metrics, respectively), with a rather low standard deviation of 9.4. This is an example of a model that all metrics consider consistent. It is placed on the very left side of the chart in Fig. 9.

The model reported in Fig. 2c has an even lower standard deviation (1.63) with the following rankings: 118 for M-E1, 122 for M-E2, and 120 for M-BP. This indicates that all metrics consider this an inconsistent model. It is located on the very right side of the chart in Fig. 9.

The model depicted in Fig. 2b is not consistently ranked by the metrics (positions 11 for M-E1, 36 for M-E2, and 54 for M-BP). This model has a standard deviation of 17.6 and appears toward the central part of Fig. 9. It clearly highlights the different features that each metric takes into account: edge directions and behavioral profile violations. Considering the directions of edges, this model, compared to the other models, has only a few inconsistencies (such as the edges connecting the horizontal fragments) and many properly oriented edges. From a behavioral profiles perspective, instead, the model has a considerable number of strict order relations violating the "graphically before" condition (again, compared to the other models).

To sum up, the metrics described in this paper provide consistent results for "extreme models" (i.e., models with very high or very low consistency scores). For average scenarios, instead, the choice depends on what the end user wants to analyze: metric M-BP concentrates on the position of activities, while M-E1 and M-E2 are based on the direction of edges. In Sect. 5.3, we investigate which representation is more in line with human perception.

5.2 Efficiency

All our metrics have been implemented as an extension of the Cheetah Experimental Platform [24]. In order to evaluate the efficiency of these metrics, we compared the time required to compute them for each model. The machine we used is a standard-level laptop with an Intel Core i7-2620M CPU (2.70 GHz), equipped with 8 GB of RAM and a 64-bit Windows 7 OS. The test was performed on a 64-bit Java Virtual Machine, version 1.8.0_25.

Table 5 Time required to compute the metrics for one process model

The final results are reported in Table 5 (all values are expressed in milliseconds). The results report the average, minimum, and maximum computation time required to calculate the metrics over the whole dataset. To provide more general results and to avoid having transient conditions influence the computation, we computed each metric 5 times for each process (i.e., \(5 \times 125 = 625\) computations per metric) and report the average times. As the complexity analyses suggested, all three metrics can be considered very efficient. Of the three, the behavioral profiles-based metric, M-BP, is the most demanding one, since it has to compute the behavioral profiles in advance. Still, the metric can be calculated, on average, within 34 ms. These results clearly show the applicability of the proposed metrics even in settings with high performance requirements (e.g., online and real-time computations, needed, for example, to include such calculations in an intelligent modeling environment that provides recommendations to users based on observed behavior).

5.3 Experimental evaluation

The evaluation reported in Sect. 5.1 shows that the three proposed metrics tend to agree on the assessment of process models with very high and very low consistency of flow. For models with average ratings, the assessment is less consistent. In this section, we aim to evaluate how well the proposed metrics reflect the human perception of consistency of flow. To do that, we conducted an empirical study in which we asked human readers to rate a set of models regarding their flow consistency. We then compared the human assessment with our three metrics and looked for correlations. This way, we can tell which metric is a better approximation of human perception, and to what extent.

Subjects

For our evaluation, we targeted subjects familiar with process modeling. To this end, we decided to target the participants of the BPM 2015 conference, since we expected them to be familiar with process modeling, most of them coming from academia. In total, we collected data from 47 subjects.

Objects

We selected a subset of 14 models from our dataset. The models were sampled according to the standard deviation of their average ranking. In particular, we used the representation provided in Fig. 11: we evenly partitioned the space of average ranking positions and, for each partition, selected two models, one with low standard deviation and one with high standard deviation. In this way, we evenly partitioned our dataset with respect to both the average ranking and the standard deviation, and by selecting one process per partition we guarantee that each partition is represented in our sample; a sketch of this sampling procedure follows Fig. 11.

Fig. 11 Average position in the ranking, with the processes selected for the human assessment represented by large, bright-blue dots

We prepared a questionnaire for our subjects with the 14 selected models. To avoid the ordering of the process models biasing the evaluation, we assembled two versions of the same questionnaire (labeled "A" and "B"), with the processes presented in inverted order.

Response Variables and Data Collection

For each model reported in the questionnaire, we asked participants to rate the consistency of flow on a 7-point Likert scale ranging from 1 “No consistency at all” to 7 “Complete consistency.”

Execution

The actual evaluation was conducted between August 31 and September 3, 2015, during the BPM conference, which took place in Innsbruck. We asked the conference participants to fill out our questionnaire and return their answers to us, without providing any additional instructions. We randomly distributed to each participant either a questionnaire labeled "A" or one labeled "B"; the two questionnaire types present the process models in inverted order. In total, we collected 47 answers (25 labeled "A" and 22 labeled "B").

Data Analysis and Results

Once we had collected all the questionnaires, we rescaled the values from the closed interval [1, 7] into [0, 1] (i.e., \(x \mapsto (x-1)/6\)) to simplify the comparison with the output of our metrics. For each model, we computed the average score assigned by our participants. Then, we compared the average human evaluation against the automatic metrics defined in this paper. The average human evaluation scores of all the models, as well as the values of the three metrics, are reported in Table 6.

Table 6 Descriptive variables for every model of the study. Scores for the three metrics are reported, together with the average score and the standard deviation of the human assessment

The first test we employed was the one-sample Kolmogorov–Smirnov test, in order to verify the normality of our data distribution, which is a precondition for computing the Pearson correlation. The collected data indeed fit a normal distribution with mean 0.541 and standard deviation 0.16; the observed significance score is 0.919.

Table 7 Correlation of our three different approaches computed with average scores assigned by our subjects

Once this condition was verified, we computed the Pearson correlation between each metric and the average human scores. The results are reported in Table 7. They suggest that the metric based on behavioral profiles, M-BP, best reflects the human assessment, with a Pearson correlation of 0.719 and a significance value of 0.004. Please note that the absolute scores of the human evaluation and of M-BP are linearly shifted with respect to each other, by 0.29 on average. Metric M-E2 also obtained a significant correlation with the human judgment (Pearson correlation 0.567, significance 0.034), but the correlation is weaker than that of M-BP. Metric M-E1, in turn, with a Pearson correlation of 0.263 (and significance 0.364), is not correlated with human judgment. Figure 12 depicts the trends of the scores assigned by the three approaches compared with the average human assessments. The high correlation of metric M-BP is reflected in the picture: even though the two curves in Fig. 12c do not overlap (i.e., the actual scores differ), they describe very similar behavior. A comparable effect can be seen in Fig. 12b, which represents M-E2. In Fig. 12a, instead, the curves touch at some points, but the two shapes are more distinct from each other, indicating a lower correlation.
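This statistical pipeline can be reproduced with standard tooling. The following sketch (ours, using SciPy; variable names are hypothetical) mirrors the rescaling, the normality check, and the correlation computation described above.

```python
import numpy as np
from scipy import stats

def rescale(likert):
    """Map 7-point Likert values from [1, 7] onto [0, 1]."""
    return (np.asarray(likert) - 1) / 6

def normality_check(values):
    """One-sample Kolmogorov-Smirnov test against a fitted normal distribution."""
    mean, std = np.mean(values), np.std(values)
    return stats.kstest(values, "norm", args=(mean, std))

def metric_vs_human(metric_scores, human_avg):
    """Pearson correlation (r, p-value) between a metric and human averages."""
    return stats.pearsonr(metric_scores, human_avg)
```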

Fig. 12 Comparison of human assessment and the values calculated by our metrics, on all models of our sample dataset. a M-E1 versus human evaluation. b M-E2 versus human evaluation. c M-BP versus human evaluation

From these results, we can also infer that, when evaluating the consistency of the flow, human perception tends to give more importance to the position of activities than to the actual direction of single edges. This entails the ability to abstract from the drawn path of the edges and, instead, to focus on the actual flow of the activities.

Limitations

As Table 6 reports, the human evaluation shows quite high standard deviation values for almost every model. This is due to the different grading habits of the subjects: some tend never to give the maximum score, whereas others have lower expectations and therefore assign the best grade more easily; a similar situation can be observed for the minimum score. This type of problem is rather common when dealing with intuitive assessments, and we expected such a phenomenon. Grouping our subjects into clusters according to the scores they share (for example, by identifying groups of persons who almost never gave the best or worst grade) could help improve our evaluation and thereby refine our validation. Another way around the problem could be to "normalize" the data; however, in this case, the risk of obtaining unreliable results (by applying an artificial transformation to the input data) is rather high.

6 Related work

Related to our paper is work on the visual layout of business processes in general and work on the flow of a business process model specifically.

Visual Layout of Business Process Models

Research on the visual layout of business process models has largely relied on studies done in the field of graph drawing. A considerable body of knowledge exists on how to automatically set the layout of graphical models in order to improve their readability. Studies in the graph drawing field have mainly explored the following visual layout features: edge crossings, edge bends, the minimum angle between edges leaving a node, orthogonality, symmetry, flow direction, edge length variation, and width of layout [25, 26]. The direct relation between these metrics and understandability has also been investigated.

Research on the aesthetics of graph layout in general [25] found that the feature most important to users is minimizing line crossings; less important are minimizing bends and maximizing symmetry, while other features were not found to have a significant effect. Research on users' preferences regarding UML layout and appearance indicates that users rated features in the following order: arc crossings, orthogonality, direction of flow, arc bends, text direction, width of layout, and font type. Considering process models, [32] explored the understanding of process models by experts and novices with regard to the following layout features: line crossings, edge bends, symmetry, and vicinity of related elements. [5] investigated user preferences regarding layout aesthetics for BPMN models, considering heterogeneous user groups, with the goal of designing a modeling tool for BPMN; the criteria used were line crossings, orthogonal lines, drawing area, line bends, and flow. The findings showed that these layout criteria were most relevant for users with average or greater experience and at least basic education in business process modeling. The layout features described above were all identified as part of the findings of our exploratory study.

Another body of work has developed or evaluated algorithms that change an existing layout of a business process model, manually or automatically, to match a desirable aesthetic pattern for the effective visual layout of a model. Both [9, 10] present algorithms based on a set of constraints targeting a readable layout of a process model (unified flow direction, minimal edge crossings, minimal bending points, usage of a Manhattan layout). An automatic layout of BPMN models is presented in [14], focused on edge positioning. The study in [29] presents a comprehensive framework for personalized process model visualization, meaning that the model’s visual appearance can be tailored to the specific needs of different user groups. In addition, in the field of graph drawing, applied research has developed algorithms and related tools to automatically or manually improve the visual layout of graphs and thus improve their understandability. The GraphEd system in [12] compared and evaluated different graph drawing algorithms while considering the following layout criteria: edge length, edge distribution, area, density, bends, crossings, and orthogonal edges. The work presented in [21] suggested an algorithm which reorders a diagram using orthogonal ordering while preserving the “mental map” of the diagram.

The conclusion from the reviewed works is that various attempts have been made to provide precise metrics for specific layout features. Yet, as far as we know, all existing works address a conveniently selected set of features, typically those that immediately come to mind and can be addressed automatically. In this paper, instead, we present a set of features elicited from human perception. In addition, we extend our own previous work [1] in two main directions. First, we provide three different metrics to formalize and quantify the consistency of the flow feature. Second, we evaluate the newly proposed metrics: we assessed their computational requirements, analyzed their scores over a dataset of process models, and finally compared them with human perception.

Consistency of the Flow

In [26], the consistency of the flow is evaluated with respect to a target direction, which can be parameterized, considering all fragments of all edges. Specifically, the formalization computes the ratio of fragments pointing toward the target direction to the total number of fragments. This definition differs from the definitions devised and reported in this paper in two aspects. First, it considers only the edges’ angle. Second, the approach in [26] can deal with only one direction at a time. Therefore, it could have problems, for example, with processes containing structures similar to the ones reported in Fig. 4, since it considers each fragment independently, or with the process in Fig. 5, which requires the assignment of two directions to each edge.
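For concreteness, the sketch below shows how we read the fragment-based definition of [26]: each edge is treated as a polyline, each straight segment (fragment) is classified by whether it points toward the target direction, and the metric is the ratio of agreeing fragments to all fragments. The function and its representation of edges are our own illustrative assumptions, not code from [26].

```python
# Sketch of the fragment-ratio flow metric of [26], under our reading.
# edges: list of polylines, each a list of (x, y) points; target: unit
# vector of the desired flow direction (default: rightward).
def flow_consistency(edges, target=(1.0, 0.0)):
    agreeing = total = 0
    for polyline in edges:
        # Each consecutive pair of points delimits one straight fragment.
        for (x1, y1), (x2, y2) in zip(polyline, polyline[1:]):
            dx, dy = x2 - x1, y2 - y1
            total += 1
            # Positive dot product: the fragment points toward the target.
            if dx * target[0] + dy * target[1] > 0:
                agreeing += 1
    return agreeing / total if total else 1.0  # no fragments: vacuously consistent
```

Because each fragment is classified independently and against a single target direction, structures like those in Figs. 4 and 5 are exactly where such a definition and ours diverge.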

Also related to our work is the study reported in [5], which looks into the impact of model aesthetics. There, the operationalization of flow remains somewhat unclear and is only informally stated as “edges are drawn such that they consider the reading direction.” The applied notion of consistency again focuses on edges; activities as relevant features characterizing flow are not considered.

Another study on the impact of the flow direction of a model is reported in [6]. That paper compares, in an experimental setting, the effect of different flow directions (left to right, right to left, top to bottom, or bottom to top) on model comprehension. While the models considered in that study all had a consistent flow direction, the focus of our work is a metric able to deal with models that have only a partially consistent flow direction.

7 Conclusions and future work

The visual layout of a model, i.e., the way its elements are laid out on the canvas, is an important factor for the user’s understanding of the model. Since layout properties are mostly not addressed by modeling languages, and in the absence of enforced layout conventions, the modeler has much freedom in deciding how a model is laid out. A common set of properties that can be used to describe different aspects of layout is an essential basis for developing an understanding of layout, appropriate conventions, and tools that enforce them.

The contribution of this paper is twofold. First, through a human-centered investigation, visual layout features in business process models were elicited based on what users perceive as important. Second, we concentrated on one of these features: the consistency of the flow. We operationalized this feature, which is not trivial to quantify, into three possible metrics designed to closely reflect human perception. The outcomes of the evaluations we performed, both automatic and involving humans, show that all metrics can be calculated very efficiently. Moreover, two of the proposed metrics correlate with human perception; in particular, metric M-BP achieved the highest correlation value. This suggests that, in this context, the position of activities is a more important feature than the orientation of the edges.

Currently, all the metrics are implemented in a process modeling tool, allowing the layout properties of every model to be precisely “measured.” Yet, validation has been performed only for the flow consistency feature. As future work, we aim to validate additional metrics in order to corroborate the correlation between the calculated metrics and the human perception of models’ layout. In addition, future work may include quantitative studies to experimentally test to what degree these layout features indeed affect user comprehension. Once such a link has been established, our metrics could serve as an optimization criterion for the automatic drawing of process models.

As future work, we also plan to adapt the metrics reported in this paper so that they can be used “on the fly,” during the process design phase (i.e., even before the process is completely modeled). This way, we could provide continuous and interactive feedback to the user modeling the process.