# Fast business process similarity search

DOI: 10.1007/s10619-012-7089-z

- Cite this article as:
- Yan, Z., Dijkman, R. & Grefen, P. Distrib Parallel Databases (2012) 30: 105. doi:10.1007/s10619-012-7089-z

## Abstract

Nowadays, it is common for organizations to maintain collections of hundreds or even thousands of business processes. Techniques exist to search through such a collection, for business process models that are similar to a given query model. However, those techniques compare the query model to each model in the collection in terms of graph structure, which is inefficient and computationally complex. This paper presents an efficient algorithm for similarity search. The algorithm works by efficiently estimating model similarity, based on small characteristic model fragments, called features. The contribution of this paper is threefold. First, it presents three techniques to improve the efficiency of the currently fastest similarity search algorithm. Second, it presents a software architecture and prototype for a similarity search engine. Third, it presents an advanced evaluation of the algorithm. Experiments show that the algorithm in this paper helps to perform similarity search about 10 times faster than the original algorithm.

### Keywords

Business process · Feature · Similarity · Search

## 1 Introduction

Nowadays, business process management techniques develop quickly in both academic and industrial fields. To increase the flexibility and controllability of the management of organizations, business processes are used to describe the services that an organization provides and the internal processes that implement those services. As a result, it is common to see collections of hundreds or even thousands of business process models. For example, the collection of SAP reference models consists of more than 600 business process models [6], and the collection of the reference models for Dutch Local Government contains a similar number of models [9]. As business process model collections increase in size, tools and techniques are required to manage them. This includes tools and techniques for quickly searching through a collection, for business process models that meet certain criteria. These criteria can be specified by means of a query language [1, 5], but also by means of (a part of) a query business process model for which similar models must be retrieved [7, 8, 27].

Techniques of the latter kind are called *similarity search* techniques. Figure 1 shows an example of business process similarity search. It shows one *query model* and five *process models* in the BPMN notation. Given a query model, a similarity search technique should return only those process models that are similar to the query model, and it should return them in order of their similarity to the query model. In the example, the technique could return models 1, 2 and 3.

Similarity search techniques already exist [7, 8, 27]. However, these techniques focus on defining a metric to compute the similarity between two process models. To rank the business process models in a collection, the similarity of each of the process models to the query model must be computed. Subsequently, the process models must be ordered according to their similarity. At the same time, business process model collections are increasing in size. For example, Suncorp-Metway Ltd [16] maintains a collection of more than 6000 business process models. Comparing a query model with such a large number of models is time consuming and can cause a similarity search operation to take multiple seconds or even minutes, depending on the metric and algorithm that is used, while a search engine should answer a query within milliseconds. (Compare, for example, the response time that you would require of an Internet search engine.)

*and* fast. This paper presents such an algorithm. It is developed by extending an existing fast similarity search algorithm [32], by:

1. introducing preprocessing techniques, to reduce the search space that must be processed and, therewith, the number of iterations that must be performed by the algorithm;
2. introducing incremental computation in the algorithm, thus reducing the complexity of each iteration; and
3. introducing prediction techniques to predict which choices in the algorithm will lead to the best result, thus reducing the number of iterations that must be performed to arrive at the best result.

*potentially relevant* models (e.g., models 2 and 3). Finally, the models in the collection are ranked according to their (estimated) similarity to the query model.

Two experiments were performed to evaluate the algorithm in this paper. The first experiment evaluates the use case in which a model is taken from a collection and subsequently similar models in the same collection are searched. The results of this experiment show that, for this use case, the algorithm helps to retrieve similar models 6.7 times faster than the original algorithm, without impacting the quality of the results. It helps to retrieve similar models 8.6 times faster if a quality reduction of 1% is acceptable. The second experiment evaluates the use case in which the model that is searched is not from the same collection. It shows that, for this use case, the algorithm helps to retrieve similar models at least 8 times faster with a 4% quality reduction; and 10.7 times faster with a 7% quality reduction.

The rest of the paper is organized as follows. Section 2 defines the concept of feature and presents features that can be used for business process similarity estimation (step 1 in Fig. 2). Section 3 defines metrics for measuring the similarity of features and checking whether features match (step 2 in Fig. 2). Section 4 presents metrics to determine whether a model is relevant, irrelevant or potentially relevant to a query, based on the features that match with features from the query model (step 3 in Fig. 2). Section 5 presents the greedy algorithm for business process similarity search along with indexing techniques and efficiency improvements (step 4 in Fig. 2). Section 6 explains how business processes can be ranked according to their similarity to a given query model (step 5 in Fig. 2). Section 7 presents a software architecture and a prototype for a business process model repository that uses the algorithm in this paper for doing similarity search. Section 8 presents the experiments that were performed to evaluate the properties of the algorithm that was introduced. Section 9 presents related work and Sect. 10 concludes the paper.

## 2 Business process model features

In this paper features are defined as simple but representative abstractions of business process models. Their simplicity allows similarity computation based on them to be fast and their representativeness ensures that their similarity is strongly related to similarity of the business process models themselves. This makes features very suitable as means to quickly estimate the similarity of business process models. Provided that we choose business process model features carefully, we can further speed up similarity search by building an index of business process models based on those features. In this section, we present the business process model features that we explore in this paper.

Labels can be conveniently used as features, because they are simple strings and therefore qualify as simple abstractions. In addition, indexing mechanisms for strings are well-known, which enables indexing of label features. However, it is harder to use the structure of a business process model as a feature. In fact, considering the structure of a graph when computing the similarity between business process models in our previous work is what makes the problem computationally hard. Therefore, we consider the structure of a business process model in terms of the simpler structural features: start, stop, sequence, split, and join. We define these features on the abstraction of a business process graph.

### Definition 1

(Business Process Graph, Pre-set, Post-set)

Let \(\mathcal{L}\) be a set of labels. A business process graph is a tuple (*N*, *E*, *λ*), in which:

- *N* is the set of nodes;
- *E* ⊆ *N* × *N* is the set of edges; and
- \(\lambda: N \rightarrow \mathcal{L}\) is a function that maps nodes to labels.

Let *G* = (*N*, *E*, *λ*) be a business process graph and *n* ∈ *N* be a node: •*n* = {*m* | (*m*, *n*) ∈ *E*} is the pre-set of *n*, while *n*• = {*m* | (*n*, *m*) ∈ *E*} is the post-set of *n*.

A business process graph is a graph representation of a business process model. As such, it is an abstraction of a business process model that focuses purely on the structure of that model, while abstracting from other aspects. We define our similarity search techniques on business process graphs to be independent of a specific notation.

Based on this the structural features are defined in Definition 2.

### Definition 2

(Structural Business Process Model Features)

Let *G* = (*N*, *E*, *λ*) be a business process graph.

- A start feature is a node *n* ∈ *N* that has an empty pre-set;
- A stop feature is a node *n* ∈ *N* that has an empty post-set;
- A sequence feature of size *s* is a list of nodes [*n*_{1}, *n*_{2}, *n*_{3}, …, *n*_{s}] ⊆ *N*, such that (*n*_{1}, *n*_{2}) ∈ *E*, (*n*_{2}, *n*_{3}) ∈ *E*, …, (*n*_{s−1}, *n*_{s}) ∈ *E*, for *s* ≥ 2;
- A split feature of size *s* is a split node *n* and a set of nodes {*n*_{1}, *n*_{2}, …, *n*_{s−1}} ⊆ *N*, such that (*n*, *n*_{1}) ∈ *E*, (*n*, *n*_{2}) ∈ *E*, …, (*n*, *n*_{s−1}) ∈ *E*, for *s* ≥ 3;
- A join feature of size *s* is a join node *n* and a set of nodes {*n*_{1}, *n*_{2}, …, *n*_{s−1}} ⊆ *N*, such that (*n*_{1}, *n*) ∈ *E*, (*n*_{2}, *n*) ∈ *E*, …, (*n*_{s−1}, *n*) ∈ *E*, for *s* ≥ 3.

For example, for graph 1 in Fig. 3, the label feature set is {Buy Goods, Receive Goods, Verify Invoice}, the start feature set is {Buy Goods} (using node labels to identify nodes), the stop feature set is {Verify Invoice}, the sequence feature set is {(Buy Goods, Receive Goods), (Buy Goods, Verify Invoice), (Receive Goods, Verify Invoice)}, the split feature set is {(Buy Goods,{Receive Goods, Verify Invoice})}, and the join feature set is {({Buy Goods, Receive Goods},Verify Invoice)}.
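The feature sets in this example can be derived mechanically from the pre-sets and post-sets of the nodes. The following Python sketch illustrates this for size-2 sequences and size-3 splits and joins only. The function names are hypothetical, and the edge set of graph 1 is an assumption read off from the sequence features listed above, since the figure itself is not reproduced here.

```python
from itertools import combinations


def pre_set(edges, n):
    """Nodes that have an edge into n."""
    return {src for (src, dst) in edges if dst == n}


def post_set(edges, n):
    """Nodes that n has an edge to."""
    return {dst for (src, dst) in edges if src == n}


def structural_features(nodes, edges):
    """Collect start, stop, size-2 sequence, size-3 split and size-3 join features."""
    splits, joins = set(), set()
    for n in nodes:
        # Every pair of successors of n forms a split feature of size 3.
        for pair in combinations(sorted(post_set(edges, n)), 2):
            splits.add((n, frozenset(pair)))
        # Every pair of predecessors of n forms a join feature of size 3.
        for pair in combinations(sorted(pre_set(edges, n)), 2):
            joins.add((frozenset(pair), n))
    return {
        "start": {n for n in nodes if not pre_set(edges, n)},
        "stop": {n for n in nodes if not post_set(edges, n)},
        "sequence": set(edges),  # sequences of size 2 are exactly the edges
        "split": splits,
        "join": joins,
    }


# Assumed structure of graph 1 (nodes are identified by their labels).
GRAPH1_NODES = {"Buy Goods", "Receive Goods", "Verify Invoice"}
GRAPH1_EDGES = {
    ("Buy Goods", "Receive Goods"),
    ("Buy Goods", "Verify Invoice"),
    ("Receive Goods", "Verify Invoice"),
}
features = structural_features(GRAPH1_NODES, GRAPH1_EDGES)
```

Under these assumptions, `features` reproduces the start, stop, sequence, split and join feature sets given in the example.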

Many more possible features can be considered in business process models, depending on the business process model aspects that are taken into account (e.g., the organizational aspect or the data aspect), the desired performance of the algorithm (adding more features decreases performance) and the desired quality of the results (adding more features is expected to increase the quality of the search results). In this paper we focus on the most basic process model features. Extensions are possible and are a topic for future work.

## 3 Feature similarity, matching and indexing

It is possible to use the similarity of the features of two business process models as an estimator of the similarity of the business process models themselves. To this end, metrics must be defined that quantify the similarity of the business process model features. We say that two features that are sufficiently similar are *matching features* and we show how we can determine feature matching based on their similarity. The ratio of matching features will be used in the next section as an estimator of the similarity of business process models. To be able to quickly identify matching features, and therewith similar business process models, feature indices must be defined.

This section first presents metrics to quantify feature similarity. Second, it explains how a feature match can be determined based on feature similarity and, third, it presents feature-based indices.

### 3.1 Feature similarity

Label feature similarity can be measured in a number of different ways [8, 27]. For illustrative purposes we will use a syntactic similarity metric, which is based on string edit-distance, in this paper. However, in realistic cases more advanced metrics should be used that take synonyms and stemming [8, 27] and, if possible, domain ontologies into account [11]. Label feature similarity is defined as follows in previous work [7, 8].

### Definition 3

(Label Feature Similarity)

Let *G* = (*N*, *E*, *λ*) be a business process graph, let *n*, *m* ∈ *N* be two nodes and let |*l*| represent the number of characters in a label *l*. The string edit distance of the labels *λ*(*n*) and *λ*(*m*) of the nodes, denoted ed(*λ*(*n*), *λ*(*m*)), is the minimal number of atomic string operations needed to transform *λ*(*n*) into *λ*(*m*) or vice versa. The atomic string operations are: inserting a character, deleting a character or substituting a character for another. The label feature similarity of *λ*(*n*) and *λ*(*m*), denoted lsim(*n*, *m*), is:

\[\mathrm{lsim}(n,m) = 1.0 - \frac{\mathrm{ed}(\lambda(n),\lambda(m))}{\max(|\lambda(n)|,|\lambda(m)|)}\]

For example, the string edit distance between ‘Transportation planning and processing’ and ‘Transporting’ is 26: delete the 26 characters ‘ation planning and process’. Consequently, the label feature similarity is \(1.0 - \frac{26}{38}\approx 0.32\), where 38 is the length of the longer label. Optional preprocessing steps, such as lower-casing and removing special characters, can improve the results of feature similarity measurements.
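The computation in this example can be reproduced with a standard dynamic-programming edit distance. The sketch below is illustrative only: the function names are hypothetical, and the normalization by the length of the longer label is an assumption that follows the worked example (26/38).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def lsim(label_a: str, label_b: str) -> float:
    """Label similarity: 1 minus the edit distance over the longer label length."""
    longest = max(len(label_a), len(label_b))
    if longest == 0:
        return 1.0  # assumption: two empty labels are treated as identical
    return 1.0 - edit_distance(label_a, label_b) / longest


distance = edit_distance("Transportation planning and processing", "Transporting")
similarity = lsim("Transportation planning and processing", "Transporting")
```

The row-by-row formulation keeps only two rows of the distance table in memory, which is sufficient because only the final distance is needed.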

The drawback of measuring similarity only by labels is that similar tasks can have different labels. Therefore, it may be hard to determine task similarity solely based on label similarity. For example, in Fig. 3, ‘Buy Special Goods Online’ of Graph 2 and ‘Purchase Commodities’ of Graph 3 are related to ‘Buy Goods’ of the query graph. However, compared with ‘Buy Goods’, ‘Buy Special Goods Online’ is more verbose and ‘Purchase Commodities’ uses synonyms. Therefore, they may not match based on label similarity. To deal with this situation, we use structural information together with the labels.

We can measure the structural similarity of two nodes, by determining the similarity of the (structural) roles that they have in their business process graphs. We distinguish five different roles that nodes can have: start, stop, regular (sequence), split or join. We do not distinguish the type of splits or joins (e.g., XOR or AND), because we established in previous work [7, 8] that the similarity of the types of two splits or two joins is a bad indication for whether they are similar.

### Definition 4

(Role Feature)

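Let *n* ∈ *N* be a node and \(\mathcal{R} = \{\mathrm{start},\mathrm{stop},\mathrm{split}, \mathrm{join}, \mathrm{regular}\}\) be a set of roles that a node can have. The roles of *n* are determined by the function \(\mathrm{roles}{:}\ N \rightarrow \mathbb{P}(\mathcal{R})\), such that

\[\begin{aligned}
\mathrm{start} \in \mathrm{roles}(n) &\Leftrightarrow |{\bullet}n| = 0\\
\mathrm{stop} \in \mathrm{roles}(n) &\Leftrightarrow |n{\bullet}| = 0\\
\mathrm{split} \in \mathrm{roles}(n) &\Leftrightarrow |n{\bullet}| \geq 2\\
\mathrm{join} \in \mathrm{roles}(n) &\Leftrightarrow |{\bullet}n| \geq 2\\
\mathrm{regular} \in \mathrm{roles}(n) &\Leftrightarrow |{\bullet}n| = 1 \wedge |n{\bullet}| = 1
\end{aligned}\]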

Roles of nodes are considered to be similar or not with respect to the input and output paths of the nodes. The definition of role feature similarity is inspired by string edit-distance, i.e., mainly considering the differences between numbers of input (output) paths of two nodes. Formally, role feature similarity is defined as follows:

### Definition 5

(Role Feature Similarity)

Let *n*, *m* ∈ *N* be two nodes. The role feature similarity of these two nodes, denoted rsim(*n*, *m*), is defined as:

where croles = roles(*n*) ∩ roles(*m*).

This formula covers all possible combinations of roles that nodes can have. For example, the situation in which both nodes are split nodes as well as join nodes is covered by the case ‘otherwise’ (start∉croles∧stop∉croles). The situation in which both nodes are regular nodes is covered by the same case and leads to a role feature similarity score of 1.

The drawback of measuring role similarity in this way is that it does not account for the large differences in how frequently the different role features occur. Therefore, using the role similarity metric in this way is ineffective: if we give a bonus for matching role features, most nodes would receive that bonus, so the effect of the bonus would be minimal.

For that reason we refine the role similarity metric to take this effect into account. We do that by not considering features that appear too frequently in the dataset; we say that those features lack ‘discriminative power’.

### Definition 6

(Discriminative Role Features)

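Let \(r \in \mathcal{R}\) be a role feature. The role feature is discriminative, denoted discriminative(*r*), if and only if the fraction of the nodes that have the feature is sufficiently small:

\[\frac{|\{n \in N \mid r \in \mathrm{roles}(n)\}|}{|N|} \leq \mathit{dcutoff}\]

where *dcutoff* is a cutoff value that determines when the fraction of nodes that have the feature is sufficiently small. This cutoff value is a parameter that can be set as desired, to produce the best results.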

In general, a good setting for *dcutoff* is easy to determine, because there is a large difference between the frequency of features with a low frequency of occurrence and features with a high frequency of occurrence. For example, in the set of business process models that we use for evaluation in this paper, there are 374 nodes in total. Of these nodes, 178 have the ‘stop’ role, 153 have the ‘start’ role, 58 the ‘regular’ role, 52 the ‘split’ role, 36 the ‘join’ role. Here, we have far more nodes with the ‘start’ and ‘stop’ roles than other nodes. Hence, if we set the *dcutoff* anywhere between 0.16 and 0.40, ‘start’ and ‘stop’ role features are not considered discriminative, while other role features *are* considered discriminative. We incorporate the discriminative power of role features into their similarity using the following formula.
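The effect of the cutoff on the role counts quoted above can be checked with a few lines of Python. This is a sketch only: the function name is hypothetical, and the use of a non-strict comparison (fraction ≤ *dcutoff*) is an assumption.

```python
# Role counts reported for the evaluation collection (374 nodes in total).
ROLE_COUNTS = {"stop": 178, "start": 153, "regular": 58, "split": 52, "join": 36}
TOTAL_NODES = 374


def discriminative_roles(role_counts, total_nodes, dcutoff):
    """Roles whose fraction of nodes does not exceed dcutoff (assumed to be <=)."""
    return {role for role, count in role_counts.items()
            if count / total_nodes <= dcutoff}


fractions = {role: count / TOTAL_NODES for role, count in ROLE_COUNTS.items()}
chosen = discriminative_roles(ROLE_COUNTS, TOTAL_NODES, dcutoff=0.3)
```

With these counts, the ‘regular’, ‘split’ and ‘join’ roles cover roughly 16%, 14% and 10% of the nodes, while ‘start’ and ‘stop’ cover roughly 41% and 48%, so any cutoff in the stated range separates the two groups.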

### Definition 7

(Role Feature Similarity with Discriminative Power)

Let *n*, *m* ∈ *N* be two nodes. Their role feature similarity with discriminative power, denoted rdsim(*n*, *m*), is defined as:

### 3.2 Feature matching

We say that two features are matched if they are sufficiently similar. What is considered to be sufficient is determined by cutoff parameters that can be set accordingly. If two business process models have sufficiently many matching features, we consider them similar. This is explained in the next section.

We consider two node features to match, if their component features (label features and role features) match. Strong label feature similarity is a strong indication that two nodes are matched, while a combination of role feature similarity and (less strong) label feature similarity is also an indication that two nodes are matched. We distinguish between these two cases when determining a node feature match, such that we can set different thresholds for label similarity in case there is also role similarity and in case there is no role similarity.

### Definition 8

(Node Feature Match)

Let *n*, *m* ∈ *N* be two node features with their respective label features and role features. The node features match, if they satisfy one of the following two rules:

- their label features are similar to a high degree, i.e., lsim(*n*, *m*) ≥ lcutoff_{high};
- their role features are similar, and their label features are similar to a medium degree, i.e., rdsim(*n*, *m*) ≥ rcutoff and lsim(*n*, *m*) ≥ lcutoff_{med}.

lcutoff_{high}, rcutoff and lcutoff_{med} are parameters that determine what is considered to be similar to what degree. The parameters can be set as desired, to produce the best results.
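The two matching rules translate directly into a small predicate. In the Python sketch below the function name is hypothetical, the similarity scores are passed in as plain numbers, and the threshold values used in the usage lines are illustrative assumptions, not values recommended by the paper.

```python
def node_features_match(lsim_value, rdsim_value,
                        lcutoff_high, rcutoff, lcutoff_med):
    """Apply the two node feature matching rules to precomputed similarity scores."""
    # Rule 1: the labels alone are similar to a high degree.
    if lsim_value >= lcutoff_high:
        return True
    # Rule 2: the roles are similar and the labels are similar to a medium degree.
    return rdsim_value >= rcutoff and lsim_value >= lcutoff_med


# Illustrative thresholds (assumptions for this sketch only).
CUTOFFS = dict(lcutoff_high=0.8, rcutoff=1.0, lcutoff_med=0.2)

strong_label = node_features_match(0.9, 0.0, **CUTOFFS)  # rule 1 applies
role_backed = node_features_match(0.3, 1.0, **CUTOFFS)   # rule 2 applies
no_match = node_features_match(0.3, 0.5, **CUTOFFS)      # neither rule applies
```

Keeping the two label thresholds separate is what allows a weaker label similarity to be accepted only when the roles also agree.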

We consider two structural features to match, if their component features (node features) match.

### Definition 9

(Structural Feature Match)

Two start features with nodes *n* and *m* match, if and only if their node features are matched. A stop feature match is defined similarly.

Two sequence features of size *s* with lists of nodes *Ln*=[*n*_{1},*n*_{2},*n*_{3},…,*n*_{s}] and *Lm*=[*m*_{1},*m*_{2},*m*_{3},…,*m*_{s}] are matched if and only if for each 1≤*i*≤*s*: the node features of *n*_{i} and *m*_{i} are matched.

Two split features of size *s* with split nodes *n* and *m* and sets of nodes *Sn*={*n*_{1},*n*_{2},…,*n*_{s−1}} and *Sm*={*m*_{1},*m*_{2},…,*m*_{s−1}} are matched if and only if the node features of nodes *n* and *m* are matched and there exists a mapping Map:*Sn*→*Sm* such that for each (*sn*,*sm*)∈Map: the node features of *sn* and *sm* are matched. A join feature match is defined similarly.

Features of different types or sizes are never matched with each other. We can use these two definitions to define general feature matching.

### Definition 10

(Feature Match)

Let *f*_{1} and *f*_{2} be two features. *f*_{1} and *f*_{2} match, denoted match(*f*_{1},*f*_{2}), if and only if they are of the same type and they match according to Definition 8 in case they are node features or Definition 9 in case they are structural features.
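Structural feature matching can be sketched on top of any node feature match predicate. In the Python sketch below the function names are hypothetical, the node match is a stand-in (case-insensitive label equality) rather than Definition 8, and the mapping between the branch nodes of two splits is assumed to be one-to-one.

```python
from itertools import permutations


def sequences_match(ln, lm, node_match):
    """Sequences match if they have equal size and match position by position."""
    return len(ln) == len(lm) and all(node_match(n, m) for n, m in zip(ln, lm))


def splits_match(n, sn, m, sm, node_match):
    """Splits match if the split nodes match and the branch nodes can be paired up."""
    if len(sn) != len(sm) or not node_match(n, m):
        return False
    branches = list(sn)
    # Try every one-to-one pairing of branch nodes (assumption: Map is a bijection).
    return any(all(node_match(a, b) for a, b in zip(branches, perm))
               for perm in permutations(sm))


def same_label(a, b):
    """Hypothetical stand-in for the node feature match."""
    return a.lower() == b.lower()


seq_ok = sequences_match(["receive goods", "consume goods"],
                         ["Receive Goods", "Consume Goods"], same_label)
split_ok = splits_match("buy goods", {"receive goods", "verify invoice"},
                        "Buy Goods", {"Verify Invoice", "Receive Goods"}, same_label)
split_bad = splits_match("buy goods", {"receive goods", "verify invoice"},
                         "Buy Goods", {"Verify Invoice", "Ship Goods"}, same_label)
```

Enumerating permutations is acceptable here because split and join features are small; for larger features a bipartite matching algorithm would be the natural replacement.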

### 3.3 Feature indexing

Node feature matching is mainly based on label similarity and, indirectly, structural feature matching is as well, because it is based on node feature matching. Therefore, if we can find similar labels more efficiently, we can do feature matching more efficiently. We use two indexing techniques to find similar labels more efficiently.

First, we use an M-Tree index [2] on node labels. An M-Tree index is specifically meant for quickly finding items that are similar to a given item to a given degree. In our case, we use it to quickly find nodes with labels that have a similarity (Definition 3) to a given node label that is higher than a specified cutoff (lcutoff_{high} or lcutoff_{med} in Definition 8).

Second, we use an inverted index [20] that maps node labels to nodes, such that, given a node label, we can quickly find the nodes with that label. We use the inverted index, because multiple nodes with the same label may exist in a collection of business process models. For example, in the set of business process models that we use for evaluation in this paper, there are 374 labels, but only 190 distinct ones. The inverted index can prevent comparing identical labels repeatedly.
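An inverted index of this kind can be sketched as a dictionary from labels to node identifiers. The function name and the `(node_id, label)` input format below are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(labelled_nodes):
    """Map each distinct label to the list of node ids that carry it."""
    index = defaultdict(list)
    for node_id, label in labelled_nodes:
        index[label].append(node_id)
    return index


# Hypothetical nodes drawn from several models in a collection.
nodes = [(1, "Buy Goods"), (2, "Receive Goods"), (3, "Buy Goods")]
index = build_inverted_index(nodes)
distinct_labels = len(index)
```

A similarity lookup then only needs to compare the query label against each distinct label once and expand the matching labels to nodes through the index.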

Furthermore, we can build a ‘parent-child’ index that exploits the fact that features of a larger size (in terms of the number of nodes) are composed of features of a smaller size. For example: sequence features of size 2 are composed of node features (of size 1); sequence features of size 3 are composed of sequence features of size 2 and, indirectly, of node features (of size 1). We call the (larger) composed features ‘child’ features and the (smaller) component features ‘parent’ features.

### Definition 11

(Parent Feature, Child Feature)

If feature A can generate feature B by adding some node(s), feature A is a parent feature of feature B, and feature B is a child feature of feature A. If feature A can generate feature B by adding a single node, feature A is a direct parent feature of feature B, and feature B is a direct child feature of feature A.

For example, suppose that we need to search for models that are similar to a model that consists of only a single sequence: (‘receive goods’, ‘consume goods’). The model, therefore, contains node features ‘receive goods’ and ‘consume goods’, as well as sequence feature (‘receive goods’, ‘consume goods’). Starting the similarity search in the ‘parent-child’ index, we first find matches (by using an M-Tree lookup) between node features ‘receive goods’ and ‘Goods Receipt’ and between node features ‘consume goods’ and ‘Consume Goods’. We cache these matches. When subsequently looking for a match for the sequence feature (‘receive goods’, ‘consume goods’), we first look at the cached matches of its parent features (which are ‘Goods Receipt’ and ‘Consume Goods’) and their common child features, the sequence (‘Goods Receipt’, ‘Consume Goods’). Subsequently, we only have to establish the match between this sequence and the given sequence.

## 4 Feature-based similarity estimation

We use the fraction of matching features between two business process models to estimate their similarity, as shown in Definition 12.

### Definition 12

(Estimated Business Process Model Similarity)

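Consider a query graph *G*_{q} and another process graph *G*, with feature sets *F*_{q} and *F* derived from *G*_{q} and *G*. The estimated business process similarity, denoted ESim(*G*_{q}, *G*), is the number of features in *G*_{q} or *G* that are matched by a feature in the other process graph, divided by the number of all features in *G*_{q} and *G*:

\[\mathrm{ESim}(G_{q},G) = \frac{|\{f_{q} \in F_{q} \mid \exists f \in F: \mathrm{match}(f_{q},f)\}| + |\{f \in F \mid \exists f_{q} \in F_{q}: \mathrm{match}(f_{q},f)\}|}{|F_{q}| + |F|}\]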

Note that we count the number of features in *G*_{q} that match a feature in *G* separately from the number of features in *G* that match a feature in *G*_{q}, because the match is not necessarily one-to-one. For example, a label feature ‘Fill-out Request Forms’ can match with label features ‘Fill-out Requester’s Detail’ and ‘Fill-out Request Details’ in the other process graph.
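The two-sided counting can be made concrete with a short Python sketch. The function name is hypothetical, the match predicate is a toy stand-in (shared 16-character prefix) rather than Definition 10, the extra label 'Archive Request' is invented for the example, and returning 0.0 for two empty feature sets is an assumption.

```python
def esim(query_features, graph_features, match):
    """Fraction of features on either side that are matched by the other side."""
    total = len(query_features) + len(graph_features)
    if total == 0:
        return 0.0  # assumption: no features means no evidence of similarity
    matched_q = sum(1 for fq in query_features
                    if any(match(fq, f) for f in graph_features))
    matched_g = sum(1 for f in graph_features
                    if any(match(fq, f) for fq in query_features))
    return (matched_q + matched_g) / total


def toy_match(a, b):
    """Hypothetical match: the labels share their first 16 characters."""
    return a[:16].lower() == b[:16].lower()


query = ["Fill-out Request Forms"]
graph = ["Fill-out Requester's Detail", "Fill-out Request Details", "Archive Request"]
estimate = esim(query, graph, toy_match)
```

Here one query feature matches two graph features, so the numerator is 1 + 2 = 3 and the denominator is 1 + 3 = 4, which shows why the two directions are counted separately.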

Based on the estimated graph similarity, we can classify graphs as relevant, irrelevant or potentially relevant to a query graph. We do that by defining the minimal estimated similarity that a graph must have to the query graph to be considered relevant and the minimal estimated similarity that a graph must have to be considered potentially relevant. We return relevant graphs directly, check the potentially relevant graphs with expensive similarity search algorithms [7, 27], and discard irrelevant graphs.

### Definition 13

(Graph Relevance Classification)

Given a query graph *G*_{q} and another process graph *G*, we classify *G* as:

- relevant to *G*_{q} if and only if ESim(*G*_{q}, *G*) ≥ ratio_{r};
- potentially relevant to *G*_{q} if and only if ratio_{r} > ESim(*G*_{q}, *G*) > ratio_{p};
- irrelevant to *G*_{q} if and only if ratio_{p} ≥ ESim(*G*_{q}, *G*).

ratio_{r} and ratio_{p} are parameters that determine when a process graph is considered to be relevant, potentially relevant or irrelevant and can be set as desired, to produce the best results.
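The three-way classification is a pair of threshold comparisons. In the Python sketch below the function name is hypothetical and the threshold values in the usage lines are illustrative assumptions.

```python
def classify(esim_value, ratio_r, ratio_p):
    """Classify a graph by its estimated similarity to the query graph."""
    if esim_value >= ratio_r:
        return "relevant"
    if esim_value > ratio_p:
        return "potentially relevant"
    return "irrelevant"


# Illustrative thresholds (assumptions for this sketch only).
labels = [classify(value, ratio_r=0.5, ratio_p=0.2) for value in (0.75, 0.3, 0.2)]
```

Only the graphs in the middle band need to be passed on to the more expensive exact similarity computation.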

## 5 The improved greedy algorithm for process similarity search

In Sect. 4, we classify process models as relevant, potentially relevant or irrelevant to a given query model. Potentially relevant models still need to be checked by algorithms that can compute exact process similarity. This section explains how to do this. In previous work, a metric is defined to measure process similarity, and algorithms are given to compute the similarity automatically [7]. In this section, we briefly introduce the metric and the currently fastest algorithm to compute the similarity automatically, the greedy algorithm [7]. Then we propose three improvements to the greedy algorithm to further improve its performance.

### 5.1 The greedy algorithm for process similarity search

The similarity of two business process graphs is defined as a metric based on the graph edit distance, as described in Definition 14.

### Definition 14

(Graph Similarity)

Let *G*_{1}=(*N*_{1},*E*_{1},*λ*_{1}) and *G*_{2}=(*N*_{2},*E*_{2},*λ*_{2}) be two graphs. The graph edit distance between two graphs is the minimal number of atomic operations needed to transform *G*_{1} into *G*_{2} or vice versa. Atomic operations include inserting, deleting, and substituting nodes and edges. Let *M*:*N*_{1}↛*N*_{2} be a partial injective mapping that maps *N*_{1} to *N*_{2}. Let *n*_{1}∈*N*_{1} be a node in *G*_{1}. *n*_{1} is a substituted node if and only if ∃*n*_{2}∈*N*_{2}, *M*(*n*_{1})=*n*_{2}, and accordingly *n*_{2} is also a substituted node. A node *n*∈*N* is a skipped node if and only if it is not a substituted node. Let *n*_{11},*n*_{12}∈*N*_{1} and (*n*_{11},*n*_{12})∈*E*_{1} be two nodes and an edge of *G*_{1}. (*n*_{11},*n*_{12}) is a skipped edge if and only if \(\not\exists (n_{21},n_{22})\in E_{2}\), *M*(*n*_{11})=*n*_{21}∧*M*(*n*_{12})=*n*_{22}. Similarly, we can define the skipped edge in *G*_{2}.

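Let \(\mathit{subn}\) and \(\mathit{skipn}\) be the sets of substituted and skipped nodes, and let \(\mathit{skipe}\) be the set of skipped edges. The fraction of skipped nodes, the fraction of skipped edges and the average distance of substituted nodes are:

\[\mathit{fskipn} = \frac{|\mathit{skipn}|}{|N_{1}|+|N_{2}|},\qquad \mathit{fskipe} = \frac{|\mathit{skipe}|}{|E_{1}|+|E_{2}|},\qquad \mathit{fsubn} = \frac{2.0\cdot\sum_{(n,m)\in M}\bigl(1.0-\mathrm{lsim}(n,m)\bigr)}{|\mathit{subn}|}\]

Given weights \(\mathit{wskipn}\), \(\mathit{wskipe}\) and \(\mathit{wsubn}\), the partial graph similarity induced by a mapping *M*, denoted as GSim(*G*_{1}, *G*_{2}, *M*), is defined as follows:

\[\mathrm{GSim}(G_{1},G_{2},M) = 1.0 - \frac{\mathit{wskipn}\cdot\mathit{fskipn} + \mathit{wskipe}\cdot\mathit{fskipe} + \mathit{wsubn}\cdot\mathit{fsubn}}{\mathit{wskipn}+\mathit{wskipe}+\mathit{wsubn}}\]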

The graph similarity of two graphs, denoted as GSim(*G*_{1},*G*_{2}), is the maximal possible similarity induced by a mapping between these graphs.

For example, consider the mapping *M* = {(“Buy Goods”, “Buy Goods”), (“Reception of Goods”, “Receive Goods”)}. Then, the partial graph similarity induced by *M* for the query graph, *G*_{q}, and graph 1, *G*_{1}, can be computed based on Definition 14. Note that there are 2 skipped nodes (“Consume Goods” and “Verify Invoice”), 3 skipped edges, and that the label similarity of “Reception of Goods” and “Receive Goods” is 0.62. Consequently, \(\mathrm{GSim}(G_{q},G_{1},M)=1.0-\frac{0.5\cdot0.33+0.5\cdot0.6+1.0\cdot0.19}{0.5+0.5+1.0}\approx 0.67\). This is also the maximal possible graph similarity induced by any mapping and, hence, this is the graph similarity of the two graphs, i.e., GSim(*G*_{q}, *G*_{1}) = GSim(*G*_{q}, *G*_{1}, *M*).

Algorithm 1 is a greedy algorithm that searches for the mapping *M* for which two process graphs have the highest similarity. The algorithm works as follows. Initially, all possible node pairs are added to *openpairs* (line 3) and no node pair to the mapping *M* (line 4). Then, in each iteration, GSim(*G*_{1}, *G*_{2}, *M* ∪ {(*n*, *m*)}) is computed for all (*n*, *m*) ∈ *openpairs* to select the pair that increases the partial graph similarity the most (line 7 and 8). That pair is added to the mapping *M* (line 10) and all pairs that contain one of the nodes from that pair are removed from *openpairs* (line 11), such that each node can be mapped at most once. The algorithm ends when there is no node pair in *openpairs* that can increase the graph similarity (line 7 and 8).

We again illustrate the algorithm using the example based on the query graph and graph 1 from Fig. 5. Initially, there are 9 (3 times 3) pairs in *openpairs*. In the first iteration, the pair that increases the similarity most is (“Buy Goods”, “Buy Goods”), because it has the highest label similarity. This pair is added to the mapping *M* and all elements that contain one of the nodes “Buy Goods” are removed from *openpairs*, such that there are 4 (2 times 2) pairs left. In the second iteration, (“Reception of Goods”, “Receive Goods”) is chosen, since this pair increases the partial graph similarity most. Then, there is only one (one by one) pair left, but it cannot increase the partial graph similarity and the function ends. In the example, the graph similarity is computed 14 times (9 times in the first iteration, 4 times in the second iteration and 1 time in the third and last iteration).
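The control flow of the greedy search, including the count of 14 similarity computations, can be sketched in Python. This is not the paper's Algorithm 1 listing: the function names are hypothetical, and the scoring function is a toy stand-in for the partial graph similarity, built from invented per-pair gains chosen so that the run follows the narrative above.

```python
def greedy_mapping(nodes1, nodes2, score):
    """Greedily grow a one-to-one mapping while score(mapping) keeps increasing.

    Returns the mapping and the number of score evaluations on candidate pairs.
    """
    openpairs = {(n, m) for n in nodes1 for m in nodes2}
    mapping = {}
    best = score(mapping)
    evaluations = 0
    while openpairs:
        candidates = []
        for (n, m) in sorted(openpairs):
            candidates.append((score({**mapping, n: m}), n, m))
            evaluations += 1
        top_score, n, m = max(candidates)
        if top_score <= best:
            break  # no remaining pair increases the score
        mapping[n] = m
        best = top_score
        # Each node may be mapped at most once.
        openpairs = {(a, b) for (a, b) in openpairs if a != n and b != m}
    return mapping, evaluations


# Hypothetical per-pair gains standing in for the change in graph similarity.
GAINS = {("Buy Goods", "Buy Goods"): 0.6,
         ("Reception of Goods", "Receive Goods"): 0.3}


def toy_score(mapping):
    """Toy score: sum of pair gains, with a small penalty for every other pair."""
    return sum(GAINS.get(pair, -0.1) for pair in mapping.items())


QUERY = ["Buy Goods", "Reception of Goods", "Consume Goods"]
GRAPH1 = ["Buy Goods", "Receive Goods", "Verify Invoice"]
mapping, evaluations = greedy_mapping(QUERY, GRAPH1, toy_score)
```

With this toy score the search evaluates 9, then 4, then 1 candidate pairs and stops with the two pairs from the example in the mapping. Passing the score in as a function keeps the search logic independent of how the graph similarity itself is computed.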

### 5.2 Improvements

Below, we optimize the algorithm, by reducing the number of times that the graph similarity has to be computed and by reducing the complexity of computing the similarity itself. We present three improvements for Algorithm 1.

#### 5.2.1 Selecting only the top-k similar node pairs

We can reduce the number of times that the graph similarity has to be computed, by initially reducing the number of *openpairs*. In Algorithm 1, *openpairs* is assigned *N*_{1}×*N*_{2} initially (line 3) to include all the possible node mappings between the query graph and the graph in the dataset. However, only pairs with high similarity scores are valuable because those can be expected to increase the similarity the most. The time complexity of the algorithm is directly related to the size of *openpairs*. Consequently, reducing the size of this set has direct impact on the execution time. Therefore, in this section we aim to reduce the size of *openpairs* as follows. For each node in the query graph, we find the top k most similar nodes in the graph and only put these node pairs in the *openpairs*. Definition 15 presents the formal definition of the top *k* most similar nodes.

### Definition 15

(Top-K Most Similar Nodes)

Let *n* be a node, let *N* be a set of nodes, let *k* be the number of similar nodes that should be considered for each node, and let *sim* be a function to compare the similarity of two nodes (e.g., *lsim*). The set that contains the top *k* similar nodes for the node *n*, denoted as *TOP*(*k*, *n*, *N*, *sim*), is the set that makes the following conditions hold:

- *TOP*(*k*, *n*, *N*, *sim*) ⊆ *N*;
- |*TOP*(*k*, *n*, *N*, *sim*)| = min(*k*, |*N*|);
- ∀*p* ∈ *TOP*(*k*, *n*, *N*, *sim*), \(\not\exists o\in N\setminus\mathit{TOP}(k,n,N,\mathit{sim})\), such that *sim*(*n*, *o*) > *sim*(*n*, *p*).

The parameter *k* can be set as desired to get the best results.

For example, consider the query graph and graph 1 from Fig. 5. Let *k*=1, let *n* be the node “Reception of Goods” in Query graph, let *N* be the node set of graph 1, and let *sim*=*lsim*. Then, *TOP*(1,*n*,*N*,*lsim*)= {“Receive Goods”}.
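Selecting the top *k* candidates per query node and building the reduced *openpairs* set can be sketched with the standard library. In the Python sketch below the function names are hypothetical, and the similarity table is illustrative: only the value 0.62 for (“Reception of Goods”, “Receive Goods”) comes from the text, the other values are assumptions.

```python
import heapq


def top_k(k, n, nodes, sim):
    """The (at most) k nodes most similar to n; ties are broken arbitrarily."""
    return set(heapq.nlargest(k, nodes, key=lambda other: sim(n, other)))


def reduced_openpairs(k, query_nodes, graph_nodes, sim):
    """Candidate pairs restricted to the top-k partners of each query node."""
    return {(n, m) for n in query_nodes for m in top_k(k, n, graph_nodes, sim)}


# Illustrative similarity values; unlisted pairs default to 0.0.
SIMILARITY = {("Reception of Goods", "Receive Goods"): 0.62,
              ("Reception of Goods", "Buy Goods"): 0.3,
              ("Reception of Goods", "Verify Invoice"): 0.1,
              ("Buy Goods", "Buy Goods"): 1.0}


def table_sim(a, b):
    return SIMILARITY.get((a, b), 0.0)


GRAPH1 = ["Buy Goods", "Receive Goods", "Verify Invoice"]
best = top_k(1, "Reception of Goods", GRAPH1, table_sim)
pairs = reduced_openpairs(1, ["Buy Goods", "Reception of Goods"], GRAPH1, table_sim)
```

With *k* = 1 the number of candidate pairs drops from |*N*_{1}|·|*N*_{2}| to at most |*N*_{1}|, which is where the reduction in similarity computations comes from.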

There is a drawback to computing the top *k* node pairs by only using label similarity. The graph edit similarity considers both the label similarity and the structural similarity. However, the Top-K heuristic does not take structural information into account and therefore may not result in the optimal node pair. For example, considering the query graph and graph 2 in Fig. 3, the labels “Buy Goods” and “Purchase Commodities” are related, but their label similarity is only 0.24. Consequently, the pair (“Buy Goods”, “Purchase Commodities”) may not be put into *openpairs* when using the Top-K heuristic, even if it may increase the similarity score later on in the execution of the algorithm.

To partly account for this issue, we can also take into account structural information in the Top-K heuristic. We can do this by comparing the sizes of the pre-sets and post-sets of nodes as defined in Definition 5. Then, we can compute the similarity of two nodes by considering both the label and role similarities, as described in Definition 16.

### Definition 16

(Node Similarity)

Let *n*, *m* ∈ *N* be two nodes. The node similarity is a weighted average value of lsim(*n*,*m*) and rsim(*n*,*m*), i.e., \(\frac{w_{l}\cdot\mathrm{lsim}(n,m)+w_{r}\cdot\mathrm{rsim}(n,m)}{w_{l}+w_{r}}\), where *w*_{l} and *w*_{r} are parameters that can be set as desired to produce the best results.

For example, let *w*_{l}=1.0 and *w*_{r}=0.5. Considering the nodes “Buy Goods” and “Purchase Commodities” in query graph and graph 2 of Fig. 3, their node similarity is \(\frac{1.0\cdot0.24+0.5\cdot0.67}{1.0+0.5}\approx0.38\). Consequently, for *k*=1 (“Buy Goods”, “Purchase Commodities”) would be put in *openpairs*, while, if we had used label similarity instead of the node similarity, for the node “Buy Goods” in Query graph, the only node pair in the *openpairs* would have been (“Buy Goods”, “Get Commodities”).
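The arithmetic of this example can be checked with a short Python sketch. The function name `node_similarity` is hypothetical; the inputs 0.24 (label similarity) and 0.67 (role similarity) are the values used in the example above.

```python
def node_similarity(lsim_value, rsim_value, w_l=1.0, w_r=0.5):
    """Weighted average of label and role similarity (Definition 16)."""
    return (w_l * lsim_value + w_r * rsim_value) / (w_l + w_r)


similarity = node_similarity(0.24, 0.67)
print(round(similarity, 2))  # 0.38
```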

#### 5.2.2 Incrementally computing the graph similarity

We can reduce the computation time of the graph similarity, by computing it incrementally instead of anew in each iteration. In Algorithm 1, when we add a new node pair (*n*,*m*) into the mapping *M*, we need to re-compute the partial graph similarity according to the new mapping *M*∪{(*n*,*m*)}, i.e., GSim(*G*_{1},*G*_{2},*M*∪{(*n*,*m*)}) (lines 7 and 8). However, GSim(*G*_{1},*G*_{2},*M*∪{(*n*,*m*)}) is related to GSim(*G*_{1},*G*_{2},*M*), so we can compute it incrementally. Therefore, this section investigates the definition of graph similarity and deduces an incremental way to compute the partial graph similarity.

From Definition 14, we know that the partial graph similarity is related to three fractions, i.e., *fskipn*, *fskipe* and *fsubn*. These fractions change when *M* changes. Let us see, one by one, how these fractions change after putting a node pair into *M*.

First, no matter which node pairs are in *M* and which new node pair is put into *M*, the size of *skipn* always reduces by two, because two nodes are matched and removed from the skipped node set. Thus, we can compute the increment of *fskipn*, as defined in Definition 17.

### Definition 17

(Skipped-node Fraction Increment)

Let *G*_{1}=(*N*_{1},*E*_{1},*λ*_{1}) and *G*_{2}=(*N*_{2},*E*_{2},*λ*_{2}) be two graphs. Let *M* be a partial injective mapping that maps *N*_{1} to *N*_{2}. Let (*n*,*m*) be a node pair in openpairs. After putting (*n*,*m*) into *M*, the increment of *fskipn*, denoted as *Δ*_{|fskipn|}, is defined as follows:

\[\Delta_{|\mathit{fskipn}|}=-\frac{2}{|N_{1}|+|N_{2}|}\]

Second, the size reduction of *skipe* is related to the new node pair (*n*,*m*) and the mapping *M*. We can compute this by only considering edge pairs that are related to (*n*,*m*), instead of all possible edge pairs. The size reduction is equal to the size of the intersection of •*n*×•*m* and *M* and the size of the intersection of *n*•×*m*• and *M*.

### Definition 18

(Skipped-edge Increment)

Let *G*_{1}=(*N*_{1},*E*_{1},*λ*_{1}) and *G*_{2}=(*N*_{2},*E*_{2},*λ*_{2}) be two graphs. Let *M* be a partial injective mapping that maps *N*_{1} to *N*_{2}. Let (*n*,*m*) be a node pair in openpairs. After putting (*n*,*m*) into *M*, the increment of |*skipe*|, denoted as *Δ*_{|skipe|}, is defined as follows:

\[\Delta_{|\mathit{skipe}|}=-2\cdot\bigl(|(\bullet n\times\bullet m)\cap M|+|(n\bullet\times m\bullet)\cap M|\bigr)\]

The increment of *fskipe* is defined in Definition 19.

### Definition 19

(Skipped-edge Fraction Increment)

Let *G*_{1}=(*N*_{1},*E*_{1},*λ*_{1}) and *G*_{2}=(*N*_{2},*E*_{2},*λ*_{2}) be two graphs. Let *M* be a partial injective mapping that maps *N*_{1} to *N*_{2}. Let (*n*,*m*) be a node pair in openpairs. After putting (*n*,*m*) into *M*, the increment of *fskipe*, denoted as *Δ*_{|fskipe|}, is defined as follows:

\[\Delta_{|\mathit{fskipe}|}=\frac{\Delta_{|\mathit{skipe}|}}{|E_{1}|+|E_{2}|}\]

Third, contrary to *skipn*, the size of *subn* increases by two after putting (*n*,*m*) into *M*. The increment of *fsubn* also involves the label similarities of (*n*,*m*) and pairs in *M*, as defined in Definition 20.

### Definition 20

(Substituted-node Fraction Increment)

Let *G*_{1}=(*N*_{1},*E*_{1},*λ*_{1}) and *G*_{2}=(*N*_{2},*E*_{2},*λ*_{2}) be two graphs. Let *M* be a partial injective mapping that maps *N*_{1} to *N*_{2}. Let (*n*,*m*) be a node pair in openpairs. After putting (*n*,*m*) into *M*, the increment of *fsubn*, denoted as *Δ*_{fsubn}, is defined as follows:

\[\Delta_{\mathit{fsubn}}=\frac{\sum_{(o,p)\in M}\bigl(1-\mathit{lsim}(o,p)\bigr)+1-\mathit{lsim}(n,m)}{|M|+1}-\frac{\sum_{(o,p)\in M}\bigl(1-\mathit{lsim}(o,p)\bigr)}{|M|}\]

Note that |*subn*|=2⋅|*M*|.

From the analysis above, we can derive that after putting (*n*,*m*) into *M*, the graph similarity increment can be computed by performing only two computations, those of *lsim*(*n*,*m*) and *Δ*_{|skipe|}. The other components of the computation are either constants that can be computed before executing the algorithm, or functions of *M* that only have to be computed once each time *M* changes. Proposition 1 shows how the graph similarity increment can be computed as a function of *lsim*(*n*,*m*) and *Δ*_{|skipe|}, two constants *c*_{1} and *c*_{2} and two functions of *M*: *φ*_{1}(*M*) and *φ*_{2}(*M*).

### Proposition 1

(Graph Similarity Increment)

*Let* *G*_{1}=(*N*_{1},*E*_{1},*λ*_{1}) *and* *G*_{2}=(*N*_{2},*E*_{2},*λ*_{2}) *be two graphs*. *Let* *M* *be a partial injective mapping that maps* *N*_{1} *to* *N*_{2}. *Let* (*n*,*m*) *be a node pair in* openpairs. *After putting* (*n*,*m*) *into* *M*, *the graph similarity increment*, *denoted as* *Δ*, *is defined as follows*:

*where*:

We prove Proposition 1 as follows.

### Proof

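The graph similarity increment is the difference between the partial graph similarity after and before putting the pair (*n*,*m*) into *M*, i.e.: *Δ*=GSim(*G*_{1},*G*_{2},*M*∪{(*n*,*m*)})−GSim(*G*_{1},*G*_{2},*M*). We can rewrite this as follows.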

From Proposition 1 we can see that only *lsim*(*n*,*m*) and *Δ*_{|skipe|} are related to the new pair (*n*,*m*). We already know how to compute them as described in Definitions 3 and 18. Therefore, we can compute the graph similarity incrementally. For example, consider the query graph and graph 1 from Fig. 5. Let the weights wsubn=1.0, wskipn=0.5 and wskipe=0.5. Let the mapping *M*= {(“Buy Goods”, “Buy Goods”)}. Then, after putting (“Reception of Goods”, “Receive Goods”) into *M*, the graph similarity increment is \(\frac{1}{2.0}\cdot(\frac{1.0\cdot0.63}{2}-\frac{0.5\cdot(-2)}{5}-\frac{1.0\cdot1.0}{2}+\frac{2.0\cdot0.5}{6}) \approx 0.09\).
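The worked example can be cross-checked with a small Python sketch that compares the incremental computation against a full recomputation. Several things here are assumptions: Definition 14 is taken to have the usual graph-edit form (*fskipn*=|*skipn*|/(|*N*_{1}|+|*N*_{2}|), *fskipe*=|*skipe*|/(|*E*_{1}|+|*E*_{2}|), *fsubn*=2⋅Σ(1−*lsim*)/|*subn*|), the two toy graphs are hypothetical stand-ins with 6 nodes and 5 edges in total (matching the denominators of the example), the function names are invented, and the sketch only handles a non-empty mapping without self-loops.

```python
def delta_skipe(e1, e2, mapping, n, m):
    """Change in |skipe| when (n, m) joins the mapping (assumed Definition 18)."""
    delta = 0
    for o, p in mapping.items():
        if (o, n) in e1 and (p, m) in e2:  # incoming edges that now match
            delta -= 2
        if (n, o) in e1 and (m, p) in e2:  # outgoing edges that now match
            delta -= 2
    return delta


def gsim(n1, e1, n2, e2, mapping, lsim, weights):
    """Partial graph similarity recomputed from scratch (assumed Definition 14)."""
    wskipn, wskipe, wsubn = weights
    matched = sum(1 for (a, b) in e1
                  if a in mapping and b in mapping
                  and (mapping[a], mapping[b]) in e2)
    fskipn = (len(n1) + len(n2) - 2 * len(mapping)) / (len(n1) + len(n2))
    fskipe = (len(e1) + len(e2) - 2 * matched) / (len(e1) + len(e2))
    fsubn = sum(1 - lsim[pair] for pair in mapping.items()) / len(mapping)
    return 1 - (wskipn * fskipn + wskipe * fskipe + wsubn * fsubn) / sum(weights)


def gsim_increment(n1, e1, n2, e2, mapping, lsim, weights, n, m):
    """Increment computed only from lsim(n, m), delta_skipe and sums over M."""
    wskipn, wskipe, wsubn = weights
    size = len(mapping)
    mapped_lsim = sum(lsim[pair] for pair in mapping.items())
    return (wsubn * lsim[(n, m)] / (size + 1)
            - wskipe * delta_skipe(e1, e2, mapping, n, m) / (len(e1) + len(e2))
            - wsubn * mapped_lsim / (size * (size + 1))
            + 2 * wskipn / (len(n1) + len(n2))) / sum(weights)


# Hypothetical toy graphs: 3 + 3 nodes and 2 + 3 edges.
n1 = {"Buy Goods", "Reception of Goods", "Consume Goods"}
e1 = {("Buy Goods", "Reception of Goods"), ("Reception of Goods", "Consume Goods")}
n2 = {"Buy Goods", "Receive Goods", "Verify Invoice"}
e2 = {("Buy Goods", "Receive Goods"), ("Buy Goods", "Verify Invoice"),
      ("Receive Goods", "Verify Invoice")}
lsim = {("Buy Goods", "Buy Goods"): 1.0,
        ("Reception of Goods", "Receive Goods"): 0.63}
weights = (0.5, 0.5, 1.0)  # (wskipn, wskipe, wsubn)

before = {"Buy Goods": "Buy Goods"}
after = dict(before, **{"Reception of Goods": "Receive Goods"})
increment = gsim_increment(n1, e1, n2, e2, before, lsim, weights,
                           "Reception of Goods", "Receive Goods")
full_difference = (gsim(n1, e1, n2, e2, after, lsim, weights)
                   - gsim(n1, e1, n2, e2, before, lsim, weights))
print(round(increment, 4), round(full_difference, 4))
```

Under these assumptions both values agree, which illustrates why the increment needs only *lsim*(*n*,*m*) and *Δ*_{|skipe|} for the new pair.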

#### 5.2.3 Pre-selecting similar node pairs

We can reduce the number of times the graph similarity must be computed, by ‘predicting’ the pair in the mapping that would increase the similarity the most. In Algorithm 1, to decide which node pair to add into the mapping *M* next, we need to compute GSim(*G*_{1},*G*_{2},*M*∪{(*n*,*m*)}) for all the (*n*,*m*) in *openpairs* and find the maximal value (lines 7 and 8). Proposition 1 discloses that the graph similarity increment is related to two variables: *lsim*(*n*,*m*) and *Δ*_{|skipe|} only. Consequently, if we can ‘predict’ the value of those two variables, we can predict the value of the overall similarity increase.

This section first proposes an efficient manner to compute the value of *Δ*_{|skipe|}. Then, it uses the values of *lsim*(*n*,*m*) and *Δ*_{|skipe|} to pre-select a few candidate pairs that potentially have the largest similarity increment. Last, it finds the node pair with the maximal graph similarity increment from candidate pairs. By doing this, only the graph similarity increments of these candidate pairs, instead of all the pairs in *openpairs*, need to be computed and compared to find the pair with the maximal graph similarity increment.

We can efficiently compute the value of *Δ*_{|skipe|} as follows. During the iterations, given a pair (*n*,*m*), *lsim*(*n*,*m*) is constant regardless of changes to the mapping *M*. However, *Δ*_{|skipe|} is related to both (*n*,*m*) and *M*. In each iteration, we need to know the *Δ*_{|skipe|} values for all the pairs in *openpairs* before the pre-selection. We can compute the *Δ*_{|skipe|} values based on Definition 18 in each iteration, but this is time consuming because all pairs in *openpairs* must be considered. Instead, we build a cache to store the *Δ*_{|skipe|} values for all the pairs in *openpairs*. Initially, all the *Δ*_{|skipe|} values are 0, because there is no node pair in *M*. When a node pair (*o*,*p*) is added to *M*, we only need to update the *Δ*_{|skipe|} values for the pairs (*n*,*m*)∈*openpairs* for which *n*∈•*o*∧*m*∈•*p* or *n*∈*o*•∧*m*∈*p*• holds. Proposition 2 presents the rule to update the *Δ*_{|skipe|} values.

### Proposition 2

(Difference of Skipped-edge Increment)

*Let* *G*_{1}=(*N*_{1},*E*_{1},*λ*_{1}) *and* *G*_{2}=(*N*_{2},*E*_{2},*λ*_{2}) *be two process graphs as defined in Definition* 1. *Let* *M* *be a partial injective mapping that maps* *N*_{1} *to* *N*_{2}. *Let* (*o*,*p*) *and* (*n*,*m*) *be two node pairs in* openpairs \((o\not=n \land p\not=m)\). *After putting* (*o*,*p*) *into* *M*, *the difference of the* *Δ*_{|skipe|} *value for* (*n*,*m*) *is defined as follows*:

We prove Proposition 2 as follows.

As an example, consider the query graph and graph 1 from Fig. 5. Let the mapping *M*=∅. Then, after putting (“Buy Goods”, “Buy Goods”) into *M*, only the *Δ*_{|skipe|} values for (“Reception of Goods”, “Receive Goods”) and (“Reception of Goods”, “Verify Invoice”) need to be modified to −2.
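The cache update can be sketched as follows. This is an illustration under assumptions, not the paper's code: the update rule is taken to be "subtract 2 for every open pair whose nodes are both successors, or both predecessors, of the newly mapped pair", the function name is hypothetical, and the toy edge sets are invented so that the outcome matches the example above.

```python
def update_skipedge_cache(cache, e1, e2, o, p):
    """Adjust cached delta-|skipe| values after (o, p) has been put into M."""
    succ_o = {b for (a, b) in e1 if a == o}
    pred_o = {a for (a, b) in e1 if b == o}
    succ_p = {b for (a, b) in e2 if a == p}
    pred_p = {a for (a, b) in e2 if b == p}
    # Only neighbours of o and p are visited, not the whole of openpairs.
    for ns, ms in ((succ_o, succ_p), (pred_o, pred_p)):
        for n in ns:
            for m in ms:
                if (n, m) in cache:
                    cache[(n, m)] -= 2


# Hypothetical toy graphs (edge sets only).
e1 = {("Buy Goods", "Reception of Goods"), ("Reception of Goods", "Consume Goods")}
e2 = {("Buy Goods", "Receive Goods"), ("Buy Goods", "Verify Invoice"),
      ("Receive Goods", "Verify Invoice")}

# All values start at 0 because M is empty.
cache = {(n, m): 0
         for n in ("Reception of Goods", "Consume Goods")
         for m in ("Receive Goods", "Verify Invoice")}
update_skipedge_cache(cache, e1, e2, "Buy Goods", "Buy Goods")
print(cache)
```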

By now, we know the values of *lsim*(*n*,*m*) and *Δ*_{|skipe|}. Let us see how to use them to pre-select candidate pairs. The value range of *Δ*_{|skipe|} is typically limited. For example, in the validation dataset of this article, it can only be 0, −2, −4, or −6 (the values are always even, because edges match in pairs, one from each graph). Therefore, we can first consider *Δ*_{|skipe|} and then *lsim*(*n*,*m*). For each possible value of *Δ*_{|skipe|}, the node pair (*n*,*m*) with the maximal *lsim*(*n*,*m*) value is selected as a candidate pair (see Definition 1). To quickly get the pair (*n*,*m*) with maximal label similarity, we can sort the node pairs in *openpairs* in descending order of their label similarities in advance. This yields a few candidate pairs, of which the one with the maximal graph similarity increment is the pair we are looking for.

Next, we can pre-select the potential node pairs to be put into *M*. For example, consider the query graph and graph 1 in Fig. 3. When *M*={(“Buy Goods”, “Buy Goods”)}, the distinct values for *Δ*_{|skipe|} are −2 and 0. The node pairs (“Reception of Goods”, “Receive Goods”) and (“Consume Goods”, “Receive Goods”) are selected as the candidate pairs for these two values, respectively. Finally, (“Reception of Goods”, “Receive Goods”) is put into *M*, because it provides a higher graph similarity increment.
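The pre-selection step can be sketched as a single pass over the open pairs in descending order of label similarity, keeping the first pair seen for each distinct *Δ*_{|skipe|} value. The function name `preselect` is hypothetical, and all label similarities except 0.63 are assumed values for illustration.

```python
def preselect(openpairs, cache, lsim):
    """For each distinct delta-|skipe| value, keep the pair with maximal lsim."""
    candidates = {}
    for pair in sorted(openpairs, key=lambda pr: lsim[pr], reverse=True):
        # setdefault keeps the first (most similar) pair per cache value.
        candidates.setdefault(cache[pair], pair)
    return candidates


# State after ("Buy Goods", "Buy Goods") has been put into M.
cache = {
    ("Reception of Goods", "Receive Goods"): -2,
    ("Reception of Goods", "Verify Invoice"): -2,
    ("Consume Goods", "Receive Goods"): 0,
    ("Consume Goods", "Verify Invoice"): 0,
}
lsim = {
    ("Reception of Goods", "Receive Goods"): 0.63,
    ("Reception of Goods", "Verify Invoice"): 0.10,  # assumed
    ("Consume Goods", "Receive Goods"): 0.40,        # assumed
    ("Consume Goods", "Verify Invoice"): 0.05,       # assumed
}
candidates = preselect(cache.keys(), cache, lsim)
print(candidates)
```

Only the pairs in `candidates` then need their graph similarity increment computed and compared.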

There are algorithms for finding the items with the highest overall values across multiple attributes, e.g., the well-known threshold algorithm (TA) [13]. These algorithms are also applicable to node pair pre-selection (with two attributes in this case, *Δ*_{|skipe|} and *lsim*(*n*,*m*)). In this section, we present a simple but effective algorithm for the sake of explanation.

### 5.3 The improved greedy algorithm for process similarity search

Initially, instead of considering all the possible node pairs, only the top *k* most similar nodes are considered for each query node (see line 10) as explained in Sect. 5.2.1. These node pairs are sorted with respect to their label similarities. The mapping is empty at first (see line 11). Three more variables are defined (see lines 12–14): *result*, *skipedgecache* and *candidatepairs*. *result* is the partial graph similarity for the current mapping *M*. *skipedgecache* is a list that records the numbers of potentially matched edges for each pair in *openpairlist*. *candidatepairs* is a mapping that, for each possible value in *skipedgecache*, records the node pair with maximal node similarity.

In each iteration, the node pair with the maximal graph similarity increment in *candidatepairs* is added to the mapping; the variables are adapted according to the current state after that. The function ends when there is no more node pair for which the graph similarity increases.

In the example, *k*=2, *wskipn*=*wskipe*=*w*_{l}=0.5, and *wsubn*=*w*_{r}=1.0. In the figure the nodes are identified by the first letters of the words in their labels. The figure shows the values for the variables of the algorithm for three iterations. Initially, the top-2 most similar nodes for each node in the query model are determined. Based on that information, the *openpairlist* is constructed. Initially, none of the pairs in the *openpairlist* will reduce the number of skipped edges. Consequently, the *skipedgecache* contains only 0s. The only pair in *candidatepairs* is (“Buy Goods”, “Buy Goods”), which is consequently put into *M* in the first iteration. As a result, the *openpairlist*, *skipedgecache* and *candidatepairs* variables are updated. There are now two distinct values (−2 and 0) in *skipedgecache*, and two pairs in *candidatepairs*. (“Reception of Goods”, “Receive Goods”) provides a higher graph similarity increase and is chosen to be in *M* in the second iteration. Then, there is only one pair left in *candidatepairs*, but it cannot increase the partial graph similarity and the algorithm ends.

During the iterations, only four partial graph similarities induced by different mappings are computed, and only two partial graph similarities are compared to select the pair with the maximal graph similarity increment. Some additional computation is required in Algorithm 2 as compared to the original algorithm (e.g., ranking node pairs with respect to the label similarity). However, the improved algorithm is much less time consuming, as will be shown in the evaluation results in Sect. 8.

## 6 Ranking

Using the similarity estimation metric ESim from Sect. 4 and the similarity measurement metric GSim from Sect. 5, we can rank the models in a collection in the order of their similarity to a query model.

Given a query business process model and a set of business process models, we classify the set of business process models as ‘relevant’, ‘potentially relevant’ or ‘irrelevant’, according to Definition 13. We only rank the models in the ‘relevant’ and the ‘potentially relevant’ sets, by first presenting the models in the ‘relevant’ set, in the order of their estimated similarity ESim to the query model, and then presenting the models in the ‘potentially relevant’ set in the order of their similarity GSim to the query model. Ranking the models in a set results in a sequence that is ordered in descending order of similarity score (most similar item first). Sequences can be concatenated to produce a complete search result. Given two sequences *L* and *M*, their concatenation, denoted *L*++*M*, is the sequence in which the elements from *L* are put in front of the elements from *M*. We only consider the potentially relevant models that are sufficiently similar to the query model (i.e., we only consider the models *G* for which GSim(*G*,*G*_{q})>*cutoff*, where *cutoff* is a parameter). More precisely, the ranking is defined as follows.

### Definition 21

(Ranking)

Let *G*_{q} be a query graph and let *Gs* be a set of graphs. Furthermore, let *cutoff* be a parameter that determines the minimum similarity score. The ranking of the graphs from *Gs* according to their similarity to *G*_{q} is a mathematical sequence *Gr*++*Gp*, where:

- *Gr* is the sequence that consists of all models from *Gs* that are relevant to *G*_{q}, such that for each *Gr*_{i}, *Gr*_{j} from *Gr* holds: if *i*<*j* then ESim(*Gr*_{i},*G*_{q})≥ESim(*Gr*_{j},*G*_{q}); and
- *Gp* is the sequence that consists of all models *G* from *Gs* that are potentially relevant to *G*_{q} and for which GSim(*G*,*G*_{q})>*cutoff*, such that for each *Gp*_{i}, *Gp*_{j} from *Gp* holds: if *i*<*j* then GSim(*Gp*_{i},*G*_{q})≥GSim(*Gp*_{j},*G*_{q}).
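The ranking of Definition 21 amounts to two sorts and a concatenation, as the following Python sketch shows. The function name `rank` and the score tables are hypothetical; the classification into relevant and potentially relevant models is assumed to have been done already.

```python
def rank(relevant, potentially_relevant, esim, gsim, cutoff):
    """Return Gr ++ Gp as a list, most similar model first in each part."""
    gr = sorted(relevant, key=esim, reverse=True)
    gp = sorted((g for g in potentially_relevant if gsim(g) > cutoff),
                key=gsim, reverse=True)
    return gr + gp


# Illustrative scores (assumed values).
esim_scores = {"A": 0.9, "B": 0.7}
gsim_scores = {"C": 0.8, "D": 0.4, "E": 0.6}
ranking = rank(["B", "A"], ["D", "E", "C"],
               esim_scores.get, gsim_scores.get, cutoff=0.5)
print(ranking)  # ['A', 'B', 'C', 'E']
```

Model "D" is dropped because its GSim score does not exceed the cutoff.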

The improvement in the time complexity when using the similarity estimation step, can be characterized as follows. Let *k* be the total number of process models in a collection and *n* be the average number of nodes in a process model. If node features are used for similarity estimation, the similarity estimation searches the most similar node in a tree-based index, for each node in the query model. There are *k*⋅*n* nodes in the tree at most, when all nodes in the process model collection are distinct from each other. Therefore, the time complexity of the similarity estimation step (using the node features only) has an upper bound of *O*(*n*⋅log(*k*⋅*n*)). The time complexity of the greedy algorithm for process similarity search is *O*(*n*^{3}) [7]. Therefore, the improvement in time complexity is characterized as: *O*(*p*⋅*n*^{3}−*n*⋅log(*k*⋅*n*)), where *p* is the fraction of models that can directly be classified as relevant or irrelevant, after the similarity estimation step.

## 7 Implementation

This section describes the architecture that we propose for implementing the search algorithm in a business process model repository. The architecture in this paper is based on the more general architecture for business process model repositories that we propose in [33] and focuses on the similarity search aspect. As such, it provides a more detailed design of a single aspect of the architecture for business process model repositories. As a proof of concept, we implemented a web-based prototype of the architecture and the search algorithm.^{2} In this section, first, we present the general three-layer architecture of the tool in terms of a UML component diagram. Second, we make the architecture more concrete, by presenting details of the interfaces of the components and implementations of some of the components. Third, we present the sequence diagrams that describe the behavior of the architecture: one sequence diagram for building an index of business process models and their features and one for searching similar processes. Last, we present the prototype that implements the architecture.

The storage layer consists of three components: the indexing component and the internal and external process model component. The indexing component is the core of our design. It stores features and an index based on features. As examples, Fig. 8 contains two types of features. However, subclasses of “Feature” can be created as desired to also store other features. “NodeFeature” stores a label and a number of input and output edges; “Seq2Feature” stores sequences of two nodes. The class diagram describes two types of indices, “InvertedFeatureIndex” and “FeatureRelationIndex”. The former stores the relation between features and the business process graphs in which they are contained. The latter stores hierarchical relations between features as explained in Sect. 4. More precisely, it stores which feature is (direct) parent of which features (Fig. 4). The internal process model component stores the business process models in the format that is used in the repository for efficient computation, which is the process graph in this case (Definition 1). The external process model stores the business process models in their original format. Process models are described in the “ProcessModel” class, which has several subclasses, indicating that process models can be described in different notations, e.g., EPC and BPMN. The class can be extended as desired to store other types of models. In order for those models to work in the repository, the process repository management layer must contain functions to convert them to business process graphs. Note that process models, the corresponding process graphs and features of those process graphs are related via the “processId” that must be unique for a given process model.

For conciseness, Fig. 8 only describes the most important components in detail. We excluded details about the other components, because they are not essential to understand the design and because they would differ in different repositories, for example, to cater for different GUI requirements or to include business process models in different notations. For the same reason, not all operations that are made available by the repository are shown. For example, the external process model storage component only provides an operation to read all models, but obviously also operations should be provided to create, read, update and delete singular models. These operations, however, are not essential to understanding the design. As another example, additional operations should be provided to add a single model to the indexes or to remove a single model from the indexes, such that indexes can be updated incrementally when a model is added to or removed from the repository.

To build the index, the GUI invokes the “createIndex” function. First, this function initializes the indexes, by creating empty indexes. Second, it reads all process models from the repository. Third, it converts each process model into a process graph, retrieves all of its features (Sect. 2) and stores the process graph into the repository. Fourth, it inserts each feature of a process graph into the index (Sect. 3.3) and updates the index in the repository. Note that, at this moment, the index is constructed in-memory, instead of in the storage layer. This is not efficient and must be improved in future work.
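The core of the index-building step can be sketched as an inverted index from features to the identifiers of the process graphs that contain them. This is a deliberately simplified illustration: the names `create_index` and `node_label_features` are hypothetical, features are matched by exact equality rather than by similarity, and the hierarchical feature relation index is left out.

```python
from collections import defaultdict


def create_index(process_graphs, extract_features):
    """Map every feature to the set of process ids in which it occurs."""
    index = defaultdict(set)
    for process_id, graph in process_graphs.items():
        for feature in extract_features(graph):
            index[feature].add(process_id)
    return index


def node_label_features(graph):
    """Hypothetical feature extractor: one feature per node label."""
    nodes, _edges = graph
    return set(nodes)


# Two toy process graphs given as (nodes, edges).
graphs = {
    1: (["Buy Goods", "Receive Goods"], [("Buy Goods", "Receive Goods")]),
    2: (["Buy Goods"], []),
}
index = create_index(graphs, node_label_features)
print(dict(index))
```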

To search similar processes, the GUI invokes the “search” function. First, this function transforms the given model into a process graph. Second, it invokes the “estimate” function, to compute the relevant and potentially relevant process graphs for the given process graph. The “estimate” function first computes all features of the given query graph. It then reads the indexes into memory. (Again, processing of the indexes is done in-memory, which must be improved in future work.) The “estimate” function then retrieves, from the indexes, the features that match features from the process graph. It then uses this information to compute the estimated similarity of the query graph and the process graphs that have at least one matching feature. Finally, it uses the estimated similarity to determine which graphs are relevant and which graphs are potentially relevant. These two lists are returned. The search component then invokes the improved greedy algorithm (Sect. 5) to compute the similarity of the query graph to each of the potentially relevant process graphs. Finally, the list of results is returned to the user.

## 8 Evaluation

This section presents the evaluations of the algorithm described in this paper. The evaluations determine the execution time of the algorithm and the quality of the search results that it returns. In particular we compare the execution time and search result quality of the algorithm from this paper to those of the greedy algorithm [7]. Two evaluations are performed. One homogeneous evaluation in which the query models are taken from the collection that is searched and one heterogeneous evaluation in which the query models are taken from a different model collection.

### 8.1 Homogeneous evaluation

In this subsection, we present the homogeneous evaluation. We first explain the setup of the evaluation and then the results.

#### 8.1.1 Evaluation setup

We have two experimental setups: one for evaluating the quality of retrieved results and one for evaluating the execution time, respectively.

Both experiments are performed on the collection of SAP reference models. This is a collection of 604 business process models (described as EPCs) that capture the business processes that are supported by SAP [6]. On average each process model in the collection contains 21.6 nodes, with a minimum of 3 and a maximum of 130 nodes. The average size of node labels is 3.8 words.

To evaluate the quality of retrieved results, we use the same evaluation dataset as in [7]. This dataset consists of 100 process models that were extracted from the collection of SAP reference models. In addition to that we extracted 10 process models as query models. Consequently, there are 1000 combinations of a query model and a model in the dataset for which the similarity can be determined. For each of those combinations three human observers judged whether the process model is a relevant search result for a particular query model. Next, we can determine the quality of the search results that are returned by a particular algorithm by comparing them to the relevance judgement that is given by the human observers. We can quantify the quality in terms of the R-Precision [4].

### Definition 22

(R-Precision)

Let \(\mathcal{D}\) be the set of process models, \(\mathcal{Q}\) be the set of query models and \(\mathrm{relevant}: \mathcal{Q} \rightarrow \mathbb {P}(\mathcal{D})\) be the function that returns the set of relevant process models for each query model (as determined by the human observer).

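Given a ranked list of search results *D*=[*d*_{1},*d*_{2},…,*d*_{n}] for a query *q* with \(d_{i}\in \mathcal{D}\), the R-Precision is the precision of the first *R* results, where *R*=|relevant(*q*)| is the total number of process models that is relevant to the query:

\[\text{R-Precision}=\frac{|\{d_{i}\mid d_{i}\in\mathrm{relevant}(q)\wedge 1\leq i\leq R\}|}{R}\]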

We compare the R-Precision of the greedy algorithm that we developed in previous work [7] to the R-Precision of the improved greedy algorithm that is described in this paper. We use the greedy algorithm, because it is the fastest algorithm of the ones we studied [7] and, therefore, provides a lower-bound for improvements in execution time.
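As an illustration of the metric, the R-Precision of a single query can be computed as follows. The sketch assumes that R-Precision is the fraction of the first *R* results that are relevant, with *R* the number of relevant models; the function name and the convention of returning 0.0 when no model is relevant are assumptions.

```python
def r_precision(results, relevant):
    """Precision of the first R results, where R = |relevant|."""
    r = len(relevant)
    if r == 0:
        return 0.0  # assumed convention for queries without relevant models
    hits = sum(1 for d in results[:r] if d in relevant)
    return hits / r


# Three models are relevant; two of them appear among the first three results.
score = r_precision(["a", "b", "c", "d"], {"a", "c", "x"})
print(round(score, 2))  # 0.67
```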

To evaluate the execution time, we compare the 10 queries with all 604 business process models in the collection of SAP reference models, instead of just the 100 process models. We do this, because to compute the execution time we do not need the human judgement and computing the execution time for a larger set of models leads to a more realistic result. We record the average execution time per query.

#### 8.1.2 Evaluation results

Table 1 Result quality of the homogeneous evaluation

Features | Occurrences | Matches | Rel | PoR | Ir | R-Prec |
---|---|---|---|---|---|---|
Previous Work [7] | – | – | 0 | 100 | 0 | 0.84 |
1: Node(1) | 374 | 581 | 5.5 | 10.9 | 83.6 | 0.84 |
2: 1+Seq(2) | +267 | +197 | 8.1 | 8 | 83.9 | 0.83 |
3: 2+Seq(3) | +175 | +96 | 7.8 | 10.1 | 82.1 | 0.83 |
4: 2+Split(3) | +87 | +93 | 7.8 | 10.1 | 82.1 | 0.83 |
5: 4+Split(4) | +23 | +11 | 7.8 | 10.1 | 82.1 | 0.83 |
6: 2+Join(3) | +58 | +18 | 7.8 | 10.1 | 82.1 | 0.83 |
7: 6+Join(4) | +14 | +1 | 7.8 | 10.1 | 82.1 | 0.83 |

The rows in the table show the features that are used to do the feature-based similarity estimation. In the first row no feature-based similarity estimation is done. This row lists the performance of the greedy algorithm. In the second row similarity estimation is done based only on node features (of size 1). In the third row similarity estimation is done based on node features plus sequence features of size 2 and so on.

The columns in the table show the properties of the features and similarity estimation based on the features. First, they show the number of times features of a given type occur in the set of process models and the number of times features of a certain type match in the set of process models. For example, in the set of process models, there could be four nodes labeled ‘A’. These nodes count as four occurrences of the node feature type. Because of their high label feature similarity, these nodes can be considered to match. This leads to six matches, because each of the four nodes can be matched to each of the others. Second, the columns show the average number of process models that, after the similarity estimation step, are estimated as being relevant (Rel), potentially relevant (PoR) and irrelevant (Ir) over the ten queries. Third, the columns show the average R-Precision (R-Prec) over the ten queries.

The table shows that when similarity estimation is done based only on node features, on average 5.5 models are estimated to be relevant, 10.9 models to be potentially relevant and 83.6 models to be irrelevant. Therefore, in this situation, the improved greedy algorithm only has to be used to measure the similarity of about 11% of the total number of process models, about 6% of the models are immediately judged as relevant and the remaining models are judged as irrelevant. In this case the quality of the returned results in terms of R-Precision remains the same. If sequences of size two are also used to perform the similarity estimation, only 8% of the process models have to be compared using the improved greedy algorithm. However, this does lead to a slightly lower R-Precision. Inclusion of other types of features does not improve the similarity estimation any further.

Table 2 Execution time of the homogeneous evaluation

Features | Rel | PoR | Ir | T_{est} | T_{com} | \(\mathrm{T}_{\mathrm{total}}^{\mathrm{avg}}\) | \(\mathrm{T}_{\mathrm{total}}^{\min}\) | \(\mathrm{T}_{\mathrm{total}}^{\max}\) |
---|---|---|---|---|---|---|---|---|
Previous Work [7] | 0 | 604 | 0 | 0.00 s | 0.60 s | 0.60 s | 0.16 s | 1.45 s |
1: Node(1) | 7 | 73 | 524 | 0.05 s | 0.04 s | 0.09 s | 0.03 s | 0.14 s |
2: 1+Seq(2) | 13.7 | 44.9 | 554.4 | 0.05 s | 0.02 s | 0.07 s | 0.03 s | 0.09 s |
3: 2+Seq(3) | 9.5 | 73.2 | 521.3 | 0.05 s | 0.05 s | 0.10 s | 0.03 s | 0.15 s |
4: 2+Split(3) | 9.5 | 73.2 | 521.3 | 0.05 s | 0.05 s | 0.10 s | 0.03 s | 0.15 s |
5: 4+Split(4) | 9.5 | 73.2 | 521.3 | 0.05 s | 0.05 s | 0.10 s | 0.03 s | 0.15 s |
6: 2+Join(3) | 9.5 | 73.2 | 521.3 | 0.05 s | 0.05 s | 0.10 s | 0.03 s | 0.15 s |
7: 6+Join(4) | 9.5 | 73.2 | 521.3 | 0.05 s | 0.05 s | 0.10 s | 0.03 s | 0.15 s |

The execution time consists of two parts: the time it takes to estimate the similarity and classify process models as relevant (Rel), potentially relevant (PoR) or irrelevant (Ir), denoted T_{est}; and the time it takes to compute the similarity for the models classified as potentially relevant, denoted T_{com}. Table 2 shows the average estimation and execution times over the ten search queries. In addition to that it shows the average total time over the ten queries and the (minimum) time of processing the query that takes the least time and the (maximum) time of processing the query that takes the most time.

The table shows that, on average, estimating similarity based on node features helps to retrieve similar models 6.7 times faster and from Table 1 we know that this does not impact the quality of the search results. Also including sequence features of size two helps retrieve similar models 8.6 times faster, but from Table 1 we know that this reduces the quality of the results by about 0.01 in terms of R-Precision as a tradeoff.

The table also shows that, on average, the total search time for the greedy algorithm is quite acceptable and takes only 0.60 seconds. However, in the worst case the total search time for the greedy algorithm is already 1.45 seconds. This is slower than the response time that one would expect of a search engine. In addition to that, the search time of the greedy algorithm is linear over the number of models in the collection, meaning that if we were to search a collection of 6000 models (which is the size of the collection of business process models of Suncorp-Metway Ltd [16]) the search time would already be around 14 seconds in the worst case.
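The speed-up factors and the extrapolated worst-case search time follow directly from the average and maximum total times reported for the execution-time experiment. The short Python check below reproduces that arithmetic; the variable names are illustrative, and the extrapolation assumes, as the text does, that search time grows linearly with the number of models.

```python
baseline_avg = 0.60   # greedy algorithm, average total time in seconds
node_avg = 0.09       # node features only
node_seq_avg = 0.07   # node features plus sequences of size two

speedup_node = baseline_avg / node_avg
speedup_node_seq = baseline_avg / node_seq_avg

# Linear extrapolation of the greedy worst case from 604 to 6000 models.
worst_case_6000 = 1.45 * 6000 / 604

print(round(speedup_node, 1))      # 6.7
print(round(speedup_node_seq, 1))  # 8.6
print(round(worst_case_6000, 1))   # 14.4
```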

- *dcutoff*, which is a parameter that determines whether a role feature is considered to be discriminative (Definition 6).
- *lcutoff*_{high}, *rcutoff* and *lcutoff*_{med}, which are parameters that determine what is considered to be sufficiently similar for a feature to match (Definition 10).
- *ratio*_{r} and *ratio*_{p}, which are parameters that determine which class a process model belongs to based on the fraction of features that match with the query model (Definition 13).

We varied each of these parameters from 0 to 1 in increments of 0.1 and ran the experiments with all possible combinations of parameter values within this range. To show attractive tradeoffs, we use the parameter values that, on average over the queries, give the highest R-Precision or the fewest potentially relevant models. The values that we use are *dcutoff*=0.3, *lcutoff*_{high}=0.8, *rcutoff*=1.0 and *lcutoff*_{med}=0.2. The other two parameters also depend on the type of features we use. For the node feature (the second row in Table 1 or 2), *ratio*_{r}=0.5; otherwise, *ratio*_{r}=0.2. For the node and sequence (with two nodes) features (the second and third rows in Table 1 or 2), *ratio*_{p}=0.1; otherwise, *ratio*_{p}=0.0.
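To make the role of *ratio*_{r} and *ratio*_{p} concrete, the following sketch classifies a model from the fraction of query features that it matches. It is an illustration only: the function `classify`, the enum `Relevance`, and in particular the exact comparison operators are assumptions; Definition 13 remains the authoritative rule.

```python
from enum import Enum


class Relevance(Enum):
    RELEVANT = "Rel"
    POTENTIALLY_RELEVANT = "PoR"
    IRRELEVANT = "Ir"


def classify(matched: int, total: int,
             ratio_r: float, ratio_p: float) -> Relevance:
    """Classify a model by the fraction of query features it matches.

    ASSUMED rule (not copied from Definition 13):
      fraction >= ratio_r -> relevant
      fraction >  ratio_p -> potentially relevant
      otherwise           -> irrelevant
    """
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = matched / total
    if fraction >= ratio_r:
        return Relevance.RELEVANT
    if fraction > ratio_p:
        return Relevance.POTENTIALLY_RELEVANT
    return Relevance.IRRELEVANT


# Node features only: ratio_r = 0.5, ratio_p = 0.1 (values from the text).
print(classify(matched=6, total=10, ratio_r=0.5, ratio_p=0.1))
print(classify(matched=2, total=10, ratio_r=0.5, ratio_p=0.1))
print(classify(matched=0, total=10, ratio_r=0.5, ratio_p=0.1))
```

Only the models that end up in the potentially relevant class have to be passed on to the more expensive similarity computation.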

- *wskipn*, *wskipe* and *wsubn*, which denote the weights given to node deletion, edge deletion and node substitution, respectively (Definition 14).
- *k*, which denotes how many most similar nodes are considered for a search node (Definition 15).
- *w*_{l} and *w*_{r}, which denote the weights given to the label and role similarities (Definition 16).

For the first group, we used the same values as in [7], i.e., *wskipn*=0.1, *wskipe*=0.4 and *wsubn*=0.9. For the second group, we varied *k* from 1 to 10 in increments of 1; the best results are obtained with *k*=3. For the third group, we varied each of the parameters from 0 to 1 in increments of 0.1 and ran the experiments with all possible combinations of parameter values within this range. We used the parameters that give the best results, i.e., *w*_{l}=1.0 and *w*_{r}=0.6.
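The exhaustive parameter sweep described above can be sketched as a plain grid search. The helper names `grid` and `best_setting` are hypothetical, and `toy_score` is a stand-in for running the ten queries and measuring R-Precision; it is constructed to peak at the reported values and does not reproduce the actual experiments.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple


def grid(start: float, stop: float, step: float) -> List[float]:
    """Return start, start + step, ..., stop, rounded to avoid float drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n + 1)]


def best_setting(names: Sequence[str],
                 values: Sequence[float],
                 score: Callable[[Dict[str, float]], float]
                 ) -> Tuple[Dict[str, float], float]:
    """Try every combination of values and keep the highest-scoring one."""
    best, best_score = {}, float("-inf")
    for combo in product(values, repeat=len(names)):
        setting = dict(zip(names, combo))
        current = score(setting)
        if current > best_score:
            best, best_score = setting, current
    return best, best_score


def toy_score(setting: Dict[str, float]) -> float:
    # Placeholder objective, NOT the real evaluation.
    return -(abs(setting["w_l"] - 1.0) + abs(setting["w_r"] - 0.6))


setting, _ = best_setting(["w_l", "w_r"], grid(0.0, 1.0, 0.1), toy_score)
print(setting)
```

With eleven values per parameter, the number of combinations grows as 11 to the power of the number of parameters, which is why the parameters are tuned in separate groups rather than all at once.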

Note that the parameter settings are specifically tuned to give the best results for this dataset. Parameter values that are generically applicable should be obtained through additional experiments on other process model collections. However, a change in parameter settings should not change the conclusions about the comparison between the greedy algorithm and the algorithm in this paper, because both algorithms profit equally from the optimization of the parameters for the evaluation dataset.

### 8.2 Heterogeneous evaluation

In this subsection, we present the heterogeneous evaluation. We first explain the setup of the evaluation and then the results.

#### 8.2.1 Evaluation setup

In the heterogeneous evaluation, the model collection was extracted from the same collection as for the homogeneous evaluation. However, the query models were taken from a different collection of business process models, which represent the processes of a large manufacturing company. Ten query models were extracted. On average, each of these ten process models contains 20.3 nodes, with a minimum of 9 and a maximum of 35 nodes. The average size of node labels is 4.6 words.

The main difference between the heterogeneous and the homogeneous evaluation is that, in the homogeneous evaluation, it is clearer which models are similar to a given model. For example, the SAP Reference Model contains 7 purchasing models that resemble each other strongly. Consequently, given one of the purchasing models, it is very easy to find the other, similar, ones. For the heterogeneous evaluation, this is more difficult: given a purchasing model (that is not from the SAP Reference Model), similar models are less easy to identify. Therefore, the similarity estimation step will initially classify more models as potentially relevant. Consequently, we expect that the similarity estimation step will lead to a smaller efficiency improvement.

Setup for the heterogeneous evaluation

Branch/Business Function | nr. of query models | nr. of document models |
---|---|---|
Procurement | 3 | 37 |
Delivery and invoicing | 1 | |
Production planning | 17 | |
Sales | 4 | 43 |
Business planning | 2 | |
Management | 2 | |

#### 8.2.2 Evaluation results

The table below shows the results of the heterogeneous evaluation, including the estimation time (T_{est}) and the computation time by the improved greedy algorithm (T_{com}). In addition, the columns show the execution time for the queries that take the least (\(\mathrm{T}_{\mathrm{total}}^{\mathrm{min}}\)) and the most (\(\mathrm{T}_{\mathrm{total}}^{\mathrm{max}}\)) time.

Results of the heterogeneous evaluation

Features (size) | Rel | PoR | Ir | R-Prec | \(\mathrm{T}_{\mathrm{est}}\) | \(\mathrm{T}_{\mathrm{com}}\) | \(\mathrm{T}_{\mathrm{total}}^{\mathrm{avg}}\) | \(\mathrm{T}_{\mathrm{total}}^{\min}\) | \(\mathrm{T}_{\mathrm{total}}^{\max}\) |
---|---|---|---|---|---|---|---|---|---|
Previous [7] | 0 | 100 | 0 | 0.56 | 0.00 s | 0.32 s | 0.32 s | 0.20 s | 0.51 s |
1: Node(1) | 74.2 | 20.4 | 5.4 | 0.54 | 0.02 s | 0.02 s | 0.04 s | 0.02 s | 0.06 s |
2: 1+Seq(2) | 83 | 11.6 | 5.4 | 0.52 | 0.02 s | 0.01 s | 0.03 s | 0.02 s | 0.05 s |
3: 2+Seq(3) | 85 | 9.6 | 5.4 | 0.50 | 0.02 s | 0.01 s | 0.03 s | 0.02 s | 0.05 s |
4: 2+Split(3) | 85 | 9.6 | 5.4 | 0.50 | 0.02 s | 0.01 s | 0.03 s | 0.02 s | 0.05 s |
5: 4+Split(4) | 85 | 9.6 | 5.4 | 0.50 | 0.02 s | 0.01 s | 0.03 s | 0.02 s | 0.05 s |
6: 2+Join(3) | 85 | 9.6 | 5.4 | 0.50 | 0.02 s | 0.01 s | 0.03 s | 0.02 s | 0.05 s |
7: 6+Join(4) | 85 | 9.6 | 5.4 | 0.50 | 0.02 s | 0.01 s | 0.03 s | 0.02 s | 0.05 s |

The table shows that, by using node features only, around 20% of the process models need to be checked with the improved greedy algorithm. The execution time is reduced by a factor of 8, while the quality is reduced by 0.02 in terms of R-Precision. These findings support our expectation that in the heterogeneous case more models need to be checked with the greedy algorithm than in the homogeneous case (in which 10% of the process models need to be checked). By further including sequence features with two nodes, around 12% of the process models need to be checked with the improved greedy algorithm. The execution time is then reduced by a factor of 10.7, while the quality is reduced by 0.04 in terms of R-Precision. Similar to the previous evaluation, the results do not improve further when more features are included.
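The reduction factors quoted above follow directly from the average total times in the results table. A minimal sketch of the arithmetic, using only values from that table (the helper name `speedup` is hypothetical):

```python
def speedup(baseline_seconds: float, improved_seconds: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    if improved_seconds <= 0:
        raise ValueError("improved_seconds must be positive")
    return baseline_seconds / improved_seconds


BASELINE = 0.32  # average total time of the previous algorithm [7], in seconds

# Average total times per feature set, in seconds (from the results table).
avg_total = {"1: Node(1)": 0.04, "2: 1+Seq(2)": 0.03}

for features, seconds in avg_total.items():
    print(f"{features}: {speedup(BASELINE, seconds):.1f}x faster")
```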

For the heterogeneous evaluation, we changed the following parameter values to obtain the best results: *lcutoff*_{high}=0.2, *lcutoff*_{med}=0.1, *wskipe*=0.0 and *wsubn*=0.1. The values of the other parameters stay the same as for the homogeneous evaluation.

Comparing the homogeneous and heterogeneous experiments, we can see that the quality of the homogeneous experiment is higher (0.84 vs. 0.54). This is because similar tasks are typically labeled with the same terms in the same collection, but with different terms in different collections. Therefore, it is easier to establish task similarity based on label similarity in homogeneous datasets. Moreover, we currently use string edit distance to compute label similarity in this paper, which is a naive label similarity metric that cannot deal well with different (synonymous) terms being used in similar labels. The result quality of the heterogeneous experiment could therefore be improved by considering synonyms [8, 27] and domain ontologies [11]. We can also see that the execution time of the heterogeneous experiment is much lower (0.60 s vs. 0.32 s). This is due to the size of the query models. Although the average sizes are almost the same for both sets of query models (21.6 vs. 20.3 nodes), the maximal size differs a lot (130 vs. 35 nodes). Consequently, the slowest query of the homogeneous experiment takes 1.45 s, while the slowest query of the heterogeneous experiment only takes 0.51 s. This causes the homogeneous experiment to take more time on average.
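The weakness of string edit distance for synonymous labels can be illustrated with a small sketch. The normalization used here (one minus the edit distance divided by the length of the longer label, after lowercasing) is an assumption made for illustration and may differ from the label similarity defined earlier in the paper; the example labels are invented.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def label_similarity(a: str, b: str) -> float:
    """ASSUMED normalization: 1 - distance / length of the longer label."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Same wording scores high; a synonym scores low despite equal meaning.
print(label_similarity("Create purchase order", "create purchase order"))
print(label_similarity("buy goods", "purchase goods"))
```

Labels that differ only in capitalization are treated as identical, whereas "buy" and "purchase" share almost no characters and therefore pull the score down, which is the effect described above for queries from a different collection.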

## 9 Related work

The work presented in this paper is related to: business process similarity search, business process querying, general graph similarity (isomorphism) search, schema matching and ontology matching. We present work on these topics as related work.

Business process similarity search techniques have been developed from different angles [11, 17–19, 23, 26, 29]. These techniques mainly vary with respect to the information, incorporated in the business process models, that they use to determine similarity [7] and the underlying formalism that they use to determine similarity [10]. The work described in this paper complements existing business process similarity search techniques, because it focuses on estimating business process similarity, rather than measuring it exactly, and using that estimate to improve the time performance of existing techniques. As such, it can be combined with any of the existing techniques to improve their performance. Lu and Sadiq [18] also use features to determine similarity, but because their goal differs from the goal of this paper (they want to measure similarity exactly), their features are larger than ours, potentially consisting of a complete process model. This makes their features suitable for measuring similarity exactly, but not for estimating it quickly. Kunze and Weske [15] combine metric trees with process similarity metrics based on edit distances to reduce comparison operations. They compare process similarity based on complete process models, while we estimate process similarity based on different types of features. Furthermore, their method relies completely on metric trees, which require the process similarity metrics to satisfy the positivity, symmetry and triangle inequality postulates. This is not always the case when we consider synonyms [8, 27] and domain ontologies [11]. Although we also choose metric trees to index labels in this paper, this choice is optional. For example, we can use an inverted index to index labels for label similarity metrics that consider synonyms.
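To illustrate why synonym-aware similarity can conflict with metric trees, the sketch below checks the three postulates on a small sample of labels. Everything in it is hypothetical: the checker `violated_postulate`, the toy synonym table and the 0/1 distance are illustrations, and positivity is read here as "non-negative, and zero exactly for identical items".

```python
from itertools import product
from typing import Callable, Optional, Sequence


def violated_postulate(items: Sequence[str],
                       dist: Callable[[str, str], float],
                       tol: float = 1e-9) -> Optional[str]:
    """Return the first metric postulate violated on the sample, else None."""
    for a, b in product(items, repeat=2):
        d = dist(a, b)
        if d < -tol or (a == b and abs(d) > tol) or (a != b and d <= tol):
            return "positivity"
        if abs(d - dist(b, a)) > tol:
            return "symmetry"
    for a, b, c in product(items, repeat=3):
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:
            return "triangle inequality"
    return None


# Toy synonym table (an assumption, not taken from WordNet).
SYNONYMS = {frozenset({"buy", "purchase"})}


def synonym_distance(a: str, b: str) -> float:
    """0 for equal or synonymous labels, 1 otherwise."""
    return 0.0 if a == b or frozenset({a, b}) in SYNONYMS else 1.0


def length_distance(a: str, b: str) -> float:
    """A genuine metric on labels of pairwise different length."""
    return float(abs(len(a) - len(b)))


print(violated_postulate(["buy", "purchase", "sell"], synonym_distance))
print(violated_postulate(["a", "bb", "cccc"], length_distance))
```

Two distinct labels at distance zero already break the first postulate, so an index that relies on these properties to prune comparisons is no longer guaranteed to be correct for such a measure.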

Process model querying is another related topic. Instead of computing similarity between models, it retrieves process models that satisfy a given query. A query can be described by a query language for process models [1, 3, 5] or a (fragment of a) process model [14]. Awad [1] develops BPMN-Q, a language to query business processes, by extending the BPMN notation. Beeri et al. [3] propose BP-QL, a language to query business processes modeled in BPEL. Choi et al. [5] propose IPM-EPDL, a query language for a proprietary process modeling notation based on XML. Notably related to the work in this paper is the work by Jin et al. [14], who also develop indexing techniques, using sequences in the process models, to improve the efficiency of process model querying.

General graph search has been applied in various application domains, including fingerprint search, DNA search and chemical compound search. In these domains, (sub)graph isomorphism algorithms are used as a basis of graph search, by checking whether a query graph is a subgraph of a graph in the dataset. To avoid comparing two entire graphs, which is time consuming, graph fragments are used as features to build an index. This idea is also the basis for this paper. Willett et al. [28] describe feature-based similarity search in chemical compound databases. Shasha et al. [25] propose a path-based approach; Yan et al. [30] use discriminative frequent structures to index graphs; Zhao et al. [34] prove that using tree structures and a small number of discriminative graph structures to index graphs is a good choice. Yan et al. [31] also investigate the relationship between feature-based and structure-based methods and build a connection between the two. The main difference between the work that has been done in this area and the work in this paper is the different nature of business process graphs as compared to graphs in other domains. In particular, there is practically no restriction on the number of possible node labels in a business process graph, and matching nodes do not necessarily have identical labels. In comparison, DNA nodes have four possible labels, chemical compound nodes have 117 possible labels, and in both cases matching nodes have identical labels. Also, business process graphs have different structural properties and patterns. These characteristics require that feature types are defined specifically for business process graphs. In addition, processing feature similarity is different, because business process graphs do not require features to match exactly for graphs to be similar, while graphs in other domains do require features to match exactly.

The problem of process model similarity search can be related to that of schema matching [24]. There are, however, important differences between process models and schemas. Firstly, data models and schemas generally have labeled edges (associations or schema elements) in addition to labeled nodes. Secondly, the types of nodes and the attributes attached to nodes are different in process models when compared to schemas or data models (e.g. there are no control nodes in data models). During our experiments, we implemented a graph matching technique originally designed for schema matching, namely Similarity Flooding [21]. After adapting the technique to deal with process models, we tested it on the dataset discussed in this paper using various parameter settings [8]. The similarity flooding technique led to a poor score: a mean average precision of 0.56 for the best settings (with a first-10 precision of 0.6). We attribute this poor performance to the fact that edges in process models do not have labels, while schema matching techniques, such as similarity flooding, heavily rely on edge labels. Madhusudan et al. [19] introduce a structural metric for process model comparison based on similarity flooding. However, Madhusudan et al. rely on a semantic notation in which process models have labels attached to their edges.

The problem of process model similarity search can also be related to that of ontology matching [12]. However, the nature of ontologies and business processes is different; a process model consists of labeled tasks and control flow relations, while an ontology provides a vocabulary, which records the relationship of its terms, e.g., generalization and specialization. This makes it hard to directly use techniques from the area of ontology matching in the area of business process similarity search. However, in future work, it would be worthwhile to investigate the possible use of ontologies and ontology matching for matching tasks and task labels. Ehrig et al. [11] apply such a technique, using WordNet synonyms [22] as an ontology. In previous work we also applied WordNet synonyms to measure the semantic similarity of two labels [8, 27]. However, we still need to develop indexing techniques to use those similarity metrics efficiently.

## 10 Conclusion

This paper presents an algorithm that improves the efficiency of business process similarity search. The algorithm contains three improvements to an existing algorithm for fast business process similarity search [32]. In addition to that it presents a preprocessing step, in which the similarity of business process models is estimated. The estimation is used to quickly classify business process models as relevant, irrelevant or potentially relevant to a query. The actual similarity computation, which is computationally expensive, then has to be performed for fewer models, namely only those models that were classified as potentially relevant. The classification is done based on simple, but representative, parts of business process models, also called features.

The greedy algorithm for process similarity search, developed in previous work, is improved in three ways. First, the number of node pairs that must be compared to determine similarity is reduced by initially selecting only a subset of all possible combinations of nodes. Second, the algorithm for similarity computation is improved by computing the similarity incrementally, rather than anew in each iteration of the algorithm. Third, the number of node pairs that must be compared is further reduced by ‘predicting’ the node pairs that should increase the similarity the most in each iteration of the algorithm.

The evaluations that are performed on the algorithm show that, as a consequence of the improvements, the search time of the fastest algorithm for business process similarity search that currently exists can be reduced by a factor of 10 with a quality reduction of less than 0.04 (in terms of R-Precision). These reductions are computed as the average over ten search queries. The time reduction for the most complex query is a factor of 24.5 and the reduction for the least complex query is a factor of 2.5.

The evaluations also show that individual nodes and sequences of two nodes are effective features to quickly compare and classify business process models. Other features that have been used are sequences of three nodes and splits and joins. However, these features do not further improve the quality of the search results or reduce the search time.

There are some research topics that are left for future work. First, in this paper the similarity of nodes in business process models is mainly based on string similarity. However, nodes can be labeled differently using synonyms, in particular when the query models and the models in the dataset are from different organizations. Therefore, we propose that in future work more advanced metrics for label similarity that consider synonyms [8, 27] and domain ontologies [11] are applied. Second, the algorithm in this paper mainly focuses on tasks and connections between them. However, process models often contain more information that may be exploited when determining their similarity, e.g., resources and data used. We propose that the extent to which such information can be used to determine process similarity is investigated in future work. Third, the architecture for fast process similarity search can be extended to incorporate technical measures that improve the efficiency of similarity search. For example, the architecture may allow for distributed processing of search queries.

The prototype can be accessed at: http://is.tm.tue.nl/research/apromore.html. Please use Firefox or Google Chrome as your web browser, since IE does not support the script we use.

## Acknowledgement

The research reported in this paper is supported by the China Scholarship Council (CSC).

### Open Access

This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.