Keywords

1 Introduction

Graphs offer a convenient way to formally model objects or patterns, which are composed of complex subparts including the relations that might exist between these subparts. In particular, the nodes of a graph can be employed to represent the individual components of a pattern while the edges might represent the structural connections between these components. Attributed graphs, that is, graphs in which nodes and/or edges are labelled with one or more attributes, have been of crucial importance in pattern recognition throughout more than four decades [15]. For instance, graphs are successfully used in the field of recognizing biological patterns where one has to formally describe complex protein structures [6] or molecular compounds [7].

The definition of an adequate dissimilarity model between two patterns is one of the most basic requirements in pattern recognition. The process of evaluating the dissimilarity of two graphs is commonly referred to as graph matching. The overall aim of graph matching is to find a correspondence between the nodes and edges of two graphs. There are several graph matching models available. Spectral methods, for instance, constitute an important class of error-tolerant graph matching procedures with a quite long tradition [812]. Graph kernels constitute a second important family of graph matching procedures and various types of graph kernels emerged during the last decade [1317]. For an extensive review on these and other graph matching methods developed during the last forty years, the reader is referred to [24].

In the present paper we focus on the concept of graph edit distance [1820]. This graph matching model is especially interesting because it is able to cope with directed and undirected, as well as with labelled and unlabeled graphs. However, the major drawback of graph edit distance is its computational complexity which is exponential with respect to the size of the graphs. Graph edit distance belongs to the family of quadratic assignment problems (QAPs), and thus, graph edit distance is known to be an NP-complete problem. In order to reduce the computational complexity of this particular distance model, an algorithmic framework for the approximation of graph edit distance was introduced in [21]. The basic idea of this approach is to find an optimal assignment of local neighbourhoods in the graphs, reducing the problem of graph edit distance to a linear sum assignment problem (LSAP). Recently, a new approximation framework, which extends [2125] in terms of finding a suboptimal assignment of local neighbourhoods, has been presented in [27]. This new approach further reduces the computational complexity of graph edit distance from cubic to quadratic order. An algorithm has been presented in which the human can interact or guide the algorithm to find a proper correspondence between nodes or interest points [28, 29]. And other graph matching techniques have been presented such as [30, 31].

Both approaches, i.e. the original approximation [2125] as well as the faster variant [27], use the same definition of local neighbourhoods, viz. single nodes and their adjacent edges. Yet various other definitions for local structures exist [32]. The present paper investigates the impact of these different local neighbourhoods in the context of the recent quadratic approximation algorithm [27].

2 Definitions and Algorithms

2.1 Graph and Neighbourhood

An attributed graph is defined as a triplet \( G = (\varSigma_{\nu } , \varSigma_{e} ,\gamma_{v} ) \), where \( \varSigma_{v} = \{ v_{a} | a = 1, \ldots ,n\} \) is the set of nodes and \( \varSigma_{e} = \left\{ {\left. {e_{ab} } \right|a,b \in 1, \ldots ,n} \right\} \) is the set of undirected and unattributed edges. In the present paper we investigate graphs with labelled nodes only. Yet, all of our concepts can be extended to graphs with labelled edges. Function \( \gamma_{v} :\varSigma_{v} \to \Delta_{v} \) assigns attribute values from arbitrary domains to nodes. The order of a graph \( G \) is equal to the number of nodes \( n \). The number of edges of node \( v_{a} \) is referred to as \( E(v_{a} ) \). Finally, the neighbourhood of a node \( v_{a} \) is termed \( N_{{v_{a} }} \).

In this paper we focus on three different types of neighbourhoods \( N_{{v_{a} }} \) depending on the amount of structural information which is taken into account. The first definition of \( N_{{v_{a} }} \) is the empty set. That is, no structural information of node \( v_{a} \) is taken into account. Formally, \( N_{{v_{a} }} = \left( {\varSigma_{v}^{{N_{{v_{a} }} }} , \varSigma_{e}^{{N_{{v_{a} }} }} , \gamma_{v} } \right) \) with \( \varSigma_{v}^{{N_{{v_{a} }} }} = \varSigma_{e}^{{N_{{v_{a} }} }} = \{ \} \). We refer to this type of neighbourhood as Node.

The second definition of \( N_{{v_{a} }} \) is given by \( N_{{v_{a} }} = \left( {\varSigma_{v}^{{N_{{v_{a} }} }} , \varSigma_{e}^{{N_{{v_{a} }} }} , \gamma_{v} } \right) \) with \( \varSigma_{v}^{{N_{{v_{a} }} }} = \{ \} \) and \( \varSigma_{e}^{{N_{{v_{a} }} }} = \{ e_{ab} \in \varSigma_{e} \} \). In this case we regard the incident edges of \( v_{a} \) as neighbourhood only (without adjacent nodes). This second definition of \( N_{{v_{a} }} \) is referred to as Degree from now on. The third definition of \( N_{{v_{a} }} \) used in this paper is given by the set of nodes that are directly connected to \( v_{a} \) including all edges that connect these nodes with \( v_{a} \). Formally, \( N_{{v_{a} }} = \left( {\varSigma_{v}^{{N_{{v_{a} }} }} , \varSigma_{e}^{{N_{{v_{a} }} }} , \gamma_{v} } \right) \) with \( \varSigma_{v}^{{N_{{v_{a} }} }} = \left\{ {v_{b} |e_{ab} \in \varSigma_{e} } \right\} \) and \( \varSigma_{e}^{{N_{{v_{a} }} }} = \{ e_{ab} \in \varSigma_{e} |v_{b} \in \varSigma_{v}^{{N_{{v_{a} }} }} \} \). We refer to this definition of a neighbourhood as Star. In Fig. 1 an illustration of the three different neighbourhoods (Node, Star and Degree) is shown.

Fig. 1.
figure 1

Node, Degree and Star neighbourhoods (shown in light-grey) of a single node (shown in dark-grey).

2.2 Graph Edit Distance

A widely used method to evaluate the dissimilarity between two attributed graphs is graph edit distance [18, 19]. The basic idea of this dissimilarity model is to define a distance between two graphs \( G^{p} \) and \( G^{q} \) by means of the minimum amount of distortion required to transform \( G^{p} \) into \( G^{q} \). To this end, a number of distortions or edit operations, consisting of insertion, deletion, and substitution of both nodes and edges are employed. Edit cost functions are typically introduced to quantitatively evaluate the level of distortion of each individual edit operation. The basic idea of this is to assign a cost to the edit operations proportional to the amount of distortion they introduce in the underlying graphs.

A sequence \( (e_{1} , \ldots ,e_{k} ) \) of \( k \) edit operations \( e_{i} \) that transform \( G^{p} \) completely into \( G^{q} \) is called an \( edit\,path\,\uplambda\,(G^{p} , G^{q} ) \) between \( G^{p} \) and \( G^{q} \) . Note that in an edit path \( \uplambda\,(G^{p} , G^{q} ) \) each node of \( G^{p} \) is either deleted or uniquely substituted with a node in \( G^{q} \), and likewise, each node in \( G^{q} \) is either inserted or matched with a unique node in \( G^{p} \). The same applies for the edges.

Let \( {\varvec{\Upsilon}}(G^{p} , G^{q} ) \) denote the set of all edit paths between two graphs \( G^{p} \) and \( G^{q} \). The edit distance of two graphs is defined as the sum of cost of the minimal cost edit path among all competing paths in \( {\varvec{\Upsilon}}(G^{p} , G^{q} ) \).

Optimal algorithms for computing the edit distance are typically based on combinatorial search procedures (such as A* based search techniques). These procedures explore the space of all possible mappings of the nodes and edges of \( G^{p} \) to the nodes and edges of \( G^{q} \) (i.e. the search space corresponds to the set of all edit paths \( {\varvec{\Upsilon}}(G^{p} , G^{q} ) \)). Yet, considering \( m \) nodes in \( G^{p} \) and \( n \) nodes in \( G^{q} \), the set of possible edit paths \( {\varvec{\Upsilon}}(G^{p} , G^{q} ) \) contains \( O(m^{n} ) \) edit paths. Therefore, exact graph edit distance computation is exponential in the number of nodes of the involved graphs.

The problem of minimizing the graph edit distance can be reformulated as an instance of a Quadratic Assignment Problem (QAP). QAPs belong to the most difficult combinatorial optimization problems for which only exponential run time algorithms are known to date (QAPs are known to be NP-complete). The Bipartite Graph Matching algorithm (BP-GED) [21] is an approximation for the graph edit distance that reduces the QAP of graph edit distance computation to an instance of a Linear Sum Assignment Problem (LSAP). This algorithm first generates a cost matrix C which is based on costs of editing local substructures of both graphs. Formally, the cost matrix is defined by:

Where \( C_{i,j} \) denotes the cost of substituting nodes \( v_{i}^{p} \) and \( v_{j}^{q} \) as well as their local neighbourhoods \( N_{{v_{i}^{p} }} \) and \( N_{{v_{j}^{q} }} \). \( C_{i,\varepsilon } \) denotes the cost of deleting node \( v_{i}^{p} \) and its local neighbourhood \( N_{{v_{i}^{p} }} \), and \( C_{\varepsilon ,j} \) denotes the cost of inserting node \( v_{j}^{q} \) and its local neighbourhood \( N_{{v_{j}^{q} }} \).

A linear assignment algorithm can be applied on \( C \) in order to find a (optimal) mapping of nodes and their local neighbourhoods. A large number of solvers for linear sum assignment problems exist [26]. The time complexity of the best performing exact algorithms for LSAPs is cubic in the size of the problem.

Any complete assignment of local substructures derived on C can be reformulated as an admissible edit path from \( {\varvec{\Upsilon}}(G^{p} , G^{q} ) \). That is, the global edge structure from \( G^{p} \) and \( G^{q} \) can be edited with respect to the node operations captured in the mapping of local substructures (this is due to the fact that edit operations on edges always depend on the edit operations actually applied on their adjacent nodes). Eventually, the total cost of all edit operations (applied on both nodes and edges) can be interpreted as a graph edit distance approximation between graphs \( G^{p} \) and \( G^{q} \) (termed BP-GED from now on).

The edit path found with this particular procedure considers the structural information of the graphs in an isolated way only (singular neighbourhoods). Yet, the derived distance considers the edge neighbourhood of \( G^{p} \) and \( G^{q} \) in a global and consistent way and thus the derived distance is in the best case equal to, or in general larger than the exact graph edit distance.

Recently, a new approach which uses an approximation rather than an exact algorithm to solve the assignment of local substructures has been proposed [27]. This new approach employs a greedy assignment algorithm to suboptimally solve the LSAP stated on cost matrix C. The algorithm iterates through C from top to bottom through all rows, and in every row it assigns the current element to the minimum unused element in a greedy manner. More formally, for each row i in the cost matrix \( C = (C_{ij} ) \) the minimum cost entry \( {\varphi }_{i} = argmin_{{\forall {\text{j }}}} c_{ij} \) is determined and the corresponding node edit operation \( (v_{i}^{p} \to v_{\varphi i}^{q} ) \) is added to the mapping of local substructures. By removing column \( {\varphi }_{i} \) in C one can ensure that every column of the cost matrix is considered exactly once.

The remaining parts of the approximation algorithm are identical with the original framework. That is, based on the found assignment of local substructures an admissible edit path and its corresponding sum of costs is derived. However, as the complexity of this suboptimal assignment algorithm is only quadratic (rather than cubic), the time complexity of the complete graph edit distance approximation, termed Greedy-GED from now on, is further reduced.

3 Distance Models for Neighbourhoods

In this section we review and compare various methods proposed in [32] to obtain the individual cost entries \( C_{i,j} ,C_{i,\varepsilon } \), and \( C_{\varepsilon ,j} \) in the cost matrix. These cost entries depend on two weighted disjoint cost values. The first cost is defined with respect to the nodes, while the second cost takes into account the local neighbourhoods of the nodes. Formally, we define \( C_{i,j} ,C_{i,\varepsilon } \), and \( C_{\varepsilon ,j} \) according to the following three cases:

  1. (1)

    If two nodes \( v_{i}^{p} \in \varSigma_{v}^{p} \) and \( v_{j}^{q} \in \varSigma_{v}^{q} \) are mapped to each other, we have

    $$ C_{i,j} = \beta \cdot C(v_{i}^{p} \to v_{j}^{q} ) + \left( {1 - \beta } \right) \cdot C(N_{{v_{i}^{p} }} \to N_{{v_{j}^{q} }} ) $$
    (1)

    where \( \beta \) \( \in \) ]0,1[ is a weighting parameter that controls what is more important, the cost of the pure node substitution \( (v_{i}^{p} \to v_{j}^{q} ) \) or the cost of substituting the neighbourhoods \( N_{{v_{i}^{p} }} \) and \( N_{{v_{j}^{q} }} \) of both nodes.

  2. (2)

    If one node \( v_{i}^{p} \in \varSigma_{v}^{p} \) in \( G^{p} \) is deleted, we have

    $$ C_{i,\varepsilon } = \beta \cdot k_{v} + \left( {1 - \beta } \right) \cdot C(N_{{v_{i}^{p} }} \to \varepsilon ) $$
    (2)

    where \( \beta \in \) ]0,1[ (as defined above), \( k_{v} \) refers to a positive constant cost for deleting one node and \( C(N_{{v_{i}^{p} }} \to \varepsilon ) \) refers to the cost of deleting the complete neighbourhood of \( v_{i}^{p} \)

  3. (3)

    If one node \( v_{i}^{q} \in \varSigma_{v}^{q} \) in \( G^{q} \) is inserted, we finally have (similar to case 2)

$$ C_{\varepsilon ,j} = \beta \cdot k_{v} + \left( {1 - \beta } \right) \cdot C(\varepsilon \to N_{{v_{j}^{q} }} ) $$
(3)

The cost for node substitutions \( C(v_{i}^{p} \to v_{j}^{q} ) \) is commonly defined with respect to the underlying labelling of the involved nodes. Yet, the definition of an adequate cost model for neighbourhoods is not that straightforward, and this, described in greater detail in the next subsection.

3.1 The Cost of Processing Neighbourhoods

If the structural information of the nodes is not considered, that is we employ the Node neighbourhood, we have

$$ C(N_{{v_{i}^{p} }} \to N_{{v_{j}^{q} }} ) = C(N_{{v_{i}^{p} }} \to \varepsilon ) = C(\varepsilon \to N_{{v_{j}^{q} }} ) = 0 $$
(4)

For BP-GED [21] as well as for Greedy-GED [27] the same definition of the neighbourhood of a node \( v \) has been used, viz. \( N_{v} \) is defined to be the set of incident edges of node \( v \). That is, the Degree neighbourhood, as formally described in Sect. 2, is employed. Remember that in our paper unlabeled edges are considered only, and thus, edge substitution is free of cost. Hence, using this definition of a neighbourhood, the cost of substituting two neighbourhoods with each other is given by the difference of the numbers of edges of the involved nodes. Formally,

$$ C(N_{{v_{i}^{p} }} \to N_{{v_{j}^{q} }} ) = k_{e} \cdot\left| {E\left( {v_{i}^{p} } \right) - E\left( {v_{j}^{q} } \right)} \right| $$
(5)

where \( k_{e} \) refers to a positive constant cost for deleting/inserting edges and E(.) refers to the number of edges of a certain node.

Likewise, the deletion and insertion costs of neighbourhoods depend on the number of incident edges \( E\left( {v_{i}^{p} } \right) / E\left( {v_{j}^{q} } \right) \) of the deleted or inserted node \( v_{i}^{p} \) and \( v_{j}^{q} \), respectively. Formally,

$$ C(N_{{v_{i}^{p} }} \to \varepsilon ) = k_{e} \cdot E\left( {v_{i}^{p} } \right) $$
(6)
$$ C(\varepsilon \to N_{{v_{j}^{q} }} ) = k_{e} \cdot E\left( {v_{j}^{q} } \right) $$
(7)

We name this particular definition of the cost for processing neighbourhoods as Degree-cost. We will use this cost model for neighbourhoods as basic reference system.

Next, four other definitions of the cost for processing neighbourhoods are presented. In contrast with the Degree-cost model described above, these definitions are based on the Star neighbourhood.

The following definition of the insertion as well as the deletion cost assume that the complete neighbourhood has to be inserted or deleted when the corresponding node is inserted or deleted, respectively. That is, insertion and deletion costs consider the cost of processing all edges that connect the central node and the cost of processing all adjacent nodes. Formally, the complete deletion and insertion costs of neighbourhoods depend on the number of adjacent nodes.

$$ C(N_{{v_{i}^{p} }} \to \varepsilon ) = \left( {k_{v} + k_{e} } \right) \cdot E\left( {v_{i}^{p} } \right) $$
(8)
$$ C(\varepsilon \to N_{{v_{j}^{q} }} ) = \left( {k_{v} + k_{e} } \right) \cdot E\left( {v_{j}^{q} } \right) $$
(9)

Where \( \left( {k_{v} + k_{e} } \right) \) refers to the cost of deleting or inserting one edge and one node.

The substitution cost \( C(N_{{v_{i}^{p} }} \to N_{{v_{j}^{q} }} ) \) for star neighbourhoods is based on computing a distance between \( N_{{v_{i}^{p} }} \) and \( N_{{v_{j}^{q} }} \). For this particular computation every adjacent node and its corresponding edge is interpreted as an indivisible entity. That is, a Star neighbourhood can be seen as a set of independent entities (nodes with adjacent edge), and thus, the computation of \( C(N_{{v_{i}^{p} }} \to N_{{v_{j}^{q} }} ) \) refers to an assignment problem between two independent sets. This assignment problem can be solved in the same way as it is done with complete graphs by means of a cost matrix that considers substitutions, insertions and deletions. Formally,

$$ C(N_{{v_{i}^{p} }} \to N_{{v_{j}^{q} }} ) = AssignmentCost\left( {N_{{v_{i}^{p} }} ,N_{{v_{j}^{q} }} } \right) $$
(10)

In order to compute this assignment cost we use four different optimization algorithms.

The first algorithm is given by an optimal algorithm for general LAPs known as Hungarian algorithm [33] that runs in cubic time. The second algorithm solves the assignment of \( N_{{v_{i}^{p} }} \) and \( N_{{v_{j}^{q} }} \) by means of the Hausdorff distance for subsets [34]. This assignment algorithm estimates the distance between two sets of entities by removing the restriction of finding a bijective mapping between the individual elements. In contrast with the Hungarian algorithm, Hausdorff assignments can be computed in quadratic time.

The third algorithm computes the dissimilarity between \( N_{{v_{i}^{p} }} \) and \( N_{{v_{j}^{q} }} \) by finding a suboptimal assignment of the individual entities by means of the Greedy assignment algorithm [27] described in Sect. 2 (also in quadratic time).

Finally, we propose to use a Planar distance model. In this case, the relative position of each neighbourhood node is considered. That is, the only allowed assignments of entities from \( N_{{v_{i}^{p} }} \) to entities of \( N_{{v_{j}^{q} }} \) are the ones that are generated from cyclic combinations of the neighbours. Formally, the sets \( N_{{v_{i}^{p} }} \) and \( N_{{v_{j}^{q} }} \) are interpreted as strings and the assignment cost is computed through the Levenshtein distance on these strings [3438].

4 Experimental Evaluation

In the present paper three specific definitions for node neighbourhoods are presented, namely Node, Degree and Star. For the computation of both Node and Degree no additional algorithm is required. Yet, for Star neighbourhood a particular assignment solver is needed. We propose four different algorithms for this task. Hence, in total we have six different cost models to compute the individual entries \( c_{i,j} \in C \). Major aim of the present paper is to show the relevance of these costs models in terms of classification accuracy and runtime using Greedy-GED. We coded the algorithms using MATLAB 2013 and we conducted the experiments on an i5 processor of 1.60 GHz with 4 GB of RAM and Windows 10.

Table 1 shows the recognition ratio of a 1NN classifier and the mean matching runtime achieved with all models on five graph databases, viz. LETTER LOW, LETTER HIGH, GREC, COIL RAG and AIDS from the IAM graph repository [39]. In our experiments parameter \( \beta \) is fixed to 0.5 and we gauge \( k_{v} \) and \( k_{e} \) through a non-exhaustive trial and error process. The best accuracy and the fastest runtime is highlighted in bold face on each database, while the second best is underlined.

Table 1. Recognition ratio (1NN) and Runtime (Øt) of the three neighbourhoods (and for assignment solvers) on five datasets using greedy graph edit distance.

We first focus on the mean run time for one matching in ms (Øt). We observe that on all data sets the Node neighbourhood is the fastest model (as it could be expected) followed by the reference model Degree (which is only slightly slower than the Node neighbourhood). The models that are based on the Star neighbourhood suffer from substantial higher runtimes than Node and Degree (especially on the larger graphs of the GREC and AIDS data sets). Comparing the four assignment solvers with each other, we observe that the optimal (cubic time) Hungarian algorithm provides the highest run times among all competing algorithms, while the differences between the other three algorithms are negligible.

Next, we focus on the recognition rates of the different models. We observe that in three out of five cases the Star neighbourhood achieves the best recognition rates (LETTER LOW and HIGH as well as GREC). The Node neighbourhood and the reference model Degree achieve the best recognition rate on COIL RAG and AIDS, respectively. An interesting observation can be made on the AIDS data set where the reference model Degree significantly outperforms all other methods addressed in the present paper. Overall we conclude that it remains difficult to predict an adequate definition of local neighbourhoods for a given data set and application. Yet, there seems to be a weak tendency that increased neighbourhoods might improve the overall matching accuracy.

5 Conclusions

The fast approximation of graph edit distance is still an open and active area of research in the field of structural pattern recognition. A common model for the approximation of graph distances is based on bipartite graph matching. The basic idea of this approach is to reduce the difficult QAP of graph edit distance computation to an LSAP. This particular algorithmic framework consists of three major steps. In a first step the graphs to be matched are subdivided into individual nodes including local neighbourhoods. Next, in step 2, an algorithm solving the LSAP is employed in order to find an assignment of the nodes (plus local neighbourhoods) of both graphs. Finally, in step 3, an approximate graph edit distance is derived from the assignment of step 2. In the present paper we review six different ways to compute the costs of local neighbourhoods and compare them with each other on five data sets. Although no clear winner can be found in the experimental evaluation, the empirical results suggests to use a larger neighbourhood than it is traditionally employed in the context of bipartite graph matching.