Redefining the Graph Edit Distance

Graph edit distance has been used since 1983 to compare objects in machine learning when these objects are represented by attributed graphs instead of vectors. In these cases, the graph edit distance is usually applied to deduce a distance between attributed graphs. This distance is defined as the minimum amount of edit operations (deletion, insertion and substitution of nodes and edges) needed to transform a graph into another. Since now, it has been stated that the distance properties have to be applied [(1) non-negativity (2) symmetry (3) identity and (4) triangle inequality] to the involved edit operations in the process of computing the graph edit distance to make the graph edit distance a metric. In this paper, we show that there is no need to impose the triangle inequality in each edit operation. This is an important finding since in pattern recognition applications, the classification ratio usually maximizes in the edit operation combinations (deletion, insertion and substitution of nodes and edges) that the triangle inequality is not fulfilled.


Introduction
Several kinds of problems have been used to model attributed graphs for more than 4 decades [1][2][3]. This is because they have the flexible ability to define an element through their structural and semantic information. Some reviews have been published through these years [4][5][6][7]. Figure 1 shows several examples of representing an element through an attributed graph. Applications range from optical character recognition to chemical compound toxicity analysis. Other examples are image processing, text analysis or geolocalization. If elements in pattern recognition are modelled through attributed graphs, a distance between them is needed since, in pattern recognition, comparing elements is an essential task. One of the most widely used distances between attributed graphs is the graph edit distance [8][9][10][11].
Graph edit distance depends on some parameters, which have to be tuned to maximize an objective function, which is related on the task to be carried out. In the earliest presentations of the graph edit distance in 1983, [1,2], this function was considered a metric subject to certain properties on these parameters. A more contemporary book published in 2015 [12] also suggests the same properties. That book analyses some pattern recognition applications and several algorithms that compute the graph edit distance in an optimal and sub-optimal manner.
As an example of how important is to define the graph edit distance as a metric, we could mention the graph databases structured by metric-trees [13]. In this case, if the function to compare attributed graphs is not a metric, some attributed graphs could not be retrieved from the database although they had properly been introduced. Thus, the system performs as these graphs have not never been introduced in the database.
Another example are the pattern recognition methods that compact a set of attributed graphs that have been considered to belong to one class into only one graph representative [14][15][16]. The aim of this type of representations is twofold. First, they want to reduce the runtime of deducing the class of an element to be classified. This is because in these methods, only one comparison is needed between this unclassified graph and each class in the set. In contrast, in other methods, such as the k-nearest neighbours [17], the number of comparisons needed is the number of graphs in the set. Second, they want to increase the representational power of the set. This is because, by compacting the general features of the set, the local discrepancies between the elements in the set tend to be softened. In this case, if the function to compare attributed graphs is not a metric, the representative graph is not properly defined (the construction of the representative is carried out by several comparisons between the graphs in the set) and thus, there is a quality reduction of the pattern recognition task.
For this reason, in the previously commented examples, the graph edit distance parameters are set such as the graph edit distance becomes a metric. If we could somehow relax some of the distance properties while maintaining the graph edit distance a metric, we could extend the parameters' domain and thus, set them at a value where the query precision becomes higher (in the first example) and the representative power of the set also becomes higher (in the second example). Note we are aware that the dissimilarity between patterns in other pattern recognition applications can be defined as non-metric. Finally, the configuration of the graph edit distance parameters considering some matching algorithms is analysed in [18]. That paper demonstrates that only some graph matching algorithms return a metric although some distance properties are not fulfilled. Thus, it concludes that the selection of the graph matching is decisive not only for the run time point of view but also to assure the deduced graph edit distance between the graphs in the application is a metric.
To summarize, in this paper, we show that some properties usually considered in the graph edit distance [1,2,12] can be relaxed while the graph edit distance being a metric. This simplification is central in several pattern recognition applications. This is because all these applications have an objective function to be maximized (or minimized), for instance, the maximization of the recognition ratio or classification accuracy, between others. Thus, it turns out that several machine learning methods have been presented [19][20][21][22][23] to automatically deduce the graph edit distance parameters and, frequently, the objective function of these methods is maximized where these properties do not hold. Due to the findings of this paper, we currently know that the automatically learned graph edit distance parameters makes the graph edit distance to be a metric. The experimental section shows the classification ratio given some graph edit distance parameters and also shows that the maximum classification is achieved at the parameters that we currently can assure the graph edit distance is a metric.
Note that several algorithms have been presented to compute the graph edit distance. This is because its computation has been demonstrated to be NP complete [24]. Basically, there are algorithms that return the exact distance in exponential computational cost [25] and other ones that return an approximation of this value in polynomial computational cost [26][27][28]. The fact of using different edit costs do not influence over the computational cost but only on the algorithm per se. For instance, the computational cost of [25] is exponential, the computational cost of [26][27][28] is cubic and the computational cost of [29] is linear with respect to the number of nodes as it can be deduced in [30,31]. For this reason, in this paper, the experimental section does not show a comparison between these methods but only shows that the edit cost that return the highest classification rations do not fulfil the triangle inequality. For a specific comparison of graph matching methods, we refer to [32].

Graph Edit Distance
A distance is a function that holds the following properties [33]: where x, y and z are elements in a specific domain. The distance domain is [0, ∞).
An attributed graph G = (V, E) is composed of a set of n nodes V = v 1 , … , v n and a set of m edges E = e 1,2 , … , e t,k . Moreover, there is a function on the nodes, v , and on the edges, e , which assign an attribute on these nodes and edges. The graph edit distance between two attributed graphs, GED, G and G ′ is defined as, where E G, G ′ denotes the set of all edit paths e 1 , e 2 , … e k that transform G into G ′ and the cost of the edit path is CED, The set of node substitutions, deletions and insertions is represented by vs , vd and vi . Moreover, the costs of these edit operations are represented by C vs , C vd and C vi . Edit operations and costs on edges have similar definitions: es , ed , ei , C es C ed and C ei . Figure 2 shows a toy example of a graph edit distance between two simple graphs, which are composed of only two nodes and an edge between them. In this case, the transformation between both graphs is carried out by five edit operations: ed , vd , vi , ei and vs . Note that several edit paths transform one graph into the other one that generate different costs (Eq. 3). The edit path that is considered in the graph edit distance is the one that achieves the minimum cost through the previously defined C vs , C vd , C vi , C es C ed and C ei (Eq. 2).
The edit operations e t depend on the attributes on the nodes or on the arcs. For this reason, we define, for the rest of the paper, the following relations: If e t ∈ es , then e t = e ab , e � ij . e ab ∈ G and e � ij ∈ G � . (1) If e t ∈ vd , then e t = v a . v a ∈ G.
If e t ∈ ed , then e t = e ab .e ab ∈ G.
The distance properties (Eq. 1) are taken into consideration in the first definition of the graph edit distance [1] through the edit operations (Lemma 5 in [1]) and they are rewritten (Sect. 2.1.1 in [12]). Authors propose that if the edit operations fulfil the metric properties (Eq. 1) in their domain, then, it is clear that the graph edit distance is a metric. This is due to the graph edit distance is a linear function of the edit operations. In [1,12], the distance properties are written in the following three points. Note the nomenclature has been slightly changed with regard to [1]. Besides, we write C Theorem 1, defined below, demonstrates that the previously defined point 3.C and point 3.D (the ones that relate the triangle inequality to the deletion, insertion and substitution costs) do not need to be imposed to properly define the graph edit distance. We also comment how this finding influence over the string and tree edit distance.
Proof It is trivial to realise that the non-negativity: GED G, G ′ ≥ 0 ; the identity of indiscernible elements: GED G, G � = 0 ⇔ G = G � ; and the symmetry properties: GED G, G � = GED G ′ , G are fulfilled in the graph edit distance in the case that points 1. and 2. are fulfilled.
Thus, we have to prove the triangle inequality, which means that GED G, G ′ ≤ GED G, G �� + GED G ′′ , G ′ holds. We assume e G,G � ∈ E G, G � , e G,G �� ∈ E G, G �� and e G �� ,G � ∈ E G �� , G � are three edit path that make the minimum cost CED in E G, G ′ , E G, G ′′ and E G ′′ , G ′ , respectively. In other words, the ones used to compute GED G, G ′ , GED G, G ′′ and GED G ′′ , G ′ . We define an edit path ê G,G ′ that transforms G into G ′ as a concatenation of e G,G ′′ and e G ′′ ,G ′ . By construction, it holds that GED G, G � = CED G, G �� + CED G �� , G � and by definition of Strings and trees can be modelled as graphs. In the case of strings, each element in the string is represented as a node in the graph and the order of the string elements is represented through edges in the graph that connect the string elements that are placed side by side. In the case of trees, nodes in the trees are represented as nodes in the graphs and the fathersibling relation is represented as an edge in the graph from the father node to the sibling node. For this reason, Theorem 1 can be applied to strings and trees and thus, we have demonstrated that the tree edit distance and the string edit distance do not need to fulfil the triangle inequality between their edit operations to be considered a metric.

Practical Experiment
The aim of the paper is to show that the triangle inequality applied to the edit operations is not needed to be the graph edit distance a metric. Thus, it could be presented without practical evaluation. Nevertheless, we included this evaluation to show that it is usual to have the optimal configuration of the parameters at the point that this triangle inequality is not fulfilled, and therefore, it is worth this demonstration. Thus, in this section, we show that the highest classification ratios appear in the values of the edit operations, such that the triangle inequality between deletion and insertion operations and substitution operation is not fulfilled. Yet, these values were rarely applied since the graph edit distance was not considered a metric. We observe that the algorithms, such as [34], designed to learn the edit operations, do not intrinsically take into consideration fulfilling the triangle inequality properties but they do, for the other distance properties. Thanks to Theorem 1, we know that the graph edit distance continues to be a metric and therefore it is worth setting the edit operations in the values such that the recognition ratio is maximized (even if the triangle inequality on the edit operations is not fulfilled).
The following five public graph databases have been used: Coil-Rag, Grec, Letter-low, Letter-med and Letterhigh. These databases are explained in [32,35], and their most important features and details are summarized in Table 1. For instance, the number of graphs and classes, the average number of nodes and edges per graph and the number of attributes on nodes. Table 2 shows the minimum, maximum, mean and standard deviation of the node substitution costs of these databases. Given that the insertion and deletion edit operations are constants, we have to assure that these constants are set such that C vd + C vi ≥ max C vs , to fulfil the triangle inequality on the edit operations. Hence, the minimum value of C vd and C vi would have to be C vd = C vi = 0.5 max C vs . The last column shows 0.5 max C vs per each database.
Given these databases, we have computed the classification ratio as follows. We used the 1-nearest neighbour classification algorithm [17] for the classification purposes. Moreover, we have used the fast bipartite graph matching to deduce the graph edit distance between graphs [26,27].

SN Computer Science
The edit operations are set as follows: C vs is the Euclidean distance between node attributes.C es is not defined (these databases do not have attributes on the edges). Finally, we impose the insertion and deletion operations on nodes and edges to have the same value and to be the constant K v : K v = C vd = C ed = C vi = C ei . We first deduced the class of each graph in the test set. Then, the classification ratio is computed as the number of times the graphs in the test set are properly classified, divided by the number of graphs in the test set. To obtain the class of a graph in the test set, it is needed to compute the graph edit distance between this graph and all the graphs in the learning set. We conclude the class of the graph in the test set is the class of the graph in the learning set that has returns the minimum distance. Figure 3 shows the classification ratio in the vertical axis and the constant K v in the horizontal axis. Each value in the plot represents the classification ratio when the graph edit distance has been computed considering K v as the deletion and insertion costs on nodes and edges. The range of values of K v has been set from the minimum value of the substitution costs (first column in Table 2) to the maximum value of the substitution costs (second column in Table 2). Moreover, the arrow shows the range of values such that the triangle inequality between edit costs are fulfilled. It ranges from the last column of Table 2 to infinitive. Computing each plot took approximately two hours (MacMini, I5, Matlab). We realize that the classification ratio is maximized at the value of the insertion and deletion costs that are much smaller than half of the maximum value of the substitution cost (the last column of Table 2), which is the point that the triangle inequality begins not to be fulfilled. Therefore, we conclude that it is worth knowing that in the whole range of K v the graph edit distance is a metric although the triangle inequality on the edit operations does not hold on the lower part of the range of K v .

Discussion and Conclusion
Considering the practical experiment, we could deduce that imposing the insertion and deletion costs (represented as K v ) to be the mean of half of the substitution cost is a good option since it is at the point where the classification ratio is maximized. Nevertheless, in the past, this setting was not used due to the graph edit distance was not considered to be a metric. This paper relaxes the properties needed to be the graph edit distance a metric and for this reason, we can confirm that at this point, the graph edit distance keeps being a metric. Thus, it is worth using these costs.
Three algorithms have been presented to automatically deduce the edit operations [19][20][21]. Given the topic of this paper, we observe that their automatically found edit operations (the values that maximize the classification ratio) do not fulfil the triangle inequality. Nevertheless, we currently confirm that the graph edit distance is a metric at the insertion and deletion costs optimized by those learning algorithms. Clearly, this paper does not state that in all the applications, the classification ratio is maximized in the point that the triangle inequality between edit operations is not fulfilled. We only state that in these cases, we can consider using the learned parameters since we currently know that the graph edit distance is a metric. Finally, note that the learning methods impose, by construction of the algorithms, to fulfil the non-negativity, the symmetry and the identity of the edit operations.
As a future work, we want to analyse how influences the relaxation of other distance properties, specifically, the symmetry property. Note that in the classification scenarios, we want to know the distance between a graph of the test set and a graph of the learning set. Since, the symmetric distance, that is, the distance between a graph of the learning set and a graph of the test set is never computed, it could be worth not to fulfil the symmetry property at the edit operation level. This means that the insertion costs do not have to be the same as the deletion costs on nodes or edges. To do so, we are computing some experiments similar to the ones presented in Fig. 3 but taking some values for the insertion cost that are different from the deletion cost, that is, K vd = C vd ≠ K ve = C ed .