GraphMDL: Graph Pattern Selection based on Minimum Description Length

Many graph pattern mining algorithms have been designed to identify recurring structures in graphs. The main drawback of these approaches is that they often extract too many patterns for human analysis. Recently, pattern mining methods using the Minimum Description Length (MDL) principle have been proposed to select a characteristic subset of patterns from transactional, sequential and relational data. In this paper, we propose an MDL-based approach for selecting a characteristic subset of patterns on labeled graphs. A key notion in this paper is the introduction of ports to encode connections between pattern occurrences without any loss of information. Experiments show that the number of patterns is drastically reduced. The selected patterns have complex shapes and are representative of the data.


Introduction
Many fields have complex data that need labeled graphs, i.e. graphs where vertices and edges have labels, for an accurate representation. For instance, in chemistry and biology, molecules are represented as atoms and bonds; in linguistics, sentences are represented as words and dependency links; in the semantic web, knowledge graphs are represented as entities and relationships. Depending on the domain, graph datasets can be made of large graphs or large collections of graphs. Graphs are complex to analyze when the goal is to extract knowledge, for instance by identifying frequent structures that make the data more intelligible.
In the field of pattern mining, there have been a number of proposals, namely graph mining approaches, to extract frequent subgraphs. Classical approaches to graph mining, e.g. gSpan [12] and Gaston [7], work on collections of graphs, and generate all patterns w.r.t. a frequency threshold. The major drawback of this kind of approach is the huge number of generated patterns, which makes them difficult to analyze. Some approaches such as CloseGraph [13] reduce the number of patterns by only generating closed patterns. However, the set of closed patterns generally remains too large, with a lot of redundancy between patterns. Constraint-based approaches, such as gPrune [14], reduce the number of extracted patterns by extracting only the patterns that follow a certain acceptance rule. These algorithms generally manage to reduce the number of patterns, but they also limit their type. Additionally, if the acceptance rule is user-provided, the user needs some background knowledge of the data.
More effective approaches to reduce the number of patterns are those based on the Minimum Description Length (MDL) principle [3]. The MDL principle comes from information theory, and states that the model that describes the data the best is the one that compresses the data the best. It has been shown on sets of items [10], sequences [9] and relations [4] that an MDL-based approach can select a small and descriptive subset of patterns. Few MDL-based approaches have been proposed for graphs. SUBDUE [1] iteratively compresses a graph by replacing each occurrence of a pattern by a single vertex. At each step, the chosen pattern is the one that compresses the most. The drawback of SUBDUE is that the replacement of pattern occurrences by vertices entails a loss of information. VoG [5] summarizes graphs as a composition of predefined families of patterns (e.g., paths, stars). Like SUBDUE, VoG aims to only extract interesting patterns, but instead of evaluating each pattern individually like SUBDUE, it evaluates the set of extracted patterns as a whole. This allows the algorithm to find a good set of patterns instead of a set of good patterns.
One limitation of VoG is that the type of patterns is restricted to predefined ones. Another limitation is that VoG works on unlabeled graphs (e.g. network graphs), while we are interested in labeled graphs.
The contribution of this paper (Section 3) is a novel approach called GraphMDL, leveraging the MDL principle to select graph patterns from labeled graphs. Contrary to SUBDUE, GraphMDL ensures that there is no loss of information thanks to the introduction of the notion of ports associated with graph patterns. Ports represent how adjacent occurrences of patterns are connected.
We evaluate our approach experimentally (Section 4) on two datasets with different kinds of graphs: one on AIDS-related molecules (few labels, many cycles), and the other one on dependency trees (many labels, no cycles). Experiments validate our approach by showing that the data can be significantly compressed, and that the number of selected patterns is drastically reduced compared to the number of candidate patterns. Moreover, we observe that the patterns can have complex and varied shapes, and are representative of the data.
Background Knowledge

The MDL Principle
The Minimum Description Length (MDL) principle [3] is a technique from information theory that selects, from a family of models, the model that best describes some data. The MDL principle states that the best model $M$ for describing some data $D$ is the one that minimizes the description length $L(M, D) = L(M) + L(D|M)$, where $L(M)$ is the length of the model and $L(D|M)$ the length of the data encoded with the model. The MDL principle does not define how to compute every possible description length. However, common primitives exist for data and distributions [6]:
- An element $x \in \mathcal{X}$ with uniform distribution has a code of $\log(|\mathcal{X}|)$ bits.
- An element $x \in \mathcal{X}$ appearing $\mathrm{usage}(x, D)$ times in some data $D$ has a code of $L^{\mathcal{X}}_{\mathrm{usage}}(x, D) = -\log\left(\frac{\mathrm{usage}(x, D)}{\sum_{x_i \in \mathcal{X}} \mathrm{usage}(x_i, D)}\right)$ bits. This encoding is optimal.
- An integer $n \in \mathbb{N}$ without a known upper bound can be encoded with a universal integer encoding, whose size in bits is noted $L_{\mathbb{N}}(n)$ (footnote 1).
Description lengths of elements that are common to all models are usually ignored, since they do not affect their comparison.
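The primitives above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, logarithms are assumed to be base 2 (lengths in bits), and the universal integer code follows the shifted Elias gamma encoding mentioned in the paper's implementation footnote.

```python
import math


def uniform_code_length(domain_size: int) -> float:
    """Bits needed to pick one element among `domain_size` equally likely ones."""
    return math.log2(domain_size)


def usage_code_length(usage: int, total_usage: int) -> float:
    """Bits for an element used `usage` times out of `total_usage` uses overall."""
    return -math.log2(usage / total_usage)


def universal_int_length(n: int) -> int:
    """Length of the Elias gamma code of n + 1 (shifted so that n = 0 is encodable)."""
    # (n + 1).bit_length() - 1 is an exact floor(log2(n + 1)) for n >= 0.
    return 2 * ((n + 1).bit_length() - 1) + 1
```

For instance, `usage_code_length(3, 6)` gives 1 bit, which matches the code length quoted in the text for a pattern used 3 times when the total usage is 6.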
Krimp [10] is a pattern mining algorithm using the MDL principle to select a characteristic set of itemset patterns from a transactional database. Because of its good performance, Krimp has been adapted to other types of data, such as sequences [9] and relational databases [4]. In our approach we redefine Krimp's key concepts on graphs, in order to apply a Krimp-like approach to graph mining.

Graphs and Graph Patterns
Definition 1. A labeled graph $G = (V, E, l_V, l_E)$ over two label sets $\mathcal{L}_V$ and $\mathcal{L}_E$ is a data structure composed of a set of vertices $V$, a set of edges $E \subseteq V \times V$, and two labeling functions $l_V \in V \to 2^{\mathcal{L}_V}$ and $l_E \in E \to \mathcal{L}_E$ that associate a set of labels to vertices, and one label to edges.
$G$ is said to be undirected if $E$ is symmetric, and simple if $E$ is irreflexive.
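As an illustration of Definition 1, a labeled graph could be stored as two dictionaries. The class below is a hypothetical sketch (names such as `LabeledGraph` and `add_undirected_edge` are not from the paper); it only shows how the "undirected" and "simple" properties translate into checks on the edge set.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Labeled graph G = (V, E, l_V, l_E); E is the key set of `edge_labels`."""
    vertex_labels: dict = field(default_factory=dict)  # l_V: vertex -> set of labels
    edge_labels: dict = field(default_factory=dict)    # l_E: (u, v) -> label

    def add_vertex(self, v, labels=()):
        self.vertex_labels[v] = set(labels)

    def add_undirected_edge(self, u, v, label):
        # Store both directions so that E is symmetric.
        self.edge_labels[(u, v)] = label
        self.edge_labels[(v, u)] = label

    def is_undirected(self):
        """E is symmetric: every (u, v) has its reverse (v, u)."""
        return all((v, u) in self.edge_labels for (u, v) in self.edge_labels)

    def is_simple(self):
        """E is irreflexive: no edge from a vertex to itself."""
        return all(u != v for (u, v) in self.edge_labels)
```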
Although our approach applies to all labeled graphs, in the following we only consider undirected simple graphs, so as to compare ourselves with existing tools and benchmarks.Fig.
for all $e \in E^P$.
Footnote 1: In our implementation we use Elias gamma encoding [2], shifted by 1 so that it can encode 0. Therefore $L_{\mathbb{N}}(n) = 2\lfloor\log(n + 1)\rfloor + 1$.
We define graph patterns as graphs $G^P$ having some occurrences in the data graph $G^D$. Fig. 2 shows the three embeddings $\varepsilon_1, \varepsilon_2, \varepsilon_3$ of a two-vertex graph pattern into the graph of Fig. 1. We define singleton patterns as the elementary patterns. A vertex singleton pattern is a graph with one vertex having one label. An edge singleton pattern is a graph with two unlabeled vertices, connected by a single labeled edge. Fig. 3 shows examples of singleton patterns.

GraphMDL: MDL for Graphs
In this section we present our contribution: the GraphMDL approach. This approach takes as input a graph (the original graph $G^o$) and a set of patterns extracted from that graph (the candidate patterns), and outputs the most descriptive subset of candidate patterns according to the MDL principle. The candidates can be generated with any graph mining algorithm, e.g. gSpan [12].
The intuition behind GraphMDL is that since data and patterns are both graphs, the data can be seen as a composition of pattern embeddings. Informally, we want a user analyzing the output of GraphMDL to be able to say "the data is composed of one occurrence of pattern A, connected to one occurrence of pattern B, which is itself connected to one occurrence of pattern C". Moreover, we want the user to be able to tell how these structures are connected together: which vertices of each pattern are used to connect it to other patterns.

Model: A Code Table for Graph Patterns
Similarly to Krimp [10], we define our model as a Code Table (CT), i.e. a set $\mathcal{P}$ of patterns with associated coding information. A first difference with Krimp is that the patterns are graph patterns. A second difference is the need for additional coding information: a single code would not suffice, since all the information related to connectivity between pattern occurrences would be lost.
We therefore introduce the notion of ports in order to represent how pattern embeddings connect to each other to form the original graph. The set of ports of a pattern is a subset of the vertices of the pattern. Intuitively, a pattern vertex is a port if at least one pattern embedding maps this vertex to a vertex in the original graph that is also used by another embedding (be it of the same pattern or a different one). For example, in Fig. 5a the three occurrences of pattern $P_1$ are inter-connected through their middle vertex: this vertex is a port. Since port information increases the description length, we expect our approach to select patterns with few ports.
Fig. 4 shows an example of CT associated to the graph of Fig. 1. Every row of the CT is composed of three parts, and contains information about a pattern $P \in \mathcal{P}$ (e.g. the first row contains information about pattern $P_1$). The first part of a row is the graph $G^P$, which represents the structure of the pattern (e.g. $P_1$ is a pattern with three labeled vertices and two labeled edges). The second part of a row is the code $c_P$ associated to the pattern. The third part of a row is the description of the port set of the pattern, $\Pi_P$ (e.g. $P_1$ has two ports, its first two vertices, with codes of 2 and 0.42 bits; see footnote 2). We note $\Pi$ the set of all ports of all patterns. As in Krimp, the length of the code of a pattern or port depends on its usage in the encoding of the data, i.e. how many times it is used to describe the original graph $G^o$ (e.g. $P_1$ has a code of 1 bit because it is used 3 times and the sum of pattern usages in the CT is 6, see Sections 3.2 and 3.3).
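The code lengths quoted for $P_1$ can be checked with the usage-based primitive of Section 2. The pattern usage (3 out of 6) is stated in the text; the port usages of 1 and 3 (out of 4) are not stated explicitly and are inferred here from the quoted lengths of 2 and 0.42 bits, so they should be read as an assumption.

```latex
\begin{align*}
L(c_{P_1}) &= -\log_2\left(\tfrac{3}{6}\right) = 1 \text{ bit} \\
L(c_{\pi_1}, P_1) &= -\log_2\left(\tfrac{1}{4}\right) = 2 \text{ bits} \\
L(c_{\pi_2}, P_1) &= -\log_2\left(\tfrac{3}{4}\right) \approx 0.42 \text{ bits}
\end{align*}
```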

Encoding the Data with a Code Table
The intuition behind GraphMDL is that we can represent the original graph $G^o$ (i.e. the data) as a set of pattern occurrences, connected via ports. Encoding the data with a CT consists in creating a structure that makes explicit which occurrences are used and how they interconnect to form the original graph. We call this structure the rewritten graph $G^r$.
Definition 3. A rewritten graph $G^r = (V^r, E^r, l^r_V, l^r_E)$ is a graph where the set of vertices is $V^r = V^r_{emb} \cup V^r_{port}$: $V^r_{emb}$ is the set of pattern embedding vertices and $V^r_{port}$ is the set of port vertices. $E^r \subseteq V^r_{emb} \times V^r_{port}$ is the set of edges from embeddings to ports, and $l^r_V \in V^r_{emb} \to \mathcal{P}$ and $l^r_E \in E^r \to \Pi$ are the labelings.
Footnote 2: MDL approaches deal with theoretical code lengths, which may not be integers.
In order to compute the encoding of the data graph with a given CT, we start with an empty rewritten graph. One after another, we select patterns from the CT. For each pattern, we compute the occurrences of its graph $G^P$. Similarly to Krimp, we limit embedding overlaps: we admit overlap on vertices (since it is the key notion behind ports), but we forbid edge overlaps.
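The overlap rule can be illustrated with a small greedy filter. This is only a sketch under simplifying assumptions: embeddings are given as hypothetical `(pattern, vertices, edges)` triples already listed in CT order, edges are represented as vertex pairs, and the coverage of vertex labels is ignored.

```python
def retain_embeddings(candidate_embeddings):
    """Keep embeddings in order, rejecting any that reuses an already covered edge.

    Vertex overlap is allowed (it is what creates ports); edge overlap is not.
    """
    covered_edges = set()
    retained = []
    for pattern, vertices, edges in candidate_embeddings:
        if covered_edges.isdisjoint(edges):
            retained.append((pattern, vertices, edges))
            covered_edges.update(edges)
    return retained
```

With this rule, two embeddings that share only a vertex are both kept, whereas a second embedding that reuses an edge of the first one is discarded.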
Each retained embedding is represented in the rewritten graph by a pattern embedding vertex: a vertex $v_e \in V^r_{emb}$ with a label $P \in \mathcal{P}$ indicating which pattern it instantiates. Vertices that are shared by several embeddings are represented in the rewritten graph by a port vertex $v_p \in V^r_{port}$. We add an edge $(v_e, v_p) \in E^r$ between the pattern embedding vertex $v_e$ of a pattern $P$ and the port vertex $v_p$, when the embedding associated to $v_e$ maps the pattern's port $v_\pi \in \Pi_P$ to $v_p$. We label this edge $v_\pi$.
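The construction just described can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: each retained embedding is assumed to be a `(pattern, mapping)` pair where `mapping` sends pattern vertices to data vertices, and the function name and return format are hypothetical.

```python
from collections import Counter


def build_rewritten_graph(embeddings):
    """Return (embedding vertex labels, port vertices, labeled edges)."""
    # A data vertex becomes a port vertex when at least two embeddings use it.
    use_count = Counter(v for _, mapping in embeddings for v in mapping.values())
    port_vertices = {v for v, count in use_count.items() if count > 1}

    embedding_labels = []  # index i -> pattern instantiated by embedding vertex i
    edges = []             # (embedding index, port vertex, pattern port)
    for index, (pattern, mapping) in enumerate(embeddings):
        embedding_labels.append(pattern)
        for pattern_vertex, data_vertex in mapping.items():
            if data_vertex in port_vertices:
                edges.append((index, data_vertex, pattern_vertex))
    return embedding_labels, port_vertices, edges
```

Three embeddings of the same pattern that all map their second vertex to one shared data vertex yield a single port vertex with three incoming edges, each labeled with that pattern port.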
We make sure that code tables always include all singleton patterns, so that they can always encode any vertex and edge of the original graph.

Description Lengths
In this section we define how to compute the description length of the CT and the rewritten graph. Description lengths are used to compare CTs. Formulas are explained below and grouped in Fig. 6.
Code table. The description length $L(M) = L(CT)$ of a CT is the sum of the description lengths of its rows (skipping rows with unused patterns), and every row is composed of three parts: the pattern graph structure, the pattern code, and the pattern port description.
To describe the structure $G = G^P$ of a pattern ($L(G)$) we start by encoding the number of vertices of the pattern. Then we encode the vertices one after the other. For each vertex $v$, we encode its labels then its adjacent edges. To encode the vertex labels ($L_V(v, G)$) we specify their number first, then the labels themselves. To encode the adjacent edges ($L_E(v, G)$) we specify their number (between 0 and $|V| - 1$ in a simple graph), then for each edge, its destination vertex and its label. To avoid encoding the same edge twice in undirected graphs, we encode each edge with its endpoint that has the smallest identifier.
Vertex and edge labels are encoded based on their relative usage in the original graph $G^o$. Since this encoding does not change between CTs, it is a meaningful way to compare them.
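The structure encoding can be sketched in Python as follows. This is a hedged reading of the prose, not the authors' implementation: function and parameter names are hypothetical, logarithms are taken in base 2, a choice among $k$ equally likely values is assumed to cost $\log_2 k$ bits, and each undirected edge is assumed to be stored once as `(u, w)` with `u < w`.

```python
import math


def universal_int_length(n):
    """Shifted Elias gamma length, as in the paper's implementation footnote."""
    return 2 * ((n + 1).bit_length() - 1) + 1


def usage_code_length(label, usage):
    """Bits for `label` given label usages counted in the original graph."""
    return -math.log2(usage[label] / sum(usage.values()))


def structure_length(vertex_labels, edge_labels, vertex_label_usage, edge_label_usage):
    """Description length of a pattern structure G.

    vertex_labels: vertex -> set of labels; edge_labels: (u, w) -> label, u < w.
    """
    num_vertices = len(vertex_labels)
    length = universal_int_length(num_vertices)            # number of vertices
    for v, labels in vertex_labels.items():
        length += universal_int_length(len(labels))        # number of labels of v
        length += sum(usage_code_length(l, vertex_label_usage) for l in labels)
        length += math.log2(num_vertices)                  # number of edges stored at v
        for (u, w), label in edge_labels.items():
            if u == v:                                     # edge stored at its smaller endpoint
                length += math.log2(num_vertices)          # destination vertex
                length += usage_code_length(label, edge_label_usage)
    return length
```

Under these assumptions, a vertex singleton whose label accounts for half of the label usages costs 3 + 3 + 1 + 0 = 7 bits.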

Fig. 6. Formulas used for computing description lengths. The structure $G^P = (V^P, E^P, l^P_V, l^P_E)$ is shortened to $G = (V, E, l_V, l_E)$ for ease of reading.
The second element of a CT row is the code $c_P$ associated to the pattern ($L(c_P)$). This code is based on the usage of the pattern in the rewritten graph.
The last element of a CT row is the description of the pattern's ports ($L(\Pi_P)$). First, we encode the number of ports of the pattern (between 0 and $|V|$). Then we specify which vertices are ports: if there are $k$ ports, then there are $\binom{|V|}{k}$ possibilities. Finally, we encode the port codes ($L(c_\pi, P)$): their code is based on the usage of the port in the rewritten graph w.r.t. other ports of the pattern.
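The port description length can be sketched as below. The function name and its arguments are hypothetical, logarithms are assumed to be base 2, and every listed port is assumed to have a non-zero usage.

```python
import math


def port_description_length(num_pattern_vertices, port_usages):
    """Bits to describe the ports of a pattern with `num_pattern_vertices` vertices.

    port_usages: usage in the rewritten graph of each port of the pattern.
    """
    num_ports = len(port_usages)
    # Number of ports: a value between 0 and |V|, i.e. |V| + 1 possibilities.
    length = math.log2(num_pattern_vertices + 1)
    # Which vertices are ports: one choice among C(|V|, k) subsets.
    length += math.log2(math.comb(num_pattern_vertices, num_ports))
    # Port codes, relative to the other ports of the same pattern.
    total_usage = sum(port_usages)
    length += sum(-math.log2(usage / total_usage) for usage in port_usages)
    return length
```

As an illustration, a three-vertex pattern with two ports of usages 1 and 3 (illustrative values) costs $\log_2 4 + \log_2 3 + 2 + \log_2(4/3) = 6$ bits.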
Rewritten graph. The rewritten graph has two types of vertices: port vertices and pattern embedding vertices. Port vertices do not have any associated information, so we just need to encode their number. The description length $L(D|M) = L(G^r)$ of the rewritten graph is the length needed for encoding the number of port vertices plus the sum of the description lengths $L_{emb}(v, P, G^r)$ of the pattern embedding vertices $v$. Every pattern embedding vertex has a label $l^r_V(v)$ specifying its pattern $P$, encoded with the code $c_P$ of the pattern. We then encode the number of edges of the vertex, i.e. the number of ports of this embedding in particular (between 0 and $|\Pi_P|$). Then for each edge we encode the port vertex to which it is connected and to which port it corresponds (using the port code $c_\pi$).
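For reference, the prose of this section can be assembled into the following set of formulas. This is a reconstruction from the textual description only, not a reproduction of the paper's Fig. 6; in particular it assumes that a uniform choice among $k$ values costs $\log(k)$ bits and that each undirected edge is encoded at its endpoint with the smallest identifier.

```latex
\begin{align*}
L(CT) &= \sum_{P \in \mathcal{P},\ \mathrm{usage}(P, G^r) > 0} \left[ L(G^P) + L(c_P) + L(\Pi_P) \right] \\
L(G) &= L_{\mathbb{N}}(|V|) + \sum_{v \in V} \left[ L_V(v, G) + L_E(v, G) \right] \\
L_V(v, G) &= L_{\mathbb{N}}(|l_V(v)|) + \sum_{l \in l_V(v)} L^{\mathcal{L}_V}_{\mathrm{usage}}(l, G^o) \\
L_E(v, G) &= \log(|V|) + \sum_{(v, w) \in E,\ v < w} \left[ \log(|V|) + L^{\mathcal{L}_E}_{\mathrm{usage}}(l_E(v, w), G^o) \right] \\
L(c_P) &= -\log\left( \frac{\mathrm{usage}(P, G^r)}{\sum_{P_i \in \mathcal{P}} \mathrm{usage}(P_i, G^r)} \right) \\
L(\Pi_P) &= \log(|V| + 1) + \log\binom{|V|}{|\Pi_P|} + \sum_{\pi \in \Pi_P} L(c_\pi, P) \\
L(c_\pi, P) &= -\log\left( \frac{\mathrm{usage}(\pi, G^r)}{\sum_{\pi_i \in \Pi_P} \mathrm{usage}(\pi_i, G^r)} \right) \\
L(G^r) &= L_{\mathbb{N}}(|V^r_{port}|) + \sum_{v \in V^r_{emb}} L_{emb}(v, l^r_V(v), G^r) \\
L_{emb}(v, P, G^r) &= L(c_P) + \log(|\Pi_P| + 1) + \sum_{(v, w) \in E^r} \left[ \log(|V^r_{port}|) + L(c_{l^r_E(v, w)}, P) \right]
\end{align*}
```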

The GraphMDL Algorithm
In previous subsections we presented the different MDL definitions that GraphMDL uses to evaluate pattern sets (CTs). A naive algorithm for finding the most descriptive pattern set (in the MDL sense) would create a CT for every possible subset of candidates and retain the one yielding the smallest description length. However, such an approach is often infeasible because of the large number of possible subsets. That is why GraphMDL applies a greedy heuristic algorithm, adapting the Krimp algorithm [10] to our MDL definitions.
Like Krimp, our algorithm starts with a CT composed of all singletons, which we call $CT_0$. One after the other, candidates are added to the CT if they lower the description length. Two heuristics guide GraphMDL: the candidate order and the order of patterns in the CT. We use the same heuristics as Krimp, with the difference that we define the size of a pattern as its total number of labels (vertices and edges). We also implement Krimp's post-acceptance pruning: after a pattern is accepted in the CT, GraphMDL checks whether removing some patterns from the CT lowers the description length $L(M, D)$.
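The greedy loop can be summarized by the following sketch. It is a simplification under stated assumptions: `description_length` is a hypothetical callback returning $L(M, D)$ for a given code table, candidates are assumed to be already sorted according to the candidate order, the ordering of patterns inside the CT is ignored, and pruning tries every non-singleton pattern instead of a more selective criterion.

```python
def greedy_select(singletons, candidates, description_length):
    """Krimp-like greedy search: keep a candidate only if it lowers L(M, D)."""
    code_table = list(singletons)              # CT_0: all singleton patterns
    best = description_length(code_table)
    for candidate in candidates:
        trial = code_table + [candidate]
        trial_length = description_length(trial)
        if trial_length >= best:
            continue                           # candidate rejected
        code_table, best = trial, trial_length
        # Post-acceptance pruning: drop patterns whose removal lowers L(M, D).
        for pattern in list(code_table):
            if pattern in singletons or pattern not in code_table:
                continue                       # singletons always stay in the CT
            pruned = [p for p in code_table if p != pattern]
            pruned_length = description_length(pruned)
            if pruned_length < best:
                code_table, best = pruned, pruned_length
    return code_table, best
```

The callback hides all the MDL definitions of the previous subsections, so the same loop could be reused with any scoring function that maps a list of patterns to a length.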

Experimental Evaluation
In order to evaluate our proposal, we developed a prototype of GraphMDL.
The prototype was developed in Java 1.8 and is available as a git repository (footnote 3).

Datasets
The first two datasets that we use, AIDS-CA and AIDS-CM, are part of the National Cancer Institute AIDS antiviral screen data (footnote 4). They are collections of graphs often used to compare graph mining algorithms [11]. Graphs of this collection represent molecules: vertices are atoms and edges are bonds. We stripped all hydrogen atoms from the molecules, since their positions can be inferred.
We took our third dataset, UD-PUD-En, from the Universal Dependencies project (footnote 5). This project curates a collection of trees describing dependency relationships between words of sentences of multiple corpora in multiple languages.
We used the trees corresponding to the English version of the PUD corpus. Table 1 presents the main characteristics of the three datasets that we use: the number of elementary graphs in the dataset, the total number of vertices, the total number of edges, the number of different vertex labels, and the number of different edge labels. Since GraphMDL works on a single graph instead of a collection, we aggregate collections into a single graph with multiple connected components when needed. We generate candidate patterns by using a gSpan implementation available on its author's website (footnote 6).
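Aggregating a collection into a single graph amounts to a disjoint union, where each graph becomes one or more connected components. A minimal sketch, assuming each graph is given as a hypothetical `(vertex_labels, edge_labels)` pair of dictionaries, is:

```python
def merge_graphs(graphs):
    """Disjoint union of graphs given as (vertex -> labels, (u, v) -> label) pairs."""
    merged_vertices, merged_edges = {}, {}
    offset = 0
    for vertex_labels, edge_labels in graphs:
        # Renumber vertices so that identifiers never collide across graphs.
        renumber = {v: offset + i for i, v in enumerate(sorted(vertex_labels))}
        for v, labels in vertex_labels.items():
            merged_vertices[renumber[v]] = set(labels)
        for (u, v), label in edge_labels.items():
            merged_edges[(renumber[u], renumber[v])] = label
        offset += len(vertex_labels)
    return merged_vertices, merged_edges
```

No edge is ever created between vertices coming from different input graphs, so the original graphs remain separate components of the result.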

Quantitative Evaluation
Table 2 presents the results of the first experiment. For instance, the first line indicates that we ran GraphMDL on the AIDS-CA dataset, using as candidates the 2194 patterns generated by gSpan for a support threshold of 20%. It took 19 minutes for our approach to select a CT composed of 115 patterns, yielding a description length that is 24% of the description length obtained by the singleton-only $CT_0$.
Selected patterns have a median of 9 labels and 3 ports.
We observe that the number of patterns of a CT is often significantly smaller than the number of candidates. This is particularly remarkable for experiments run with small support thresholds, where GraphMDL reduces the number of patterns by a factor of up to 300: patterns generated for these support thresholds probably contain a lot of redundancy, which GraphMDL avoids.
We also note that the description lengths of the CTs found by GraphMDL are between 20% and 40% of the lengths of the baseline code tables $CT_0$, which shows that our algorithm succeeds in finding regularities in the data. Description lengths are smaller when the number of candidates is higher: this may be because with more candidates, there are more chances of finding good candidates that reduce description lengths further.
We observe that GraphMDL can find patterns of non-trivial size, as shown by the median label count in Table 2. Also, most patterns have few ports, which shows that GraphMDL manages to find models in which the original graph is described as a set of components without many connections between them. We think that a human will interpret such a model with more ease, as opposed to a model composed of entangled components.
Footnote 6: https://sites.cs.ucsb.edu/~xyan/software/gSpan.htm

Qualitative Evaluations
Interpretation of rewritten graphs. Fig. 7 shows how GraphMDL uses patterns selected on the AIDS-CM dataset to encode one of the graphs of the dataset (more results are available in our git repository). It illustrates the key idea behind our approach: find a set of patterns so that each one describes part of the data, and connect their occurrences via ports to describe the whole data.
We observe that GraphMDL selects bigger patterns (such as P2), describing big chunks of data, as well as smaller patterns (such as P3, edge singleton), that can form bridges between pattern occurrences. Big patterns increase the description length of the CT, but describe more of the data in a single occurrence, whereas small patterns do the opposite. Following the MDL principle, GraphMDL finds a good balance between the two types of patterns.
It is interesting to note that pattern P1 in Fig. 7 corresponds to the carboxylic acid functional group, common in organic chemistry.GraphMDL selected this pattern without any prior knowledge of chemistry, solely by using MDL.
Comparison with SUBDUE. On the right of Fig. 7 we can observe the encoding found by SUBDUE on the same graph. The main disadvantage of SUBDUE is information loss: we can see that the data is composed of two occurrences of pattern P1, but not how these two occurrences are connected. Thanks to the notion of ports, GraphMDL does not suffer from this problem: the user can know exactly which atoms lie at the boundary of each pattern occurrence.
Assessing patterns through classification. We showed in the previous experiments that GraphMDL manages to reduce the number of patterns, and that the introduction of ports allows for a precise analysis of graphs. We now ask ourselves if the extracted patterns are characteristic of the data. To evaluate this aspect, we adopt the classification approach used by Krimp [10]. We apply GraphMDL independently on each class of a multi-class dataset, and then use the resulting CTs to classify each graph: we encode it with each of the CTs, and classify it in the class whose CT yields the smallest description length $L(D|M)$.
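The classification rule reduces to an argmin over the per-class code tables. In the sketch below, `encoded_length` is a hypothetical callback standing for the computation of $L(D|M)$ for a graph and a code table, and the code tables themselves are opaque objects.

```python
def classify(graph, class_code_tables, encoded_length):
    """Return the class whose code table encodes `graph` with the fewest bits.

    class_code_tables: class name -> code table learned on that class.
    encoded_length: callable (graph, code_table) -> L(D|M) in bits.
    """
    return min(
        class_code_tables,
        key=lambda name: encoded_length(graph, class_code_tables[name]),
    )
```

In case of a tie, `min` returns the first class in iteration order; how ties should be broken is not specified in the text.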
Since GraphMDL is not designed with the goal of classification in mind, we would expect existing classifiers to outperform GraphMDL. In particular, note that patterns are selected on each class independently of other classes. Indeed, GraphMDL follows a descriptive approach whereas classifiers generally follow a discriminative approach. Table 3 presents the results of this new experiment.
We compare GraphMDL with graph classification algorithms found in the literature [8], and a baseline that classifies all graphs as belonging to the largest class. The AIDS-CA/CI dataset is composed of the CA class of the AIDS dataset and a same-size same-labels random sample from the CI class (corresponding to negative examples). The other datasets (footnote 7) are from [8]. We performed a 10-fold validation repeated 10 times and report average accuracies and standard deviations.
GraphMDL clearly outperforms the baseline on two datasets, AIDS and Mutag, but is only comparable to the baseline for the PTC datasets. On Mutag, GraphMDL is less accurate than other classifiers but closer to them than to the baseline. On the PTC datasets, we hypothesize that the learned descriptions are not discriminative w.r.t. the chosen classes, although they are characteristic enough to reduce description length. Nonetheless, results are still better than random guessing (accuracy would be 50%). An interesting point of GraphMDL classification is that it is explainable: the user can look at how the patterns of the two classes encode a graph (similarly to Fig. 7) and understand why one class is chosen over another.

Conclusion
In this paper, we have proposed GraphMDL, an MDL-based pattern mining approach to select a representative set of graph patterns on labeled graphs. We proposed MDL definitions for computing the description lengths necessary to apply the MDL principle. The originality of our approach lies in the notion of ports, which guarantee that the original graph can be perfectly reconstructed, i.e., without any loss of information. Our experiments show that GraphMDL significantly reduces the number of patterns w.r.t. complete approaches. Further, the selected patterns can have complex shapes with simple connections. The introduction of the notion of ports facilitates interpretation w.r.t. SUBDUE.
We plan to apply our approach to more complex graphs, e.g. knowledge graphs.
Footnote 7: For concision, we do not report on PTC-{MM,FM}; they yield similar results.

Fig. 4. Example of a GraphMDL code table over the graph of Fig. 1. Pattern and port usages, and code lengths have been added as illustration and are not part of the table definition. Unused singleton patterns are omitted.

Fig. 5. How the data graph of Fig. 1 is encoded with the code table of Fig. 4. a) Retained occurrences of CT patterns. b) The rewritten graph. Blue squares are pattern embeddings (their label indicates the pattern), white circles are port vertices. Edge labels represent which pattern port corresponds to each port vertex.

Fig. 5 shows the graph of Fig. 1 encoded with the CT of Fig. 4. Embeddings of CT patterns become pattern embedding vertices in the rewritten graph (blue squares). Vertices that are at the boundary between multiple embeddings become port vertices in the rewritten graph (white circles). When an embedding has a port, its pattern embedding vertex in the rewritten graph is connected to the corresponding port vertex and the edge label indicates which pattern's port it is. For instance, the three retained occurrences of pattern $P_1$ all share the same vertex labeled Y (middle of the original graph), thus in the rewritten graph the three corresponding pattern embedding vertices are connected to the same port vertex via port $v_2$.
Let $G^P$ and $G^D$ be graphs. An embedding (or occurrence)

Table 1. Characteristics of the datasets used in the experiments.

Table 1
presents the main characteristics of the three datasets that we use: