Vouw: Geometric Pattern Mining using the MDL Principle

We introduce geometric pattern mining, the problem of finding recurring local structure in discrete, geometric matrices. It differs from existing pattern mining problems by identifying complex spatial relations between elements, resulting in arbitrarily shaped patterns. After we formalise this new type of pattern mining, we propose an approach to selecting a set of patterns using the Minimum Description Length principle. We demonstrate the potential of our approach by introducing Vouw, a heuristic algorithm for mining exact geometric patterns. We show that Vouw delivers high-quality results with a synthetic benchmark.


Introduction
Frequent pattern mining [1] is the well-known subfield of data mining that aims to find and extract recurring substructures from data, as a form of knowledge discovery. The generic concept of pattern mining has been instantiated for many different types of patterns, e.g., for item sets (in Boolean transaction data), subgraphs (in graphs/networks), and episodes (in sequences). So far, however, little research has been done on pattern mining for raster-based data, i.e., geometric matrices in which the row and column orders are fixed. The exception is geometric tiling [3,10], but that problem only considers tiles, i.e., rectangular-shaped patterns, in Boolean data.
In this paper we generalise this setting in two important ways. First, we consider geometric patterns of any shape that are geometrically connected, i.e., it must be possible to reach any element from any other element in a pattern by only traversing elements in that pattern. Second, we consider discrete geometric data with any number of possible values (which includes the Boolean case). We call the resulting problem geometric pattern mining.   Figure 1a shows a 32×24 grayscale 'geometric matrix', with each element in [0, 255], apparently filled with noise. If we take a closer look at all horizontal pairs of elements, however, we find that the pair (146, 11) is, amongst others, more prevalent than expected from 'random noise' (Figure 1b). If we would continue to try all combinations of elements that 'stand out' from the background noise, we would eventually find four copies of the letter 'I' set in 16 point Garamond Italic (Figure 1c).
The 35 elements that make up a single 'I' in the example form what we call a geometric pattern. Since its four occurrences jointly cover a substantial part of the matrix, we could use this pattern to describe the matrix more succinctly than by 768 independent values. That is, we could describe it as the pattern 'I' at locations (5,4), (11,11), (20, 3), (25, 10) plus 628 independent values, hereby separating structure from accidental (noise) data. Since the latter description is shorter, we have compressed the data. At the same time we have learned something about the data, namely that it contains four I's. This suggests that we can use compression as a criterion to find patterns that describe the data.

Approach and contributions.
Our first contribution is that we introduce and formally define geometric pattern mining, i.e., the problem of finding recurring local structure in geometric, discrete matrices. Although we restrict the scope of this paper to two-dimensional data, the generic concept applies to higher dimensions. Potential applications include the analysis of satellite imagery, texture recognition, and (pattern-based) clustering.
We distinguish three types of geometric patterns: 1) exact patterns, which must appear exactly identical in the data to match; 2) fault-tolerant patterns, which may have noisy occurrences and are therefore better suited to noisy data; and 3) transformation-equivalent patterns, which are identical after some transformation (such as mirror, inverse, rotate, etc.). Each consecutive type makes the problem more expressive and hence more complex. In this initial paper we therefore restrict the scope to the first, exact type.
As many geometric patterns can be found in a typical matrix, it is crucial to find a compact set of patterns that together describe the structure in the data well. We regard this as a model selection problem, where a model is defined by a set of patterns. Following our observation above, that geometric patterns can be used to compress the data, our second contribution is the formalisation of the model selection problem by using the Minimum Description Length (MDL) principle [7,4]. Central to MDL is the notion that 'learning' can be thought of as 'finding regularity' and that regularity itself is a property of data that is exploited by compressing said data. This matches very well with the goals of pattern mining, as a result of which the MDL principle has proven very successful for MDL-based pattern mining [11,6].
Finally, our third contribution is Vouw, a heuristic algorithm for MDL-based geometric pattern mining that (1) finds compact yet descriptive sets of patterns, (2) requires no parameters, and (3) is tolerant to noise in the data (but not in the occurrences of the patterns). We empirically evaluate Vouw on synthetic data and demonstrate that it is able to accurately recover planted patterns.

Related Work
As the first pattern mining approach using the MDL principle, Krimp [11] was one of the main sources of inspiration for this paper. Many papers on pattern-based modelling using MDL have appeared since, both improving search, e.g., Slim [9], and extensions to other problems, e.g., Classy [6] for rule-based classification.
The problem closest to ours is probably that of geometric tiling, as introduced by Gionis et al. [3] and later also combined with the MDL principle by Tatti and Vreeken [10]. Geometric tiling, however, is limited to Boolean data and rectangularly shaped patterns (tiles); we strongly relax both these limitations (but as of yet do not support patterns based on densities or noisy occurrences).
Campana et al. [2] also use matrix-like input data (textures) and develop a compression-based similarity measure. Their method, however, cannot be used for explanatory data analysis as it relies on a generic image compression algorithm that is essentially a black box.

Geometric Pattern Mining using MDL
We define geometric pattern mining on bounded, discrete and two-dimensional raster-based data. We represent this data as an M × N matrix A whose rows and columns are finite and in a fixed ordering (i.e., reordering rows and columns semantically alters the matrix). Each element a i,j ∈ S, where row i ∈ [0; N ), column j ∈ [0; M ), and S is a finite set of symbols, i.e., the alphabet of A.
According to the MDL principle, the shortest (optimal) description of A reveals all structure of A in the most succinct way possible. This optimal description is only optimal if we can unambiguously reconstruct A from it and nothing more-the compression is both minimal and lossless. Figure 2 illustrates how an example matrix could be succinctly described using patterns: matrix A is decomposed into patterns X and Y . A set of such patterns constitutes the model for a matrix A, denoted H A (or H for short when A is clear from the context). In order to reconstruct A from this model, we also need a mapping from the H A back to A. This mapping represents what (two-part) MDL calls the the data given the model H A . In this context we can think of this as a set of all instructions required to rebuild A from H A , which we call the instantiation of H A and is denoted by I in the example. These concepts allow us to express matrix A as a decomposition into sets of local and global spatial information, which we will next describe in more detail.

Patterns and Instances
We define a pattern as an M X × N X submatrix X of the original matrix A. Elements of this submatrix may be ·, the empty element, which gives us the ability to cut-out any irregular-shaped part of A. We additionally require the elements of X to be adjacent (horizontal, vertical or diagonal) to at least one non-empty element and that no rows and columns are empty.
From this definition, the dimensions M X × N X give the smallest rectangle around X (the bounding box). We also define the cardinality |X| of X as the number of non-empty elements. We call a pattern X with |X| = 1 a singleton pattern, i.e., a pattern containing exactly one element of A.
Each pattern contains a special pivot element: pivot(X) is the first non-empty element of X. A pivot can be thought of as a fixed point in X which we can use to position its elements in relation to A. This translation, or offset, is a tuple q = (i, j) that is on the same domain as an index in A. We realise this translation by placing all elements of X in an empty M × X size matrix such that the pivot element is at (i, j). We formalise this in the instantiation operator ⊗: We define the instance X ⊗ (i, j) as the M × N matrix containing all elements of X such that pivot(X) is at index (i, j) and the distances between all elements are preserved. The resulting matrix contains no additional non-empty elements.
Since this does not yield valid results for arbitrary offsets (i, j), we enforce two constraints: (1) an instance must be well-defined: placing pivot(X) at index (i, j) must result in an instance that contains all elements of X, and (2) elements of instances cannot overlap, i.e., each element of A can be described only once.

Two pattern instances
From here on we will use the same letter in lower case to denote an arbitrary instance of a pattern, e.g., x = X ⊗ q when the exact value of q is unimportant. Since instances are simply patterns projected onto an M × N matrix, we can reverse ⊗ by removing all completely empty rows and columns: Let X ⊗ q be an instance of X, then by definition we say that (X ⊗ q) = X.
We briefly introduced the instantiation I as a set of 'instructions' of where instances of each pattern should be positioned in order to obtain A. As Figure 2 suggests, this mapping has the shape of an M × N matrix.

The Problem and its Solution Space
Larger patterns can be naturally constructed by joining (or merging) smaller patterns in a bottom-up fashion. To limit the considered patterns to those relevant to A, instances can be used as an intermediate step. As Figure 3 demonstrates, we can use a simple element-wise matrix addition to sum two instances and use . 3: Example of joining patterns X and Y to construct a new pattern Z.
to obtain a joined pattern. Here we start by instantiating X and Y with offsets (1, 0) and (1, 1), respectively. We add the resulting x and y to obtain z, the union of X and Y with relative offset (1, 1) − (1, 0) = (0, 1). The Sets H A and I A . We define the model class H as the set of all possible models for all possible inputs. Without any prior knowledge, this would be the search space. To simplify the search, however, we only consider the more bounded subset H A of all possible models for A, and I A , the set of all possible instantiations for these models. To this end we first define H 0 A to be the model with only singleton patterns, i.e., H 0 A = S, and denote its corresponding instantiation matrix by I 0 A . Given that each element of I 0 A must correspond to exactly one element of A in H 0 A , we see that each I i,j = a i,j and so we have I 0 A = A. Using H 0 A and I 0 A as base cases we can now inductively define I A : in lexicographical order. Then the set I is also in I A , providing I equals I except: I k,l := · This shows we can add any two instances together, in any order, as they are by definition always non-overlapping and thus valid in A, and hereby obtain another element of I A . Eventually this results in just one big instance that is equal to A. Note that when we take two elements I i,j , I k,l ∈ I we force (i, j) ≤ (k, l), not only to eliminate different routes to the same instance matrix, but also so that the pivot of the new pattern coincides with I i,j . We can then leave I k,l empty.
The construction of I A also implicitly defines H A . While this may seem odd-defining models for instantiations instead of the other way around-note that there is no unambiguous way to find one instantiation for a given model. Instead we find the following definition by applying the inductive construction: So for any instantiation I ∈ I A there is a corresponding set in H A of all patterns that occur in I. This results in an interesting symbiosis between model and instantiation: increasing the complexity of one decreases that of the other. This construction gives a tightly connected lattice as shown in Figure 4.

Encoding Models and Instances
From all models in H A we want to select the model that describes A best. Two-part MDL [4] tells us to choose that model that minimises the sum of Fig. 4: Model space lattice for a 2 × 2 Boolean matrix. The V, W, and Z columns show which pattern is added in each step, while I depicts the current instantiation.
where L 1 and L 2 are two functions that give the length of the model and the length of 'the data given the model', respectively. In this context, the data given the model is given by I A , which represents the accidental information needed to reconstruct the data A from H A .
In order to compute their lengths, we need to decide how to encode H A and I. As this encoding is of great influence on the outcome, we should adhere to the conditions that follow from MDL theory: (1) the model and data must be encoded losslessly; and (2) the encoding should be as concise as possible, i.e., it should be optimal. Note that for the purpose of model selection we only need the length functions; we do not need to actually encode the patterns or data.

Code length functions. Although the patterns in H and instantiation matrix
I are all matrices, they have different characteristics and thus require different encodings. For example, the size of I is constant and can be ignored, while the sizes of the patterns vary and should be encoded. Hence we construct different length functions 1 for the different components of H and I, as listed in Table 1.
When encoding I, we observe that it contains each pattern X ∈ H multiple times, given by the usage of X. Using the prequential plug-in code [4] to encode I enables us to omit encoding these usages separately, which would create unwanted bias. The prequential plug-in code gives us the following length function for I. We use = 0.5 and elaborate on its derivation in the Appendix 2 .
Each length function has four terms. First we encode the total size of the matrix. Since we assume M N to be known/constant, we can use this constant to define the uniform distribution 1 M N , so that log M N encodes an arbitrary index of A. Next we encode the number of elements that are non-empty. For patterns Table 1: Code length definitions. Each row specifies the code length given by the first column as the sum of the remaining terms.

Matrix
Bounds # Elements Positions Symbols this value is encoded together with the third term, namely the positions of the non-empty elements. We use the previously encoded M X N X in the binominal function to enumerate the ways we can place the |X| elements onto a grid of M X N X . This gives us both how many non-empties there are as well as where they are. Finally the fourth term is the length of the actual symbols that encode the elements of matrix. In case we encode single elements of A, we assume that each unique value in A occurs with equal probability; without other prior knowledge, using the uniform distribution has minimax regret and is therefore optimal. For the instance matrix, which encodes symbols to patterns, the prequential code is used as demonstrated before. Note that L N is the universal prior for the integers [8], which can be used for arbitrary integers and penalises larger integers.

The Vouw Algorithm
Pattern mining often yields vast search spaces and geometric pattern mining is no exception. We therefore use a heuristic approach, as is common in MDL-based approaches [11,9,6]. We devise a greedy algorithm that exploits the inductive definition of the search space as shown by the lattice in Figure 4. We start with a completely underfit model (leftmost in the lattice), where there is one instance for each matrix element. Next, in each iteration we combine two patterns, resulting in one or more pairs of instances to be merged (i.e., we move one step right in the lattice). In each step we merge the pair of patterns that improves compression most, and we repeat this until no improvement is possible.

Finding candidates
The first step is to find the 'best' candidate pair of patterns for merging (Algorithm 1). A candidates is denoted as a tuple (X, Y, δ), where X and Y are patterns and δ is the relative offset of X and Y as they occur in the data. Since we only need to consider pairs of patterns and offsets that actually occur in the instance matrix, we can directly enumerate candidates from the instantiation matrix and never even need to consider the original data. The support of a candidate, written sup(X, Y, δ), tells how often it is found in the instance matrix. Computing support is not completely trivial, as one candidate occurs multiple times in 'mirrored' configurations, such as (X, Y, δ) and (Y, X, −δ), which are equivalent but can still be found separately. Furthermore,

Algorithm 2 Vouw
Input: H, I 1: for all xi ∈ I | (xi) = X do 8: for all y ∈ POST(xi) | (y) = Y do 9: xi ← Z, y ← · 10: end for 11: end for 12: end if 13: repeat until ∆L best < 0 due to the definition of a pattern, many potential candidates cannot be considered by the simple fact that their elements are not adjacent.

Peripheries.
For each instance x we define its periphery: the set of instances which are positioned such that their union with x produces a valid pattern. This set is split into the anterior-ANT(X) and posterior POST(X) peripheries, containing instances that come before and after x in lexicographical order, respectively. This enables us to scan the instance matrix once, in lexicographical order. For each instance x, we only consider the instances POST(x) as candidates, thereby eliminating any (mirrored) duplicates.
Self-overlap. Self-overlap happens for candidates of the form (X, X, δ). In this case, too many or too few copies may be counted. Take for example a straight line of five instances of X. There are four unique pairs of two X's, but only two can be merged at a time, in three different ways. Therefore, when considering candidates of the form (X, X, δ), we also compute an overlap coefficient. This coefficient e is given by e = (2N X + 1)δ i + δ j + N X , which essentially transforms δ into a one-dimensional coordinate space of all possible ways that X could be arranged after and adjacent to itself. For each instance x 1 a vector of bits V (x) is used to remember if we have already encountered a combination x 1 , x 2 with coefficient e, such that we do not count a combination x 2 , x 3 with an equal e. This eliminates the problem of incorrect counting due to self-overlap.

Gain computation
After candidate search we have a set of candidates C and their respective supports. The next step is to select the candidate that gives the best gain: the improvement in compression by merging the candidate pair of patterns. For each candidate c = (X, Y, δ) the gain ∆L(A , c) is comprised of two parts: (1) the negative gain of adding the union pattern Z to the model H, resulting in H , and (2) the gain of replacing all instances x, y with relative offset δ by Z in I, resulting in I . We use length functions L 1 , L 2 to derive an equation for gain: As we can see, the terms with L 1 are simplified to −L p (Z) and the model's length because L 1 is simply a summation of individual pattern lengths. The equation of L 2 requires the recomputation of the entire instance matrix' length, which is expensive considering we need to perform it for every candidate, every iteration. However, we can rework the function L pp in Equation (2) by observing that we can isolate the logarithms and generalise them into which can be used to rework the second part of Equation (3) in such way that the gain equation can be computed in constant time complexity.
Notice that in some cases the usages of X and Y are equal to that of Z, which means additional gain is created by removing X and Y from the model.

Mining a Set of Patterns
In the second part of the algorithm, listed in Algorithm 2, we select the candidate (X, Y, δ) with the largest gain and merge X and Y to form Z, as explained in Section 3.2. We linearly traverse I to replace all instances x and y with relative offset δ by instances of Z. (X, Y, δ) was constructed by looking in the posterior periphery of all x to find Y and δ, which means that Y always comes after X in lexicographical order. The pivot of a pattern is the first element in lexicographical order, therefore pivot(Z) = pivot(X). This means that we can replace all matching x with an instance of Z and all matching y with ·.

Improvements
Local search. To improve the efficiency of finding large patterns without sacrificing the underlying idea of the original heuristics, we add an optional local search. Observe that without local search, Vouw generates a large pattern X by adding small elements to an incrementally growing pattern, resulting in a behaviour that requires up to |X| − 1 steps. To speed this up, we can try to 'predict' which elements will be added to X and add them immediately. After selecting candidate (X, Y, δ) and merging X and Y into Z, for all m resulting instances z i ∈ z 0 , . . . , z m−1 we try to find pattern W and offset δ such that This yields zero or more candidates (Z, W, δ), which are then treated as any set of candidates: candidates with the highest gain are iteratively merged until no candidates with positive gain exist. This essentially means that we run the baseline algorithm only on the peripheries of all z i , with the condition that the support of the candidates is equal to that of Z.
Reusing candidates. We can improve performance by reusing the candidate set and slightly changing the search heuristic of the algorithm. The Best-* heuristic selects multiple candidates on each iteration, as opposed to the baseline Best-1 heuristic that only selects a single candidate with the highest gain. Best-* selects candidates in descending order of gain until no candidates with positive gain are left. Furthermore we only consider candidates that are all disjoint, because when we merge candidate (X, Y, δ), remaining candidates with X and/or Y have unknown support and therefore unknown gain.

Experiments
To asses Vouw's practical performance we primarily use Ril, a synthetic dataset generator developed for this purpose. Ril utilises random walks to populate a matrix with patterns of a given size and prevalence, up to a specified density, while filling the remainder of the matrix with noise. Both the pattern elements and the noise are picked from the same uniform random distribution on the interval [0, 255]. The signal-to-noise ratio (SNR) of the data is defined as the number of pattern elements over the matrix size M N . The objective of the experiment is to assess whether Vouw recovers all of the signal (the patterns) and none of the noise. Figure 5 gives an overview of what the generated data looks like, and how it is mined and evaluated.
Implementation. The implementation 3 used consists of the Vouw algorithm (written in vanilla C/C++), a GUI, and the synthetic benchmark Ril. Experiments were performed on an Intel Xeon-E2630v3 with 512GB RAM.
Evaluation. Completely random data (noise) is unlikely to be compressed. The SNR tells us how much of the data is noise and thus conveniently gives us an upper bound of how much compression could be achieved. We use the ground truth SNR versus the resulting compression ratio as a benchmark to tell us how close we are to finding all the structure in the ground truth.
In addition, we also compare the ground truth matrix to the obtained model and instantiation. As singleton patterns do not yield any compression over the baseline model, we reconstruct the matrix omitting any singleton patterns. Ignoring the actual values, this gives us a Boolean matrix with 'positives' (pattern occurrence=signal) and 'negatives' (no pattern=noise). By comparing each element in this matrix with the corresponding element in the ground truth matrix, precision and recall can be calculated and evaluated. Figure 6 (left) shows the influence of ground truth SNR on compression ratio for different matrix sizes. Compression ratio and SNR are clearly strongly correlated. Figure 6 (right) shows that patterns with a low prevalence (i.e., number of planted occurrences) have a lower probability of being 'detected' by the algorithm as they are more likely to be accidental/noise. Increasing the matrix size also increases this threshold. In Table 2 we look at the influence of the two improvements upon the baseline algorithm as described in Section 4.4. In terms of quality, local search can improve the results quite substantially while Best-* notably lowers precision. Both improve speed by an order of magnitude.

Conclusions
We introduced geometric pattern mining, the problem of finding recurring structures in discrete, geometric matrices, or raster-based data. Further, we presented Vouw, a heuristic algorithm for finding sets of geometric patterns that are good descriptions according to the MDL principle. The baseline algorithm is capable of accurately recovering patterns from synthetic data, and the resulting compression ratios are on par with the expectations based on the density of the data. Of the two improvements, especially the local search appears valuable as it improves precision and recall as well as runtime. For the future, we think that extensions to fault-tolerant patterns and clustering have large potential.