1 Introduction

Our aim is to review the development of pattern mining methods based on and inspired by the Minimum Description Length (MDL) principle. We strive for completeness in our coverage of these methods, even though this is admittedly an unrealistic goal.

Mining frequent patterns is a core task in data mining, and itemsets are probably the most elementary and best studied type of pattern. Soon after the introduction of the frequent itemset mining task (Agrawal et al. 1993), it became obvious that beyond issues of efficiency (Agrawal and Srikant 1994; Mannila et al. 1994), the problem of selecting patterns constituted a major challenge to tackle, lest the analysts drown under the deluge of extracted patterns.

Various properties and measures have been introduced to select itemsets (Webb and Vreeken 2013; Geng and Hamilton 2006). They include identifying representative itemsets (Bastide et al. 2000), using user-defined constraints to filter itemsets (De Raedt and Zimmermann 2007; Soulet et al. 2011; Guns et al. 2013), considering dependencies between itemsets (Jaroszewicz and Simovici 2004; Yan et al. 2005; Tatti and Heikinheimo 2008; Mampaey 2010) and trying to evaluate the statistical significance of itemsets (Webb 2007; Gionis et al. 2007; Tatti 2010; Hämäläinen and Webb 2018), as well as looking into alternative approaches for exploring the search space (Boley et al. 2011; Guns et al. 2011). Initially focused on individual itemsets, approaches were later introduced to evaluate itemsets collectively, trying for instance to identify and remove redundancy (Gallo et al. 2007), including in an iterative manner (Hanhijärvi et al. 2009; Boley et al. 2013). The goal hence moved from mining collections of good patterns to mining good collections of patterns, which is also the main objective when applying the MDL principle.

Before we go any further, let us explain more precisely what we consider, for the present purpose, to constitute patterns. Patterns are about repetitions. We adopt the point of view that patterns express the repeated presence in the data of particular items, attribute values or other discrete properties. We divide them into two main categories.

Itemsets are strict conjunctive patterns that require the occurrence of all involved items for the pattern to be considered as occurring in a transaction. For a given dataset, it is thus straightforward to determine where an itemset occurs and where it does not, that is, to compute the pattern’s support. Vice versa, when it is known that an itemset occurs in a given transaction, no further information is necessary, as it implies that all items must be present. These concepts naturally extend beyond transactional data. However, such occurrence requirements are fairly strict, and it can be useful to consider more relaxed patterns. In particular, one might use disjunctions, allowing patterns to express a choice between involved items or attributes. Given a dataset, one can then still straightforwardly determine where a pattern holds. On the other hand, knowing that the pattern occurs no longer unambiguously provides information about which item or attribute is present. More or less additional information is needed, depending for instance on whether it is an inclusive or exclusive disjunction. We refer to such patterns that express the presence of a specific substructure in the data as substructure patterns.

As an alternative way to relax occurrence requirements, patterns might express that selected items or properties are typically present but not all need always occur, for instance by means of density thresholds. In that case, which data instances belong to the support of the pattern, or vice-versa where each item or property holds, needs to be specified explicitly as it is not directly implied. Rather than the occurrence of a specific substructure, it is then the homogeneity of repetitions within the area delimited by the selected instances and attributes that is of interest. Because such patterns can be seen as delineating homogeneous rectangles in the data, we refer to them as block patterns.

Furthermore, one is typically looking for a collection of patterns and might impose constraints on the overlap between them, that is, require that the patterns involve disjoint sets of attributes, characterise disjoint sets of instances, or both. In particular, the patterns might be required to form a partition of the data, dividing all instances and all attributes into disjoint subsets. Such a partitioning requirement is incompatible with a strict occurrence requirement in practice, in the sense that it is not in general possible to identify a collection of substructure patterns that forms a partition of the data. On the other hand, block patterns might be required to form a partition of the data, corresponding roughly to biclustering approaches, or they might be allowed to overlap, as in tiling approaches.

In summary, we adopt a rather broad definition of what constitute patterns, from itemsets to biclusters over discrete data. However, we stop short of considering clusters more in general as patterns. Clustering constitutes another important field of data mining beside—and partially overlapping with—pattern mining. There, the goal is to organise data instances into groups such that instances within the same group are similar to each other and dissimilar to instances in other groups. Clustering often handles continuous data, typically relying on a concept of distance. Here, we focus on formalisms and methods that are by nature more discrete.

The reader is expected to be familiar with common pattern mining tasks and techniques, but not necessarily with concepts from information theory and coding, of which we therefore give an overview in Sect. 2, before presenting the two main encoding strategies that use respectively substructure and block patterns.

In Sects. 3, 4, 5 and 6, we turn to the review of MDL-based methods for pattern mining proper. The methods are grouped first by the type of data, then by the type of patterns considered, as outlined in Fig. 1. Our focus is on the various encodings designed for these different types of data and patterns, rather than on algorithmic issues related to searching the patterns. In Sect. 3, we start with one of the thickest branches, stemming from the Krimp algorithm for mining itemsets from transactional data. We continue with itemsets and with other types of patterns for tabular data in Sect. 4, followed by graphs and temporal data in Sects. 5 and 6, respectively. Finally, we consider some discussion points in Sect. 7. For further context about the development of the principle and discussion of related problems, please refer to the extended manuscript (footnote 1). In addition, the main characteristics and bibliographic details of publications from Sects. 3, 4, 5 and 6 have been collected into a searchable table (footnote 2).

Fig. 1 Organisation of Sects. 3, 4, 5 and 6. MDL-based methods for pattern mining are grouped first by the type of data (tabular, graph or temporal), then by the type of patterns and strategies considered (blocks or substructures). Numbers refer to sections. The three main data types and their subtypes are represented by coloured shapes. Simple unlabelled graphs can be represented as binary matrices. Thus, some methods designed for binary data can be applied to them, and vice versa, some graph-mining methods in effect process binary data. The corresponding sections are therefore represented as lying at the intersection between binary and graph data. Dashed and dotted lines are used to group methods associated to the two main strategies (cf. Sect. 2.3). Block-based strategies are used to mine block patterns (also tiles and segments) that group together elements that are similarly distributed. On the other hand, dictionary-based strategies are used to mine substructure patterns (also motifs and episodes) that capture specific arrangements and co-occurrences between elements

2 The basics in a nutshell

The Minimum Description Length (MDL) principle is a model selection method grounded in information theory.

The 1948 article by Shannon is widely seen as the birth certificate of information theory, while the introduction of the Minimum Description Length principle can be dated back to the seminal paper by Rissanen in 1978. The textbooks by Stone (2013) and by Cover and Thomas (2012) provide respectively an accessible introduction and a more detailed account of information theory and related concepts, while the textbook by Grünwald (2007) is often regarded as the major reference about the MDL principle.

The MDL principle is maybe the most popular among several similar principles. It can be seen as a practical variant of the Kolmogorov complexity, according to which the complexity of an object, for instance a dataset, is the length of the shortest computer program that outputs it. The idea is that regularities in the object can be exploited to make the program more concise. For example, the string that consists of a thousand repetitions of the word baobab can be produced with the following short program: for i in range(1000): print('baobab'). To output instead a string of the same length where the three letters appear in a completely random manner, we have no choice but to embed the string in the program, resulting in a much longer program. The second string is thus considered to be more complex than the first.
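This intuition can be illustrated, very roughly, by running an off-the-shelf compressor on the two kinds of strings. The sketch below uses Python's zlib as a crude stand-in for the shortest-program idea; it is an illustration only, not part of the MDL machinery itself, and the exact byte counts depend on the compressor.

import random
import zlib

# A highly regular string versus a string over the same letters in random order.
regular = b"baobab" * 1000
random.seed(0)
noisy = bytes(random.choice(b"abo") for _ in range(len(regular)))

print(len(zlib.compress(regular)))  # small: the repetition can be exploited
print(len(zlib.compress(noisy)))    # much larger: no structure to exploit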

However, the Kolmogorov complexity cannot be computed in general. The MDL principle makes it practical by considering more restricted description methods rather than a universal programming language. Specifically, when working with the MDL principle, a class of models of the data is considered and the model that yields the most concise representation of the data is selected.

2.1 Encoding data

Clearly, this is akin to data encoding, where the aim is to map the input object to a sequence, referred to as its description, from which the original object can then be reconstructed. In practical scenarios of data encoding, the aim might be to efficiently and reliably either store the object on a storage medium or transmit it over a communication channel. The system of rules that dictates how the object is converted into the description and back is called a code. The processes of mapping the object to its description and of reconstructing it are called encoding and decoding, respectively. The considered storage or channel is typically binary, meaning that the object is mapped to a binary sequence, i.e. over the alphabet \(\{0, 1\}\), whose length is hence measured in bits. There have been many studies of communication through noisy channels, that is, when errors might be introduced into the transmitted sequence and must be recovered from, but this is not of interest to us. Instead of noise-resistant codes, we focus purely on data compression, that is, on obtaining compact descriptions. In general, data compression can be either lossless or lossy, depending on whether the source object can be reconstructed exactly or only approximately.

Typically, encoding an object means mapping its elementary parts to binary codewords and concatenating them. Care must be taken to ensure that the resulting bit-string is decodable, that is, that it can be broken down back into the original codewords from which the parts can be recovered. For instance, imagine the data consist of the outcomes of five throws of a twelve-sided die, i.e. a list of five integers in the interval [1, 12]. If we simply turn the values into their binary representations and concatenate them, we might not be able to reconstruct the list of integers. For example, there is no way to tell whether 1110110110 stands for 11 10 1 10 110, i.e. \(\langle 3, 2, 1, 2, 6\rangle \), or for 1 110 1 101 10, i.e. \(\langle 1, 6, 1, 5, 2\rangle \). To avoid this kind of confusion, we want our code to be such that there is a single unambiguous way to split an encoded string into codewords. One strategy is to use separators. For instance, we might represent each integer by as many 1s, each terminated by a 0, so that \(\langle 3, 2, 1, 2, 6\rangle \) becomes 1110110101101111110.
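As a small sketch of this example (reproducing the two bit-strings above), the ambiguous naive concatenation and the unary code with terminating 0s can be written as follows; the decoding step shows that the latter can be split back unambiguously.

values = [3, 2, 1, 2, 6]

# Naive concatenation of binary representations: ambiguous.
naive = "".join(format(v, "b") for v in values)
print(naive)   # 1110110110 -- admits several ways of splitting it into values

# Unary code: each value as that many 1s, terminated by a 0.
unary = "".join("1" * v + "0" for v in values)
print(unary)   # 1110110101101111110

# Decoding the unary string: split on the terminating 0s and count the 1s.
decoded = [len(run) for run in unary.split("0")[:-1]]
assert decoded == values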

More generally, this is where the prefix-free property becomes very useful. A prefix-free code (also, somewhat confusingly, known as prefix code or instantaneous code) is such that no extension of a codeword can itself be a codeword.

Fixed-length codes (a.k.a. uniform codes), which assign distinct codewords of the same length to every symbol in the input alphabet, clearly satisfy the prefix-free property. For an input alphabet consisting of n distinct symbols, the codewords must be of length \(\lceil \log _2(n) \rceil \). Such a code minimises the worst-case codeword length, that is, the longest codeword is as short as possible. With such a code, every symbol costs the same number of bits. It is therefore a good option for pointing out an element among canonically ordered options under a uniform distribution, without a priori bias. For example, the sequence baeaecdaeeccbc over the five-letter alphabet \(\langle \texttt {a}, \texttt {b}, \texttt {c}, \texttt {d}, \texttt {e}\rangle \) might be encoded in 42 bits, as

$$\begin{aligned} \texttt {001~000~100~000~100~010~011~000~100~100~010~010~001~010}\;. \end{aligned}$$

When symbols are not uniformly distributed, using codewords of different lengths can result in codes that are more efficient on average. There are only so many short codewords available, so one needs to choose wisely what they are used for. Intuitively, symbols that are more frequent should be assigned shorter codewords. The Kraft–McMillan inequality (also known simply as Kraft inequality) gives a necessary and sufficient condition on codeword lengths for the existence of a prefix-free code with those lengths. Specifically, it states that, for a finite input alphabet \({\mathcal {A}}\), the codeword lengths for any prefix-free code C must satisfy

$$\begin{aligned} \sum _{x \in {\mathcal {A}}} 2^{-L_C(x)} \le 1\;, \end{aligned}$$

where \(L_C(x)\) denotes the length of the codeword assigned by C to symbol x. Vice versa, given codeword lengths satisfying this inequality, there exists a prefix-free code with these codeword lengths. Furthermore, if P is a probability distribution over a discrete input alphabet \({\mathcal {A}}\), there exists a prefix-free code C such that for all \(x \in {\mathcal {A}}\), \(L_C(x) = \lceil -\log _2\left( P(x)\right) \rceil \). A pair of related techniques for constructing such variable-length prefix-free codes is commonly referred to as Shannon–Fano coding. Moreover, given an input alphabet \({\mathcal {A}}\) where each symbol \(x_i\) has an associated weight \(w_i\) that might represent, in particular, its frequency of occurrence, a code C is optimal if for any other code \(C'\) we have

$$\begin{aligned} \sum _{x_i \in {\mathcal {A}}} w_i L_C(x_i) \le \sum _{x_i \in {\mathcal {A}}} w_i L_{C'}(x_i)\;. \end{aligned}$$
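To make these quantities concrete, the sketch below derives Shannon–Fano-style codeword lengths \(\lceil -\log _2(P(x)) \rceil \) from the empirical frequencies of the example sequence and checks that they satisfy the Kraft–McMillan inequality (an illustration only; no actual codewords are constructed here).

from collections import Counter
from math import ceil, log2

seq = "baeaecdaeeccbc"
counts = Counter(seq)                     # a: 3, b: 2, c: 4, d: 1, e: 4
total = sum(counts.values())

# Shannon-Fano-style lengths derived from the empirical distribution.
lengths = {x: ceil(-log2(c / total)) for x, c in counts.items()}
print(lengths)                            # a and b get 3 bits, c and e get 2 bits, d gets 4 bits

# The Kraft-McMillan inequality guarantees that a prefix-free code with these lengths exists.
assert sum(2 ** -l for l in lengths.values()) <= 1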

Huffman’s algorithm is an ingeniously simple algorithm for constructing an optimal prefix-free code. Considering the example sequence baeaecdaeeccbc again, Huffman’s algorithm would assign shorter codewords to the more frequent letters a, c and e, while ensuring that the prefix-free property is satisfied, making it possible, for instance, to encode the sequence in just 31 bits, as

$$\begin{aligned} \texttt {011~00~11~00~11~10~010~00~11~11~10~10~011~10}\;. \end{aligned}$$
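A minimal sketch of Huffman’s construction is given below; tie-breaking, and hence the exact codewords, may differ from the ones shown above, but any optimal prefix-free code yields the same total of 31 bits for this sequence.

import heapq
from collections import Counter

def huffman_lengths(weights):
    # Repeatedly merge the two lightest subtrees; track the depth of each symbol.
    heap = [(w, i, {x: 0}) for i, (x, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {x: depth + 1 for x, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]

counts = Counter("baeaecdaeeccbc")                     # a: 3, b: 2, c: 4, d: 1, e: 4
lengths = huffman_lengths(counts)
print(lengths)                                          # 2 bits for a, c, e; 3 bits for b, d
print(sum(counts[x] * l for x, l in lengths.items()))   # 31 bits in total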

Note that to use such a code, the alphabet and the associated probability distribution must be shared by the sender and the receiver. Indeed, it is not enough that the receiver is able to recover the transmitted codewords, the receiver must also know which symbol is represented by each codeword. In order to transmit a sequence of symbols, a simple way to proceed is therefore to first transmit the information about the distribution, then to transmit the actual sequence using Shannon–Fano coding, resulting in a two-part code. For example, we would first need to transmit the occurrence counts of the five letters, or equivalently their assigned codewords, according to some agreed protocol, before sending the actual encoded 31-bit sequence.

Relying on a sequential prediction, or prequential, strategy provides an alternative for coding that does not require the probability distribution to be known a priori. Simply put, the idea is to start with some initial probability distribution over the alphabet, e.g. uniform, and at each step encode the next element using a prefix-free code based on the current distribution, then update the probability distribution to account for this occurrence. These predictive plug-in codes have useful properties. In particular, the total code length does not depend on the order in which the elements are encoded, and is within a constant factor of the optimal.

Let us consider the sequence baeaecdaeeccbc as an example once more. We might start with occurrence counts initialised to one for all five letters, and run Huffman’s algorithm to assign them codewords accordingly. Having agreed on this protocol, the sender and the receiver obtain the same initial codewords, without the sender first having to explicitly share the information about the distribution.

The sender first transmits the codeword for letter b, which the receiver can correctly decode using the initial code. Both sides increment their occurrence count of b by one, and update their codewords by running Huffman’s algorithm again. Note that at this point, b has the highest occurrence count and will thus be assigned a short codeword (2 bits). Next, the sender transmits the new codeword for a, which the receiver is able to correctly decode. The occurrence counts are incremented and the codewords updated, on both sides in parallel. And so on, until the entire sequence has been encoded. Crucially, as they apply the same updating protocol in parallel, relying only on information shared so far, sender and receiver maintain identical codes at every step. Later in the process, the occurrence counts of the letters approach their overall frequencies and the lengths of the assigned codewords converge towards those obtained using Shannon–Fano coding with prior knowledge of the distribution.
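The idealised (non-integer) code length of such a prequential plug-in code can be computed directly from the running counts, as in the hedged sketch below, which uses initial counts of one as in the walkthrough above; reordering the sequence (here, sorting it) leaves the total unchanged, illustrating that the code length does not depend on the order of the symbols.

from math import log2

def prequential_length(seq, alphabet, init=1.0):
    # Sum of -log2 P(next symbol), with counts updated after each symbol is encoded.
    counts = {x: init for x in alphabet}
    total = init * len(alphabet)
    bits = 0.0
    for x in seq:
        bits += -log2(counts[x] / total)
        counts[x] += 1
        total += 1
    return bits

seq = "baeaecdaeeccbc"
print(prequential_length(seq, "abcde"))           # about 35.2 bits
print(prequential_length(sorted(seq), "abcde"))   # same total: the order does not matter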

More recent advances in MDL and model selection theory introduced Bayesian and normalised maximum likelihood (NML) codes. Like prequential codes, both are one-part codes. In contrast to crude codes, such refined codes remove the need to explicitly encode the parameters of the distribution, thereby avoiding the associated bias, and have useful properties, including optimality guarantees. In particular, the term universal is commonly used to refer to a code that, to put it simply, performs essentially as well as the best-fitting code for the input, for any possible input. This use of universal should not be confused with another common use, referring to codes for representing integers, which we present next. On the downside, refined one-part codes are not easily explained in terms of practical encoding and are not always computationally feasible.

Finally, a universal code for integers is a prefix-free code that maps positive integers to binary codewords. Such a code can be used to encode positive integers when an upper bound cannot be determined a priori. Elias codes, which come in three variants (namely the Elias gamma, Elias delta, and Elias omega codes), are universal codes commonly used for this purpose. For instance, an integer \(x \ge 1\) is encoded using the Elias gamma code as \(\lfloor \log _2(x) \rfloor \) zeros followed by the binary representation of x. These codes penalise large values, since larger integers are assigned longer codewords.
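Following that description, a short sketch of an Elias gamma encoder (for illustration; real implementations also cover the delta and omega variants):

def elias_gamma(x):
    # floor(log2(x)) zeros, followed by the binary representation of x (for x >= 1).
    return "0" * (x.bit_length() - 1) + format(x, "b")

for x in [1, 2, 5, 12]:
    print(x, elias_gamma(x))   # 1 -> 1, 2 -> 010, 5 -> 00101, 12 -> 0001100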

2.2 Applying the MDL principle in pattern mining

It is important to note that when applying the MDL principle, compression is used as a tool to compare models, rather than as an end in itself. In other words, we do not care about actual descriptions, only about their lengths. Furthermore, we care less about the absolute length of the description achieved with a particular model than about how it compares to the lengths achieved with other candidate models. This has a few important consequences.

First, as the code does not need to be usable in practice, the requirement of integer code lengths can be lifted, allowing for finer comparisons. In particular, this means that for a discrete input alphabet \({\mathcal {A}}\) with probability distribution P, the most reasonable choice for assigning codewords is a prefix-free code C such that for all \(x \in {\mathcal {A}}\), \(L_C(x) = -\log _2\left( P(x)\right) \). This is often referred to as Shannon–Fano coding, although no underlying practical code is actually required, or even considered. Note that the entropy of P corresponds to the expected number of bits needed to encode in this way an outcome generated by P.

Second, elements that are necessary in order to make the encoding system work but are common to all candidate models might be left out, since they play no role in the comparison. Third, to ensure a fair comparison, only lossless codes are considered. Indeed, comparing models on the basis of their associated description lengths would be meaningless if we were unable to disentangle the savings resulting from a better ability to exploit the structure of the data from those resulting from increased distortion in the reconstruction. In some cases, this simply means that corrections that must be applied to the decoded data in order to achieve perfect reconstruction are considered as part of the description.

Though one-part codes might be used for components, at the highest level most proposed approaches use the crude two-part MDL, which requires first explicitly describing the model M, then describing the data using that model, rather than refined one-part MDL, which corresponds to using the entire model class \({\mathcal {M}}\) to describe the data. That is because the aim is not just to know how much the data can be compressed, but how that compression is achieved, by identifying the associated model. The overall description length is the sum of the lengths of the two parts, so the best model \(M \in {\mathcal {M}}\) to explain the data D is the one that minimises \(L(M,D) = L(M) + L(D\mid {}M)\), where L denotes description lengths in bits, as above. The two parts can be understood as capturing respectively the complexity of the model and the fit of the model to the data, and the MDL principle as a way to strike a balance between the two.

One can also view this from the perspective of probabilistic modeling. Consider a family of probability distributions parameterised by some set \(\Theta \), that is, \({\mathcal {M}} = \{ p_{\theta }, \theta \in \Theta \}\), where each \(p_\theta \) assigns probabilities to datasets. In addition, consider a distribution \(\pi \) over the elements of \({\mathcal {M}}\). In this context, in accordance with the Bayesian framework, the best explanation for a dataset D is the element of \({\mathcal {M}}\) minimising

$$\begin{aligned} -\log \pi (\theta ) - \log p_{\theta }(D)\,, \end{aligned}$$

where the two terms are the prior and the negative log likelihood of the data, respectively. When using a uniform prior, this means selecting the maximum likelihood estimator. Since codes can be associated to probability distributions, we can see the connection to the two-part MDL formulation above, where the term representing the description length of the model, L(M), corresponds to the prior, and the term representing the description length of the data given the model, \(L(D\mid {}M)\), corresponds to the likelihood.

The Bayesian Information Criterion (BIC) is a closely related model selection method that aims to find a balance between the fit of the model, measured in terms of the likelihood function, and the complexity of the model, measured in terms of the number of parameters k. Denoting as n the number of data points in D, it can be written as \(k \log n - 2 \log p_{\theta }(D)\).

When applying the MDL principle to a pattern mining task, the models considered consist of patterns, capturing structure and regularities in the data. Depending on the type of data and the application at hand, one must decide what kind of patterns are relevant, i.e. (i) a pattern language must be chosen. The model class \({\mathcal {M}}\) then consists of all possible subsets of patterns from the chosen pattern language. Note that we generally use the term model to refer to a specific collection of patterns and the single corresponding probability distribution over the data, whereas from the statistical modeling perspective, model typically refers to a family of probability distributions. Next, (ii) an encoding scheme must be designed, i.e. a mechanism must be engineered to encode patterns of the chosen type and to encode the data by means of such patterns. Finally, (iii) a search algorithm must be devised to explore the space of patterns and identify the best set of patterns with respect to the encoding, i.e. the set of patterns that results in the shortest description length.

2.3 Dictionary-based vs. block-based strategies

The crude two-part MDL requires explicitly specifying the model (the L(M) part). This means designing an ad-hoc encoding scheme to describe the patterns. This is both a blessing and a curse, because it gives leeway to introduce some bias, i.e. penalise properties of patterns that are deemed less interesting or useful than others by making them more costly in terms of code length. But it can involve some difficult design choices and lead to accusations of “putting your fingers on the scale”.

When considering substructure patterns, encoding the data using the model (the \(L(D\mid {}M)\) part) can be fairly straightforward, simply replacing occurrences of the patterns in the data by their associated codewords. Knowing how many times each pattern X is used in the description of the data D with model M, denoted as \({{\,\mathrm{\hbox {usage}}\,}}_{D, M}(X)\), X can be assigned a codeword of length

$$\begin{aligned} L(X) = -\log _2 \left( \frac{{{\,\mathrm{\hbox {usage}}\,}}_{D, M}(X)}{\sum _{Y \in M} {{\,\mathrm{\hbox {usage}}\,}}_{D, M}(Y)} \right) \end{aligned}$$

using Shannon–Fano coding. The design choice of how to cover the data with a given set of patterns, dealing with possible overlaps between patterns in M, determines where each pattern is used, defining \({{\,\mathrm{\hbox {usage}}\,}}_{D, M}(X)\). The requirement that the encoding be lossless means that elementary patterns, e.g. singleton itemsets, must be included in the model, to ensure complete coverage. In this case, encoding the model (the L(M) part) consists in providing the mapping between patterns and codewords, typically referred to as the code table. That is, for each pattern in turn, describing it and indicating its associated codeword. Such a code table or dictionary-based strategy, which corresponds to frequent pattern mining approaches, is a common way to proceed.
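As a hedged illustration of this usage-based code (the pattern names and usage counts below are arbitrary, not taken from any specific example), the codeword lengths follow directly from the usages:

from math import log2

# Hypothetical usages of the patterns selected in some model M for a dataset D.
usage = {"ADE": 4, "BC": 3, "F": 2, "A": 1, "B": 1}
total = sum(usage.values())

code_length = {X: -log2(u / total) for X, u in usage.items()}
for X, l in sorted(code_length.items(), key=lambda p: p[1]):
    print(X, round(l, 3))   # the more a pattern is used, the shorter its codeword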

An alternative strategy is to divide the data into blocks, that might or might not be allowed to overlap, each of which is associated with a specific probability distribution over the possible values and should be as homogeneous as possible. The values within each block are then encoded using a code optimised for the corresponding distribution. In that case, encoding the model consists in indicating the limits of each block and the associated probability distribution. Such a block-based strategy corresponds to segmentation, biclustering and tiling approaches.

Fig. 2 A toy binary dataset, with six columns (i.e. items) denoted A–F, nine rows (i.e. transactions) denoted (1)–(9), containing twenty-four ones (i.e. positive entries or item occurrences) represented as black squares

These two main strategies, dictionary-based and block-based, which use respectively substructure and block patterns, are an important distinguishing factor that we use to categorise approaches, as depicted in Fig. 1 with dotted and dashed lines, respectively. We further illustrate and contrast these strategies on a toy binary dataset shown in Fig. 2. The dataset has six columns (i.e. items) denoted A–F, nine rows (i.e. transactions) denoted (1)–(9), and contains twenty-four ones (i.e. positive entries or item occurrences) represented as black squares. In particular, the approaches delineated here to illustrate the two strategies are based on the work of Siebes et al. (2006) and Smets and Vreeken (2012) (cf. Sect. 3), on one hand, and of Chakrabarti et al. (2004) (cf. Sect. 4.2), on the other hand.

Fig. 3 Dictionary-based strategy, examples on the toy binary dataset of Fig. 2. The model M consists of a code table associating patterns, here itemsets, to codewords (left). The data D is encoded using the model by replacing occurrences of the itemsets by the associated codewords (right). Coloured blocks represent prefix-free codewords assigned to items (green) and itemsets (blue), their width is proportional to the code length

Examples of modeling the toy binary dataset using a dictionary-based approach are provided in Fig. 3. The simplest model is considered first (i), which contains as its patterns all singleton itemsets from the dataset and is often referred to as the standard code table (ST). A non-trivial code table is considered next (ii). In both cases, the model is represented as a code table associating patterns, here itemsets, to codewords. Encoding the model means encoding each pair. On one hand, itemsets are specified by listing the items they contain. This uses Shannon–Fano coding over the alphabet of items associated to their frequency of occurrence in the data, which is assumed to be shared knowledge. On the other hand, codewords are assigned using Shannon–Fano coding over the alphabet of itemsets included in the code table, associated to their usage.

In Fig. 3, the prefix-free codewords assigned to items and to itemsets are represented by coloured blocks in shades of green and blue, respectively. The width of a block is proportional to the length of the represented codeword. For instance, the first row of the non-trivial code table (bottom-left quadrant of Fig. 3) specifies the first itemset in the code table, in this case ADE, listing the three items that constitute it, A, D and E, using the items’ codewords (green blocks), then provides the codeword assigned to ADE by the code based on the usage of itemsets selected in this model (blue block).

Note that for the standard code table, these two prefix-free codes are the same because the standard code table consists of all singleton items and their usage is precisely their frequency of occurrence in the data. Therefore, all codewords in the top half of Fig. 3 are represented as green blocks. For the same reason, the length of the codeword associated to item x by the first code, which does not depend on the code table, is denoted as \(L_{ST}(x)\), whereas the length of the codeword associated to itemset X by the second code, which is different for different code tables, is denoted as \(L_{CT}(X)\).

Then, encoding the data using the model simply means replacing itemset occurrences by the corresponding codewords.

In summary, the overall description length can be computed as

$$\begin{aligned} L(M, D) = \underbrace{\sum _{X \in M} \Big ( \sum _{x \in X} L_{ST}(x) + L_{CT}(X) \Big )}_{L(M)} + \underbrace{\sum _{t \in D} \sum _{X \in {{\,\mathrm{\hbox {cover}}\,}}(t)} L_{CT}(X)}_{L(D\mid {}M)}\;, \end{aligned}$$
(1)

where \({{\,\mathrm{\hbox {cover}}\,}}(t)\) denotes the set of itemsets from the code table used to encode transaction t.

In this example, the standard code table and the non-trivial code table yield an overall description length of 92.530 bits and 63.333 bits, respectively. By identifying items that occur frequently together, the latter results in a shorter description length, and one can therefore conclude that it constitutes a better model for the dataset, according to the MDL criterion.
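A hedged sketch of how such an overall description length can be computed, given a code table with its usages, the item frequencies, and the cover of each transaction (all assumed to be provided by the surrounding algorithm; the exact figures, such as the 92.530 and 63.333 bits above, depend on the specific cover found):

from math import log2

def total_length(code_table, item_freq, covers):
    # code_table: dict itemset -> usage count (itemsets as strings or tuples of items);
    # item_freq: dict item -> frequency of occurrence in D;
    # covers: for each transaction, the list of itemsets used to encode it.
    usage_total = sum(code_table.values())
    item_total = sum(item_freq.values())
    L_CT = {X: -log2(u / usage_total) for X, u in code_table.items() if u > 0}
    L_ST = {x: -log2(f / item_total) for x, f in item_freq.items()}

    L_model = sum(sum(L_ST[x] for x in X) + L_CT[X] for X in L_CT)
    L_data = sum(L_CT[X] for cover in covers for X in cover)
    return L_model + L_data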

A similar approach can be built by replacing static Shannon–Fano coding with dynamic prequential coding. In that case, encoding the model only requires listing the itemsets, i.e. only the first column of the code table is needed. Then, the data is encoded by enumerating the itemset occurrences using prequential coding (cf. Sect. 2.1). In practice, the process of updating the codewords does not actually need to be carried through. Luckily, the description length of the data can be calculated with a (not so simple) formula that does not depend on the order in which the itemsets are transmitted (see Budhathoki and Vreeken 2015).

Fig. 4 Block-based strategy, examples on the toy binary dataset of Fig. 2. The model M partitions the dataset D into blocks, each associated to a specific probability distribution over the values (left), which is used to encode the entries (right). More intense shades of orange represent higher probabilities of ones within the corresponding block

Examples of modeling the toy binary dataset using a block-based approach are provided in Fig. 4. The simplest model is considered first (i), which consists of a single block. A non-trivial model dividing the dataset into six blocks is considered next (ii). The approach illustrated here requires the patterns to form a partition of the data. Therefore, any considered model consists of a set of non-overlapping blocks covering the entire dataset and, more specifically, is such that the rows and the columns are divided into disjoint groups. This requirement means that each row and each column belongs to exactly one group, a fact that can be exploited when designing the encoding, as explained below. Each block is associated to a specific probability distribution over the values occurring within it. In this case, the values are zero and one, corresponding to a Bernoulli distribution. Encoding the model means specifying the groups of rows and of columns that define the blocks, and the probability associated to each one. In this approach, all blocks are induced by a unique partition of the rows and of the columns, which can be specified by defining an order over the rows (resp. columns) and indicating the number of rows (resp. columns) in each group. Finally, the number of ones in each block is transmitted, allowing to compute the corresponding probability. This is achieved through a combination of fixed-length and universal codes.

Next, we delve deeper into the details of the encoding scheme, to illustrate the choices that are involved in its design. We let

  • m and n denote respectively the number of rows and columns in the dataset,

  • k and l denote respectively the number of row and column groups,

  • \(m_i\) and \(n_j\) denote respectively the number of rows and columns in block \(B_{i,j}\),

  • \(\gamma _v(B_{i,j})\) denote the number of entries in block \(B_{i,j}\) equal to v, and

  • \({{\,\mathrm{\hbox {L}_{{\mathbb {N}}}}\,}}(x)\) denote the MDL optimal universal code length for integer x (footnote 3).

The formula for computing the overall description length is provided in Eq. 2. The number of rows and the number of columns are transmitted using universal coding, since these values are not bounded a priori, and the order of rows (resp. of columns) is then specified by listing row (resp. column) identifiers in turn, using a fixed-length code (cf. part (a) of Eq. 2). In fact, this part of the encoding is independent of the model and has a constant length for a given dataset. Therefore, it does not impact the comparison and can be ignored. The number of row groups k (resp. column groups l) could be transmitted using a fixed-length code \(\log _2(m)\) (resp. \(\log _2(n)\)), since there cannot be more groups than there are rows (resp. columns). However, universal coding is used instead, to favour partitions with fewer groups. Then, assuming that the numbers of rows and of columns in the groups are sorted and transmitted by decreasing order, upper bounds \(m^{*}_i\) and \(n^{*}_j\) on \(m_i\) and \(n_j\), respectively, can be derived given shared knowledge since already transmitted values constrain the remaining ones: \(m^{*}_i = \left( \sum _{t=i}^{k} m_t \right) - k +i\), for \(i = 1,\dots k-1\) and \(n^{*}_j = \left( \sum _{t=j}^{l} n_t \right) - l +j\), for \(j = 1,\dots l-1\).

These upper bounds are used to transmit the numbers of rows and of columns in the groups with fixed-length codes (cf. part (b) of Eq. 2). The number of ones in each block can also be transmitted using a fixed-length code, since it takes a value between zero and the number of entries in the block (cf. part (c) of Eq. 2).

The data is then encoded using the model. This is done by listing the entries in each block with a prefix-free code adjusted to the probability distribution within the block (cf. part (d) of Eq. 2).

In summary, the overall description length can be computed as

$$\begin{aligned} L(M, D) =&\; \underbrace{{{\,\mathrm{\hbox {L}_{{\mathbb {N}}}}\,}}(m) + {{\,\mathrm{\hbox {L}_{{\mathbb {N}}}}\,}}(n) + m \log _2(m) + n \log _2(n)}_{(a)} \\&+ \underbrace{{{\,\mathrm{\hbox {L}_{{\mathbb {N}}}}\,}}(k) + {{\,\mathrm{\hbox {L}_{{\mathbb {N}}}}\,}}(l) + \sum _{i=1}^{k-1} \log _2(m^{*}_i) + \sum _{j=1}^{l-1} \log _2(n^{*}_j)}_{(b)} \\&+ \underbrace{\sum _{i=1}^{k} \sum _{j=1}^{l} \log _2(m_i n_j + 1)}_{(c)} + \underbrace{\sum _{i=1}^{k} \sum _{j=1}^{l} \sum _{v \in \{0,1\}} -\gamma _v(B_{i,j}) \log _2 \left( \frac{\gamma _v(B_{i,j})}{m_i n_j} \right) }_{(d)}\;. \end{aligned}$$
(2)

In Fig. 4, shades of orange are used to represent probability distributions within the blocks, with more intense shades representing higher probabilities of ones. In this example, the single-block model and the six-block model yield an overall description length of 118.204 bits and 113.433 bits, respectively. By partitioning the dataset into blocks that are particularly dense or particularly sparse, the latter results in a shorter description length, and one can therefore conclude that it constitutes a better model for the dataset, according to the MDL criterion.
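A hedged sketch of the model-dependent parts of this computation (parts (b), (c) and (d); part (a) is constant for a given dataset and omitted, as noted above). The per-block counts of ones and the group sizes, sorted in decreasing order, are assumed to be given, and a simple Elias-gamma-style length is used as a stand-in for \({{\,\mathrm{\hbox {L}_{{\mathbb {N}}}}\,}}\).

from math import log2

def L_N(x):
    # Crude stand-in for the universal code length for integers (Elias-gamma-like).
    return 2 * (x.bit_length() - 1) + 1

def block_description_length(row_groups, col_groups, ones):
    # row_groups, col_groups: group sizes, sorted in decreasing order;
    # ones[i][j]: number of ones in block B_{i,j}.
    k, l = len(row_groups), len(col_groups)
    bits = L_N(k) + L_N(l)
    # Group sizes, encoded with fixed-length codes given the shrinking upper bounds.
    for sizes in (row_groups, col_groups):
        remaining, g = sum(sizes), len(sizes)
        for i, s in enumerate(sizes[:-1]):
            bits += log2(remaining - (g - i - 1))
            remaining -= s
    # Number of ones in each block, then the entries themselves.
    for i, m_i in enumerate(row_groups):
        for j, n_j in enumerate(col_groups):
            area = m_i * n_j
            bits += log2(area + 1)
            for v in (ones[i][j], area - ones[i][j]):   # counts of ones and of zeros
                if v > 0:
                    bits += -v * log2(v / area)
    return bits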

Removing the partition constraint and allowing overlaps between the block patterns would require modifying the encoding, since we could no longer assume that each row and each column belongs to a single group.

2.4 Algorithms

The MDL principle provides a basis for designing a score for patterns, but no way to actually find the best collection of patterns with respect to that score. The space of candidate patterns, and even more so the space of candidate pattern sets, is typically extremely large, if not infinite, and rarely possesses any useful structure. Hence, exhaustive search is generally infeasible, heuristic search algorithms are employed, and one must be satisfied with finding a good set of patterns rather than an optimal one. Mainly, algorithms can be divided between (i) approaches that generate a large collection of patterns then (iteratively) select a small set out of it, and (ii) approaches that generate candidates on-the-fly, typically in a levelwise manner, possibly in an anytime fashion.

The former approaches are typically less efficient, since they generate many more candidates than necessary, but constitute a useful basis for building a proof-of-concept. Because recomputing costs following the addition or removal of a candidate pattern is often prohibitively expensive, an important component of the heuristic search algorithms consists in efficiently and accurately bounding these costs.

With a dictionary-based strategy, the simplest model consists of all separate basic elements. The search for candidates can thus start from this model and progressively combine elements. Vice versa, with a block-based strategy, the simplest model consists of a single block covering the entire dataset. The search for candidates can thus start from this model and progressively split elements. Intuitively, the two main strategies lend themselves respectively to bottom-up and top-down iterative exploration algorithms.

In summary, to apply the MDL principle to a pattern mining task, one might proceed in three main steps. First, define the pattern language, deciding what constitutes a structure of interest given the data and application considered. Second, define a suitable encoding scheme, designing a system to encode the patterns and to encode the data using such patterns. Third, design a search algorithm, allowing to identify in the data a collection of patterns that yields a good compression under the chosen encoding scheme.

Our main focus here is on the second step, the design of an encoding scheme for the different types of patterns, which is the core of the MDL methodology and its distinctive ingredient. For the most part, we do not discuss search algorithms and are not concerned with issues of performance, which are more generic aspects of pattern mining methodologies.

3 Mining itemsets with Krimp & Co.

A transactional database consists of a collection of sets, called transactions, over a universe of items. The prototypical example for this type of data comes from market-basket analysis, which is also where some of the terminology is borrowed from. Alternatively, a transactional database can be represented as a binary matrix. Frequent itemset mining, that is, finding items that frequently co-occur in a transactional database, is a central task in data mining (cf. Sect. 1).

3.1 Krimp

The introduction by Siebes et al. in 2006 of an MDL-based algorithm for mining and selecting small but high-quality collections of itemsets sparked a productive line of research, including algorithmic improvements, adaptations to different tasks, and various applications of the original algorithm. The algorithm, soon dubbed Krimp (van Leeuwen et al. 2006), Dutch for “to shrink”, is a prime example of a dictionary-based strategy (cf. Sect. 2.3), illustrated in Fig. 3.

Through an evaluation on a classification task, van Leeuwen et al. (2006) show that the selected itemsets are representative. Specifically, considering a labelled training dataset, Krimp is applied separately on the transactions associated to each class to mine a code table. A given test transaction can then be encoded using each of the code tables, and assigned to the class that corresponds to the shortest code length.
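A hedged sketch of this classification scheme; the code tables and the encode_length helper (which covers a transaction with a given code table and sums the lengths of the codewords used) are assumed to be provided by the Krimp machinery.

def classify(transaction, code_tables, encode_length):
    # code_tables: dict mapping each class label to the code table induced
    # on the training transactions of that class.
    # Assign the transaction to the class whose code table encodes it most concisely.
    return min(code_tables, key=lambda c: encode_length(transaction, code_tables[c]))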

The Krimp algorithm has subsequently been presented with varying depths of technicality (van Leeuwen et al. 2009b; Vreeken et al. 2011; Siebes 2014; van Leeuwen and Vreeken 2014). It is a cornerstone of a couple of doctoral dissertations (Vreeken 2009; van Leeuwen 2010) and also helped spark a more conceptual discussion (Siebes 2012).

3.2 Algorithmic improvements

Works building on Krimp include several algorithmic improvements. In particular, the Slim algorithm of Smets and Vreeken (2012) modifies Krimp and greedily generates candidates by merging patterns, instead of evaluating candidates from a pre-mined list. The example presented in Fig. 3(ii) (cf. Sect. 2.3) was actually obtained by running the Slim algorithm on the toy dataset of Fig. 2. Hess et al. (2014) propose a data structure similar to the FP-tree to facilitate the recomputation of usages when the Krimp code table is updated, making the mining algorithm more efficient. Sampson and Berthold (2014) apply widening, i.e. diversification of the search, to Krimp.

3.3 Finding differences and anomalies

The analysis of differences between databases and the detection of anomalies are derivative tasks that have attracted particular attention. Vreeken et al. (2007a) use Krimp to measure differences between two databases, by comparing the length of the description of a database obtained with a code table induced on that database versus one induced on the other database. The coverage of individual transactions by the selected itemsets, and the specific code tables obtained are also compared. The DiffNorm algorithm introduced by Budhathoki and Vreeken (2015) aims to encode multiple databases at once without redundancy, and makes it possible to investigate the differences and similarities between the databases by inspecting the obtained code table. As a major contribution, Budhathoki and Vreeken (2015) improve the encoding by replacing the Shannon–Fano code used in the original Krimp with a prequential plug-in code (cf. Sect. 2.1).

Smets and Vreeken (2011) use Krimp for outlier detection by looking at how many bits are needed to encode a transaction. If this number is much larger than expected, the transaction is declared anomalous. In addition, the encodings of transactions can be scrutinised to obtain further insight into how they depart from the rest. Akoglu et al. (2012b) design an algorithm that detects as anomalies transactions having a high encoding cost. Their proposed algorithm mines multiple code tables, rather than a single one as in Krimp, and handles categorical data. Bertens et al. (2015) (also Bertens et al. 2017) propose a method to detect anomalous co-occurrences based on the Krimp/Slim code tables.

3.4 Mining rule sets

Going beyond itemsets, a closely related task is to mine rules. van Leeuwen and Galbrun (2015) propose to mine association rules across a two-view transactional dataset, such that one view can be reconstructed from the other, and vice versa. Fischer and Vreeken (2019) instead consider a unique set of items and aim to mine associations rules that allow to reconstruct the dataset, enabling corrections in order to increase the robustness of the results. They then apply the approach to learn rules about activation patterns in neural networks (Fischer et al. 2021).

Aoga et al. (2018) present a method to encode a binary label associated to each transaction, using the original transactions and a list of rules, each associated to a probability that the target variable holds true. Proença and van Leeuwen (2020a) (also Proença and van Leeuwen 2020b) consider a similar task, but with multiple classes and targeted towards predictive rather than descriptive rules, later looking for rules that capture deviating groups of transactions, i.e. dealing with the subgroup discovery task (Proença et al. 2020, 2021b). Besides binary attributes, i.e. items, Proença et al. (2020) also consider nominal and numerical attributes, and aim to predict a single numerical target, modeled using normal distributions, instead of a binary target. Proença et al. (2021a) further extend the approach to more complex nominal and numerical targets by resorting to normalised maximum likelihood (NML) and Bayesian codes, respectively.

Whereas the former two output a set of rules, these latter methods return a list of rules, such that at most one rule, the first valid rule encountered in the list, applies to any given transaction. In all cases, the dataset, or part of it, is assumed to be given and the aim is to reconstruct a target variable, or the rest of the dataset.

3.5 Other adaptations

Further work also includes extending the Krimp approach to more expressive patterns, such as patterns mined from relational databases (Koopman and Siebes 2008, 2009), and using Krimp for derivative tasks.

van Leeuwen and Siebes (2008) use Krimp to detect changes in the distribution of items in a streaming setting. If a code table induced from an earlier part of the stream no longer provides good compression as compared to a code table induced from a more recent part of the stream, it is a signal that the distribution has changed. Bonchi et al. (2011) extend the approach to the probabilistic setting, where the occurrence of an item in a transaction is associated to a probability, aiming to find a collection of itemsets that compress the data well in expectation.

Given an original database, Vreeken et al. (2007b) use Krimp to generate a synthetic database similar to the original one, for use in the context of privacy-preserving data mining. The code table is induced on the original database, itemsets are then sampled from it and combined to generate synthetic transactions. Vreeken and Siebes (2008) use Krimp for data completion. In a way, this turns the MDL principle on its head. Starting from an incomplete database, rather than looking for the patterns that compress the data best, the proposed approach looks for the data that is best compressed by the patterns and considers that to be the best completion. Instead of a single code table, Siebes and Kersten (2011) look for a collection of code tables, that capture the structure of the dataset at different levels of granularity. Representing a categorical dataset as transactional data by mapping each value of an attribute to a distinct item implies that each transaction contains exactly one item for each categorical attribute, and hence that all transactions have the same length. Siebes and Kersten (2012) consider the problem of smoothing out the small scale structure from such datasets. That is, they aim to replace entries in the data so that its large scale structure is maintained but it can be compressed better.

3.6 Applications

The Krimp algorithm has also been employed to tackle problems in different domains, including clustering tagged media (van Leeuwen et al. 2009a), summarising text (Vanetik and Litvak 2017, 2018), detecting malware (Asadi and Varadharajan 2019a, b) and analysing the Semantic Web (Bobed et al. 2019).

4 Tabular data (continued)

Transactional data can be seen as a binary matrix or table, and is hence a form of tabular data. Data tables might also contain categorical or real-valued attributes, which can be either turned into binary attributes as a pre-processing or handled directly with dedicated methods.

4.1 More itemsets

Beside Krimp and algorithms inspired from it, different approaches have been proposed to mine itemsets from binary matrices using the MDL principle.

Heikinheimo et al. (2009) focus on finding collections of low-entropy itemsets, typically even more compact than those obtained with Krimp. On the other hand, the main objective of Mampaey and Vreeken (2010) is to provide a summary of the dataset in the form of a partitioning of the items. For each part of the partition, codewords are associated to the different combinations of the items it contains, and used to encode the corresponding subset of the data. In the Pack algorithm, proposed by Tatti and Vreeken (2008), the code representing an item is made dependent on the presence/absence of other items in the same transaction. This can be represented as decision trees whose intermediate nodes are items and leaves contain code tables for other items. Fischer and Vreeken (2020) introduce a rich pattern language that can capture both co-occurrences and mutual exclusivity of items.

Mampaey et al. (2012) present a variant of their MTV itemset mining algorithm (Mampaey et al. 2011) where the MDL principle is used to choose the collection of itemsets that yields the best model, as an alternative to the Bayesian Information Criterion (BIC).

4.2 Blocks in binary data

A different type of patterns that can be mined from binary matrices consists of blocks defined by a collection of rows and columns with a homogeneous density of ones, sometimes known as tiles or as biclusters, constituting the basis of block-based strategies (cf. Sect. 2.3). Chakrabarti et al. (2004) propose a method to partition the rows and columns of the matrix into subsets that define homogeneous blocks. The example presented in Fig. 4 follows this approach. Tatti and Vreeken (2012a) introduce the Stijl algorithm to mine a hierarchy of tiles from a binary matrix. The adjectives geometric or spatial typically refer to cases where the order of the rows and columns of the data matrix is meaningful (e.g. Papadimitriou et al. 2005; Faas and van Leeuwen 2020).

4.3 Formal Concept Analysis (FCA)

The field of Formal Concept Analysis focuses on basically the same problem as itemset mining, but from a slightly different perspective and with its own formalism. Some steps have been taken towards applying and further developing MDL-based methods within this framework (Otaki and Yamamoto 2015; Yurov and Ignatov 2017). In particular, as a topic in her doctoral dissertation, Makhalova (2021) extensively studied MDL-based itemset mining methods from the perspective of FCA (also Makhalova et al. 2018a, b, 2019b, 2021).

4.4 Boolean Matrix Factorisation (BMF)

Factorising a matrix consists in identifying two matrices, the factors, so that the original matrix can be reconstructed as their product. One challenge is to find the right balance between the complexity of the decomposition (generally measured by the size of the factors) and the accuracy of the reconstruction, which is where the MDL principle comes in handy (Miettinen and Vreeken 2011, 2014; Hess et al. 2017). Whereas the other approaches permit reconstruction errors on both types of entries, i.e. on zeros as well as on ones, Makhalova and Trnecka (2019) do not allow factors to cover zero entries, considering so-called “from-below” factorisations (also Makhalova and Trnecka 2021). Lucchese et al. (2014) (also Lucchese et al. 2010b, a) propose an MDL-based score to compare factors of a fixed size.

An example of a factorisation of the toy binary dataset from Fig. 2 is provided in Fig. 5. The model consists of a pair of factor matrices whereas the data is encoded as a mask of corrections. To reconstruct the original data, the Boolean matrix product of the factors is computed and the mask of corrections is applied to the resulting matrix.

When applied to a Boolean matrix, factorisation shares some similarities with itemset mining, as it aims to identify items that occur (or do not occur) frequently together. That is, the two factor matrices can be interpreted as specifying itemsets and indicators of occurrence, respectively. On the other hand, the factors can also be interpreted as specifying possibly overlapping fully dense blocks. A mask applied to the reconstructed data provides a global error correction mechanism. Thus, Boolean matrix factorisation can be seen as a hybrid approach.
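A hedged numpy sketch of the reconstruction step described above, i.e. computing the Boolean matrix product of the factors and then flipping the cells indicated by the correction mask:

import numpy as np

def reconstruct(left, right, mask):
    # left: m x k factor, right: k x n factor, mask: m x n corrections, all with 0/1 entries.
    product = (left.astype(int) @ right.astype(int)) > 0   # Boolean matrix product
    return np.logical_xor(product, mask.astype(bool))      # flip the masked cells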

Fig. 5 Boolean Matrix Factorisation, example on the toy binary dataset of Fig. 2. The model M consists of a pair of factor matrices (left) and the data D is encoded as a mask of corrections (right). To reconstruct the original data, first the Boolean matrix product of the factors is computed, resulting in the matrix with two fully dense blocks \(B_1\) and \(B_2\). Then, the mask of corrections is applied to this matrix, that is, the value of the cells indicated in the mask (grey squares) are flipped. Note that the blocks drawn with the hatch pattern depict the intermediate step of the reconstruction and are not actually encoded as such

4.5 Categorical data

One way to mine datasets involving categorical attributes is to binarise them and then apply, for instance, an itemset mining algorithm. However, binarisation entails a loss of information and dedicated methods can hence offer a better alternative.

Mampaey and Vreeken (2013) introduce an approach to detect correlated attributes, i.e. such that the different categories occur in a coordinated manner, whereas He et al. (2014) present a subspace clustering method, i.e. look not only for groups of attributes, but also corresponding groups of rows where coordinated behaviour occurs.

4.6 Numerical data

A numerical data table containing m columns can be seen as a collection of points in an m-dimensional space. In pattern mining, numerical data is often handled by applying discretisation, which requires partitioning the data into coherent blocks. While it can be seen as a data exploration task in its own right, providing an overview of the dataset and the distribution of values, discretisation often constitutes a pre-processing task to allow applying algorithms that can handle only discrete input data. Yet, choosing good parameters for the discretisation can be difficult, and its quality can have a major impact on later processing. Unsupervised discretisation, where no side information is available, stands in contrast to supervised discretisation, which takes into account class labels and often precedes a machine learning task. Here we focus on the former.

Kontkanen and Myllymäki (2007) propose a histogram density estimation method that relies on the MDL principle, formalised using the normalised maximum likelihood (NML) distribution. This method is employed by Kameya (2011) to discretise time-series seen as a collection of two-dimensional time–measurement data points, and extended by Yang et al. (2020) to two-dimensional numerical data, more in general. Along similar lines, Nguyen et al. (2014) aim to automatically identify a high-quality discretisation that preserves the interactions between attributes. Witteveen et al. (2014) extend the Kraft inequality (cf. Sect. 2.1) to numerical data and introduce an approach to find hyperintervals, i.e. multidimensional blocks.

Makhalova et al. (2019a) consider the problem of mining interesting hyperrectangles from discretised numerical data, and aim to design an encoding that accommodates overlaps between patterns (Makhalova et al. 2020, 2022).

Lakshmanan et al. (2002) formalise mining OLAP data, i.e. multidimensional datasets, as a problem of finding a cover in a multidimensional array containing positive, negative and neutral cells. The aim is then to find the most compact set of hyperrectangles that covers all positive cells, none of the negative cells, and no more than a chosen number of the neutral cells. The score is presented as a generalised MDL due to the tolerance on neutral cells. However, coverings are evaluated by simply counting cells, which does not actually adhere to the principle, generalised or otherwise.

5 Graphs

In this section, we consider approaches for mining graphs. At their simplest, graphs are undirected and unlabelled, but they can also come with directed edges, with node or edge labels, or be dynamic, that is, time-evolving. The main tasks consist in identifying nodes that have similar connection patterns to group them into homogeneous blocks and in finding recurrent connection substructures. These correspond respectively to block-based and dictionary-based strategies (cf. Sect. 2.3).

For illustrative purposes, we consider a toy graph, shown in Fig. 6, and delineate approaches that follow either strategy. The example shown in Fig. 7 illustrates the block-based strategy and follows the work of Chakrabarti (2004) (cf. Sect. 5.1), whereas the example shown in Fig. 8 illustrates the dictionary-based strategy and follows the work of Bariatti et al. (2020a) (cf. Sect. 5.5).

Fig. 6

A toy graph, with eight nodes denoted A–H, carrying labels from \(\{x, y, z\}\) and connected by undirected edges (right). The adjacency matrix (left) captures the structure of the graph, ignoring the labels

Fig. 7

Block-based strategy, example on the toy graph of Fig. 6. Ignoring labels and considering its adjacency matrix, the graph can be encoded in a very similar way to a binary tabular dataset (see Sect. 2.3 and Fig. 4). Furthermore, since the graph is undirected, and its matrix hence symmetric, it is enough to encode the lower triangular part of the matrix (depicted with solid lines and colour fill), from which the upper triangular part (depicted with dotted lines and hatch pattern) can be reconstructed. More intense shades of orange represent higher probabilities of ones within the corresponding block

Fig. 8

Dictionary-based strategy, example on the toy graph of Fig. 6. The idea is similar to the one for binary tabular datasets (see Sect. 2.3 and Fig. 3). However, it is not enough to simply replace the subgraph patterns by their assigned codewords (depicted as blue blocks); the information about how the subgraphs are connected also needs to be encoded. Here, this is done through ports associated to the patterns and their assigned codewords (depicted as tan blocks)

Through its adjacency matrix, a simple unlabelled graph can be represented as a binary table. Approaches from Sects. 3 and 4 can thus readily be used for mining graphs. On one hand, the problem of grouping nodes into blocks that constitute particularly dense subgraphs, or communities, is closely related to identifying particularly dense tiles in a binary matrix. On the other hand, approaches that follow a dictionary-based strategy and aim to identify substructures in the graphs share similarities with their counterparts for binary tabular data. However, it is not enough to simply replace the subgraph patterns by their assigned codewords. The information about how the subgraphs are connected also needs to be encoded, requiring more complex encoding schemes.

The survey by Liu et al. (2018a) covers graph summarisation methods in general, whereas Feng (2015) presents various information-theoretic graph mining methods, both of which include MDL-based methods for analysing graphs as a subset.

5.1 Grouping nodes into blocks

Rosvall and Bergstrom (2007) present an information-theoretic framework to identify community structure in networks by grouping nodes and propose to use the MDL principle to automatically select the number of groups in which to arrange the nodes.

Chakrabarti (2004) proposes to compress the adjacency matrix of a graph by grouping nodes into homogeneous blocks (see Fig. 7), with a top-down procedure to search for a good partition. Navlakha et al. (2008) similarly propose to build graph summaries by grouping nodes into supernodes, but with a bottom-up search procedure. A superedge linking two supernodes represents edges between all pairs of elementary nodes from either supernode (hence a supernode with a loop represents a clique). When reconstructing the original graph, after expanding the supernodes and superedges, some corrections must be performed, to add and remove spurious edges. Navlakha et al. (2008) let the cost of encoding a graph equal the number of superedges and edge corrections, ignoring the cost of the assignment of nodes to supernodes.

Khan et al. (2015b) (also Khan 2015; Khan et al. 2015a) work with essentially the same encoding, using locality-sensitive hashing (LSH) to identify candidates for merger, additionally considering node labels (Khan et al. 2014). Akoglu et al. (2012a) also aim at grouping nodes while taking into account node attributes.

A similar block summary approach for bipartite graphs is proposed by Feng et al. (2012), with a more complete encoding. Papadimitriou et al. (2008) also focus on bipartite graphs, but the obtained blocks are arranged into a hierarchy, while He et al. (2009) consider k-partite graphs.

In order to compress the adjacency matrix of an input graph more efficiently, He et al. (2011) look for nodes with similar connection patterns, corresponding to similar rows in the matrix, and encode the differences, possibly in a recursive manner. The approach is used to spot nodes with unusual connection patterns, which do not lend themselves to grouping.

LeFevre and Terzi (2010) propose a supernode summary involving superedge weights that represent the probability that an edge exists for each pair of nodes in the incident supernodes. In one variant of the problem, the MDL principle is used to choose the number k of supernodes that strikes the best balance between model complexity (k) and fit to the data (reconstruction error). Lee et al. (2020) also consider summarising graphs by grouping nodes together, but fix a maximum length for the description of the model, i.e. the hypernodes, and look for the summary that minimises the reconstruction error, measured as the length of the description of edge corrections.
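
As a rough illustration of the trade-off such block summaries optimise (a sketch under simplified assumptions, not the exact encoding of any of the methods above), the cells covered by a block, or supernode pair, can be charged according to the block's density, so that homogeneous blocks are cheap to describe while heterogeneous ones are expensive, with a small parameter cost per block keeping the number of blocks in check:

```python
import math

def block_code_length(n_ones, n_cells):
    """Illustrative per-block cost: encode the cells of a block with the optimal
    code implied by its density, n_cells * H(p) bits, plus an assumed parameter
    cost of log2(n_cells + 1) bits to transmit the density itself."""
    if n_cells == 0:
        return 0.0
    p = n_ones / n_cells
    if p in (0.0, 1.0):
        h = 0.0
    else:
        h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return n_cells * h + math.log2(n_cells + 1)

# A fully dense 5x5 block costs only the parameter term, while a half-filled
# block costs about one additional bit per cell.
print(block_code_length(25, 25), block_code_length(12, 25))
```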

Plant et al. (2020) use an MDL-inspired score to learn graph embeddings. That is, the aim is to project the nodes into a multi-dimensional space, so that the structure of the graph is preserved as much as possible and, more specifically, such that connected nodes are placed close to each other. To this end, the distance between any pair of nodes in the embedding is used to compute a probability that the nodes are connected, which, in turn, is used to encode the presence or absence of the corresponding edge. The quality of an embedding can then be measured by how well it allows the adjacency matrix to be compressed.
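
A minimal sketch of this idea follows; the link function and its parameters are assumptions made for illustration, not the exact formulation of Plant et al. (2020):

```python
import numpy as np

def adjacency_code_length(embedding, edges, n_nodes):
    """Sum of -log2 p for present edges and -log2 (1 - p) for absent ones, where
    p is an edge probability derived from the distance between the embeddings."""
    edge_set = {tuple(sorted(e)) for e in edges}
    total = 0.0
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            d = np.linalg.norm(embedding[i] - embedding[j])
            p = 1.0 / (1.0 + np.exp(d - 1.0))   # assumed link function: closer nodes are more likely connected
            p = min(max(p, 1e-12), 1.0 - 1e-12)
            total += -np.log2(p) if (i, j) in edge_set else -np.log2(1.0 - p)
    return total  # bits needed for the adjacency matrix given the embedding
```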

5.2 Grouping nodes into blocks in dynamic graphs

Sun et al. (2007) introduce a block summary approach for dynamic graphs, extended to multiple dimensions or contexts by Jiang et al. (2016). Araujo et al. (2014b) also propose a block summary approach for dynamic graphs, which they later extend to multiple dimensions or contexts as represented by qualitative labels on the edges (Araujo et al. 2016).

5.3 Finding hyperbolic communities

Instead of looking for blocks of uniform density and motivated by the observation that node degrees in real-world networks often follow a power-law distribution, Araujo et al. (2014a) propose the model of hyperbolic communities. The name refers to the fact that when nodes in such communities are ordered by degree, edges in the adjacency matrix mostly end up below a hyperbola.

Kang and Faloutsos (2011) (also Lim et al. 2014) decompose the input graph into hubs and spokes, with superhubs connecting the hubs recursively, and introduce a cost to evaluate how well the decomposition allows the graph to be compressed. This type of decomposition is proposed as an alternative to a decomposition into cliques, referred to as “cavemen communities”.

5.4 The Map Equation

Rosvall and Bergstrom (2008) propose a method to reveal the important connectivity structure of weighted directed graphs. The approach assigns codes to nodes in such a way that random walks over the graph can be described succinctly. Furthermore, nodes are partitioned into modules so that the codes for nodes are unique within each module but can be reused between modules. A walk over the graph can then be described using a combination of codes indicating transitions between modules and lists of the successive nodes encountered within each module. The resulting two-level summary of the graph maps its main structures and the connections between and within them, and the approach is therefore referred to as the Map Equation (Rosvall et al. 2009) or the Infomap algorithm.
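
In its standard two-level form, the description length minimised by the Map Equation, for a partition \(\mathsf{M}\) of the nodes into m modules, is commonly written as

\[ L(\mathsf{M}) = q_{\curvearrowright}\, H(\mathcal{Q}) + \sum_{i=1}^{m} p_{\circlearrowright}^{i}\, H(\mathcal{P}^{i}), \]

where \(q_{\curvearrowright}\) is the probability that the random walk switches between modules, \(H(\mathcal{Q})\) is the entropy of the codewords used for these transitions, \(p_{\circlearrowright}^{i}\) is the fraction of steps spent within module \(i\) (including its exit), and \(H(\mathcal{P}^{i})\) is the entropy of the codewords assigned to the nodes of module \(i\).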

Later, refinements and extensions of the method were proposed, to study changes in the connectivity structure over time (Rosvall and Bergstrom 2010), reveal multi-level hierarchical connectivity structure (Rosvall and Bergstrom 2011), support overlaps between modules (Viamontes Esquivel and Rosvall 2011), among others (Bohlin et al. 2014; De Domenico et al. 2015; Edler et al. 2017a; Emmons and Mucha 2019; Calatayud et al. 2019). In particular, the method has found application in the analysis of ecological communities (Edler et al. 2017b; Blanco et al. 2021; Rojas et al. 2021).

5.5 Identifying substructures

Cook and Holder (1994) (also Ketkar et al. 2005) propose the Subdue algorithm to mine substructures from graphs, possibly with labels, using the MDL principle. A substructure of the graph can be encoded and its occurrences in the graph replaced by a single node. This can be done recursively, generating a hierarchical summary of the original graph. The approach has two shortcomings. First, replacing the substructure by a single node does not preserve the complete information about the connections to neighbours. Second, the matching of substructures is done in an approximate way, with an arbitrary fixed cost, rather than a proper encoding of the reconstruction errors (using the MDL principle for this evaluation is left for future work). In addition, substructures are scored individually rather than in combination. The Subdue algorithm is used by Jonyer et al. (2004) for the induction of context-free grammars, and by Bloem (2013) in comparative experiments against practical data compression with the GZIP algorithm.

Bloem and de Rooij (2018) (also Bloem and de Rooij 2020) propose to use the MDL principle when evaluating the statistical significance of the presence of substructures in a graph.

The VoG algorithm presented by Koutra et al. (2014) (also Koutra et al. 2015) decomposes the graph into basic primitives such as cliques, stars, and chains, which can overlap on nodes (but not on edges). Error corrections are then applied, to add and remove spurious edges. This can be seen as a global use of primitives. Liu et al. (2015) (also Liu et al. 2016) use the MDL principle to compare the ability of VoG and graph clustering methods to generate graph summaries. Liu et al. (2018b) build on VoG and address some of its shortcomings, such as the bias towards star structures, the inability to exploit edge overlaps, and the dependency on candidate order. Goebl et al. (2016) introduce a similar approach, with some of the same primitives, but prohibiting overlaps with the aim of making visualisation and interpretation easier. The approach presented by Bariatti et al. (2020a) removes the limitation to a predefined set of primitives and considers labelled graphs (see Fig. 8). The authors later upgraded the approach by generating candidates on-the-fly, thereby providing an anytime mining algorithm (Bariatti et al. 2021), and proposed a visualisation tool for the obtained graph patterns (Bariatti et al. 2020b). This work forms the basis of a doctoral dissertation (Bariatti 2021).

The approach proposed by Coupette and Vreeken (2021) aims to highlight similarities and differences between graphs, and is akin in spirit to the DiffNorm algorithm for transactional data (cf. Sect. 3.3). It looks for a common model consisting of basic primitives, like those used in VoG, such that each graph can be reconstructed based on these primitives adjusted through parameters specific to the graph, as well as additional structures, where necessary.

Feng et al. (2013) exploit basic structures of graphs, like stars and triangles, to save on the encoding of the adjacency matrix. Such primitives assign a probability to the existence of an edge, which is used to encode it. Which primitive applies is determined in part based on structure information available so far, i.e. previously decoded. This can be seen as a local use of primitives.

Belth et al. (2020) learn relational rules which can be used to summarise knowledge graphs, involving typed edges and nodes.

5.6 Identifying substructures in dynamic graphs

Shah et al. (2015) (also Shah et al. 2017) extend the VoG approach to dynamic graphs. More specifically, they incorporate the temporal aspect of substructures appearing only at given time steps, across a range of contiguous time steps, periodically, or in a flickering fashion. Therefore, in addition to decomposing the graph into basic structures, one needs to indicate when these structures appear. The Mango algorithm by Saran and Vreeken (2019) also looks for predefined structures in a dynamic graph, aiming more specifically at tracking their evolution through time.

5.7 Finding pathways between nodes

Given a large graph, Akoglu et al. (2013) consider the problem of identifying a set of marked nodes. This can be done by listing the node identifiers or by navigating between nodes. The latter strategy requires choosing among the limited number of neighbours of each traversed node, rather than among all possible nodes in the graph, potentially leading to shorter descriptions. In particular, the problem formulated by Akoglu et al. (2013) consists in finding the best collection of trees spanning the marked nodes in the graph. The graph as such is not encoded; it is regarded as shared knowledge.

Prakash et al. (2014) similarly assume shared knowledge of the graph. The aim is then to transmit the starting points and spread of an epidemic through the graph over a sequence of time steps, assuming a “susceptible–infected” (SI) epidemic model.

6 Temporal data

In this section, we look at data where the attribute values come as a sequence, i.e. in a specific order. In particular, this order might correspond to time, in which case the data is called temporal. In some cases only the order matters, whereas in other cases absolute positions are associated to the values, such as timestamps in temporal data. In addition to time, spatial dimension(s) might be associated to the values, resulting in spatio-temporal data. The terms sequential data and sequence are sometimes used to refer more narrowly to sequences of discrete attributes or items, which are typically called events. On the other hand, the term timeseries is generally used to refer to real-valued attributes sampled at regular or irregular time intervals. Text and genetic data (such as DNA or RNA sequences) fall into the former category. More specifically, such data generally comes in the form of strings, that is, as sequences of characters that represent occurrences of single items where the order is meaningful, not the positions. The data might consist of a single long sequence or of a database of multiple, typically shorter, sequences.

As with other types of data, most of the work on mining sequential data can be divided into two main tasks, namely segmentation and frequent pattern mining, corresponding to block-based and dictionary-based strategies, respectively (cf. Sect. 2.3). In segmentation problems (a.k.a. change point detection), the aim is to divide the input data into homogeneous blocks or segments, each associated to specific occurrence probabilities of the different events. On the other hand, in frequent pattern mining, the aim is to find recurrent substructures, which are commonly referred to as episodes and motifs when considering sequences and timeseries, respectively.

For illustrative purposes, we consider a toy sequence, shown in Fig. 9, and delineate approaches that follow either strategy. The example shown in Fig. 10 illustrates the block-based strategy and follows the work of Kiernan and Terzi (2008) (cf. Sect. 6.2), whereas the example shown in Fig. 11 illustrates the dictionary-based strategy and follows the work of Tatti and Vreeken (2012b) (cf. Sect. 6.5). While similar to their counterparts for binary tabular data, approaches for temporal data must account for the order that the special dimension of time imposes on the occurrences.

Fig. 9

A toy sequence, with nineteen consecutive occurrences over three distinct events \(x\), \(y\) and \(z\) (top), also represented as a binary matrix (bottom) indicating which event (row) occurs at a given time step (column)

Fig. 10

Block-based strategy, example on the toy sequence of Fig. 9. The sequence can be encoded in a very similar way to a binary tabular dataset (see Sect. 2.3 and Fig. 4). In this case, however, the order of the columns corresponds to time and is therefore fixed, but the events (i.e. rows) can be arranged into different groups in the different time segments (i.e. column groups). More intense shades of orange represent higher probabilities of ones within the corresponding block

Fig. 11

Dictionary-based strategy, example on the toy sequence of Fig. 9. The idea is similar to the one for binary tabular datasets (see Sect. 2.3 and Fig. 3). Episode patterns are assigned codewords (depicted as blue blocks). The original sequence is reconstructed by reading codewords from the pattern stream (\({\mathcal {P}}\)) and inserting the events of the corresponding episode into the sequence, in order. In addition, each multi-event episode in the code table is associated to an additional pair of codewords (depicted as bronze blocks). These codewords, read from the gap stream (\({\mathcal {G}}\)), indicate whether to insert the next element of the current episode (1) or to leave a gap in which the next episode is inserted (0)

6.1 Finding haplotype blocks

A haplotype, or haploid genotype, is a group of alleles of different genes on a single chromosome, which are closely linked and typically inherited as a unit. Several works have been dedicated to the problem of finding haplotype block boundaries, i.e. identifying block structure in genetic sequences (Koivisto et al. 2002; Anderson and Novembre 2003; Greenspan and Geiger 2003; Mannila et al. 2003; Greenspan and Geiger 2004). This requires jointly partitioning multiple aligned strings.

6.2 Segmenting sequences

Several approaches have also been developed for segmenting event sequences more generally.

The method introduced by Kiernan and Terzi (2008) partitions a sequence into time segments, then partitions the events of each segment into groups (see Fig. 10). The proposed algorithm is then extended to allow overlaps and gaps between segments (Kiernan and Terzi 2009a) and a tool to visualise the obtained segmentation is proposed (Kiernan and Terzi 2009b). Wang et al. (2010) further aim to model dependencies between segments.

The algorithm proposed by Lam et al. (2014a) partitions the alphabet into subsets, then separately encodes the sequence projected on each subset of symbols, as well as a sequence that maps each position to the corresponding subset. Chen et al. (2018) adopt a generic point of view on sequence segmentation, considering that the input data can be either univariate or multivariate, consist of categorical or real-valued variables, with no assumption on the underlying distribution. Gautrais et al. (2020) aim to segment a sequence in such a way that each segment contains a collection of recurrent “signature” events. In particular, they consider retail data and apply the approach to the sequences of transactions of individual customers, in order to analyse their shopping behaviour.

6.3 Segmenting timeseries

Hu et al. (2011) (also Hu et al. 2013, 2015) propose an approach to represent a timeseries with an Adaptive Piecewise Constant Approximation (APCA) and define the intrinsic cardinality of the data to be the number of distinct constant values in the approximation. In this approach, a timeseries is encoded by specifying the end-points and value of each segment, then listing reconstruction errors. Vespier et al. (2012) also decompose timeseries into segments, but consider components at multiple time-scales, modeled as piecewise constant or polynomial functions.
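
To illustrate the two-part structure of such scores (a sketch under assumed codes, not the exact encoding of Hu et al. 2011), the description length of a piecewise-constant approximation can be split into the cost of the segments themselves and the cost of the reconstruction errors:

```python
import numpy as np

def apca_description_length(series, breakpoints, n_levels=256):
    """Two-part MDL sketch for a piecewise-constant approximation: each segment
    costs an end-point plus a quantised level, and each residual is charged with
    an Elias-gamma-like cost so that small errors are cheap (assumed codes)."""
    n = len(series)
    segments = np.split(np.asarray(series, dtype=float), breakpoints)
    model_bits = len(segments) * (np.log2(n) + np.log2(n_levels))
    residual_bits = 0.0
    for seg in segments:
        level = np.round(np.mean(seg))
        for e in np.round(seg - level).astype(int):
            residual_bits += 2 * np.floor(np.log2(abs(e) + 1)) + 1
    return model_bits + residual_bits
```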

Matsubara et al. (2014) consider the problem of segmenting multivariate timeseries. A timeseries is modeled using a multi-level hidden Markov model (HMM), where high-level states represent regimes that contain lower level states. Wu et al. (2020) also consider a problem of timeseries segmentation, and model each segment with a Markov chain.

Rakthanmanon et al. (2011) (also Rakthanmanon et al. 2012) consider the problem of clustering sequential data, in particular such as arises from discretised timeseries where each distinct numerical value is mapped to a distinct symbol. The authors argue that not every value is of interest, because sequences tend to contain meaningless transitions, and that the MDL principle can help identify the segments of interest. Begum et al. (2013) (also Begum et al. 2014) use the clusters obtained with this approach for the task of semi-supervised classification.

6.4 Mining substrings

The algorithm devised by Evans et al. (2006) (also Markham et al. 2009) searches for the best set of substrings to encode an input string according to the proposed Optimal Symbol Compression Ratio (OSCR) (Evans et al. 2003). The algorithm, which has been applied primarily to analyse genetic sequences (Evans et al. 2007), is iterative, at each step picking the substring that compresses most and replacing it by a temporary code. Selected substrings can be recursive, in the sense that they contain previously selected substrings. In the end, the selected substrings are assigned codes using Huffman coding.

6.5 Mining episodes from sequences

Lam et al. (2012) propose to encode timestamped sequences with absolute positioning. That is, the positions of covered occurrences are listed separately for each singleton event or selected subsequence. A fixed-length code is used, so all elements (event or position) cost the same and, in particular, occurrences can appear arbitrarily far apart with no penalty. Follow-up work (Lam et al. 2014b) focuses on strings. The proposed algorithms have the same names (SeqKrimp and GoKrimp), but use a different encoding mechanism. Specifically, having constructed a dictionary mapping subsequences to codewords, each match of a selected subsequence is replaced by its associated codeword, followed by Elias codes indicating the gaps between occurrences of the successive events of the subsequence. For a given subsequence, a subroutine is proposed to find the matches with minimum gap cost. Lam et al. (2013) consider a similar problem in a streaming setting. The proposed encoding points back to the previous occurrence of the subsequence, with a flag to indicate when an extended subsequence should be recorded as new, that is, added to the dictionary.
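
For reference, the Elias gamma code, one of the simplest such universal codes, encodes a positive integer n with \(2\lfloor \log_2 n \rfloor + 1\) bits, so that small gaps between successive events are cheap; a minimal sketch:

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code: floor(log2 n) zeros followed by the binary representation of n."""
    assert n >= 1
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

# Gaps of 1, 2 and 5 positions cost 1, 3 and 5 bits respectively.
print([elias_gamma(g) for g in (1, 2, 5)])   # ['1', '010', '00101']
```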

The Sqs (“squeeze”) algorithm of Tatti and Vreeken (2012b) follows a dictionary-based strategy and is similar to Krimp but for sequences (see Fig. 11). Each selected subsequence, or episode, is assigned a codeword representing it, as well as a pair of codewords representing gap (move to next position) and fill (insert event) operations. Gaps are allowed but not interleaving. In other words, gaps must be filled by singletons. This work is then extended in multiple ways: to take into account an ontology over the events, resulting in algorithm Nemo (Grosse and Vreeken 2017), and to add support for rich interleaving and choice of events in patterns, resulting in algorithm Squish (Bhattacharyya and Vreeken 2017).

After focusing on the analysis of seismic data, aiming to cluster and compare seismograms represented as multiple aligned sequences (Bertens and Siebes 2014), Bertens et al. (2016) consider multivariate event sequences more generally and propose algorithm Ditto, which can be seen as an extension of Sqs to handle multivariate patterns. The work constitutes the basis of a dissertation focused on detecting anomalies and mining multivariate event sequences, also in combination, i.e. employing Ditto to detect anomalies in such sequences (Bertens 2017, Chapter 7). Hinrichs and Vreeken (2017) use compression and the Sqs algorithm to analyse the similarities between sequence databases in terms of occurring sequential patterns, focusing mostly on text data. The proposed algorithm, called SqsNorm, provides for sequential data the type of analysis that DiffNorm allows for transactional data (cf. Sect. 3.3).

The approach proposed by Wiegand et al. (2021) is clearly related to Sqs and Squish but aims to summarise entire complex event sequences, rather than capturing fragmentary behaviour. The models considered resemble Petri nets or finite state machines and specify conditional transitions between events. The data is then represented as a succession of instructions that, when fed through the model, allow the original event log to be reconstructed. The authors later present a similar model called event-flow graph (Wiegand et al. 2022). Instead of pattern nodes, this model involves rules defined over attribute vectors associated to the sequences.

Cüppers and Vreeken (2020) aim to identify sequential patterns that reliably predict the impending occurrence of an event of interest. In other words, they look for a set of rules, but in sequential rather than transactional data (cf. Sect. 3.4). Along similar lines, Bourrand et al. (2021a) (also Bourrand et al. 2021b) aim to discover a compact set of sequential rules from a single long event sequence.

Fowkes and Sutton (2016) propose a generative probabilistic model of sequence databases. The authors discuss the connection between probabilistic modeling and description length, and compare their proposed algorithm, which is not based on the MDL principle, to Sqs and GoKrimp.

Ibrahim et al. (2016) consider sequences with timestamps and patterns that consist of subsequences with fixed inter-event times. The data is encoded by listing the patterns, along with their occurrences. More specifically, for each subsequence, the events and inter-event times are specified, as well as the timestamp of the first event of each occurrence. Mitra and Sastry (2019) propose a generative statistical model (a hidden Markov model, HMM) as justification for this encoding. Yan et al. (2018) look for patterns in a stream of sequential data where each event is associated to a timestamp, using a sliding window and maximum gap constraint.

6.6 Mining motifs from timeseries

Tanaka and Uehara (2003) (also Tanaka et al. 2005) look for motifs in timeseries, representing discretised values as integers using a fixed-length code.

Shokoohi-Yekta et al. (2015) consider the problem of extracting rules from timeseries, aiming to match shapes rather than the precise values. The proposed score is used to evaluate the consequent of candidate rules, making it possible to compare consequents of different lengths. The score evaluates the compression gain resulting from specifying the motif once and then listing the errors for each occurrence, instead of listing the actual values for each occurrence. This is applied to evaluate candidates individually, not as a set.
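
The following sketch illustrates this kind of gain computation, under an assumed raw cost per value and an assumed cheap code for small errors; it is only an illustration of the idea, not the exact score of Shokoohi-Yekta et al. (2015):

```python
import numpy as np

def consequent_gain(occurrences, motif, bits_per_value=32):
    """Compression gain of describing the motif once and listing per-occurrence
    errors, compared to listing the raw values of every occurrence (assuming all
    occurrences have the same length as the motif)."""
    raw_bits = sum(len(o) for o in occurrences) * bits_per_value
    motif_bits = len(motif) * bits_per_value
    error_bits = 0.0
    for o in occurrences:
        errors = np.round(np.asarray(o, dtype=float) - np.asarray(motif, dtype=float)).astype(int)
        error_bits += sum(2 * int(np.floor(np.log2(abs(e) + 1))) + 1 for e in errors)
    return raw_bits - (motif_bits + error_bits)   # positive means the motif compresses
```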

6.7 Mining periodic patterns

Exploiting regularities not only about what happens, that is, finding coordinated event occurrences, but also about when it happens, that is, finding consistent inter-occurrence time intervals, can allow the data to be compressed further.

In the context of “smart homes” and health monitoring, Heierman et al. (2004) look for periodically repeating events or sets of events in a sequence, with an MDL criterion to identify interesting candidates, which are then used to automatically construct a Markov model (HPOMDP) (also Heierman and Cook 2003; Das and Cook 2004; Youngblood et al. 2005). The work of Rashidi and Cook (2013) shares the same context and goal, further accounting for discontinuities in the repetitions and variations in the order of the events.

Galbrun et al. (2018) introduce patterns involving nested periodic recurrences of different events and a method for constructing them by combining simple cycles into increasingly complex and expressive patterns.

6.8 Trajectories

Phan et al. (2013) consider data that represent a collection of moving objects, each associated at each timestamp with a geospatial position. As a pre-processing step, the objects must be clustered based on proximity, separately for each timestamp. An object is allowed to belong to several clusters at any given timestamp. The goal is then to find a sequence of clusters (at most one per timestamp) having objects in common. The result is called a swarm and is intended to represent objects moving together. When encoding the trajectory of an object (as a sequence of clusters), patterns can be used whole, or from/to an intermediate position.

Also considering spatio-temporal data but in a different scenario, Zhao et al. (2019) aim to mine frequent patterns in trajectories over a road network. Mapping the trajectories to the corresponding road segments effectively turns the problem into a frequent sequence mining problem. The MDL principle is used to formulate a problem of trajectory spatial compression, addressed using a dictionary-based strategy.

7 Discussion

Our goal here is to open a discussion on issues relevant to MDL-based methods for pattern mining. In particular, the design of the encoding is a crucial ingredient when developing such a method. We consider different questions that might be raised by the involved choices, regarding, in particular, conformity and suitability with respect to the MDL principle. To illustrate the discussion, we point to various works listed in the previous sections.

7.1 Encoding in question

Alleged infractions of the MDL principle can be of different kinds and degrees of severity. They include cases where (i) the assignment of codewords ignores information theory, (ii) the proposed encoding is not functional because some information is missing, and (iii) the proposed encoding clearly cannot achieve a good compression due to the presence of unnecessary, unjustified terms.

Several methods assign the same, typically unit, cost to all encoded elements (which might be items, nodes, edges, events, timestamps, etc. depending on the case, or even entire patterns), so that the description length is simply the number of encoded elements (see for instance Navlakha et al. (2008) in Sect. 5.1, and Phan et al. (2013); Zhao et al. (2019) in Sect. 6.8). Some authors motivate this choice by the need to avoid penalising large values, or to circumvent other encoding issues (see for instance Ibrahim et al. (2016); Lam et al. (2012) in Sect. 6.5). In the method proposed by Yan et al. (2018) (cf. Sect. 6.5), for instance, the cost of a pattern is first defined to be the number of characters used to represent it, that is, the number of events plus the number of timestamps. In a second version of the method, this cost is then defined to be equal to one for all patterns, reportedly to avoid bias against patterns involving more events.

Transmitting the dataset through a binary communication channel as efficiently as possible is a thought experiment of sorts that motivates the score. If short codewords are assigned to specific elements because they are deemed more valuable and useful, then other elements will have to be assigned longer codewords, because not everything can be transmitted cheaply. One can think of the fundamental limits of information theory, through this compression scenario, as forcing the designer of the method to make choices as to what they consider important and interesting. One might argue that using unit costs corresponds to using a fixed-length code and rescaling everything for convenience. This indeed simplifies the design of the encoding, as it avoids making decisions, and in this sense short-circuits the principle.
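
As a toy illustration of this point (with made-up usage counts), code lengths derived from usage make frequent elements cheap and rare ones expensive, whereas a fixed-length or unit-cost scheme charges every element the same:

```python
import math
from collections import Counter

usage = Counter({"A": 8, "B": 4, "C": 2, "D": 2})   # made-up usage counts
total = sum(usage.values())
fixed_length = math.log2(len(usage))                # 2 bits for every element

for element, count in usage.items():
    shannon_length = -math.log2(count / total)      # optimal code length in bits
    print(f"{element}: {shannon_length:.1f} bits (optimal) vs {fixed_length:.1f} bits (fixed)")
```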

Small coding elements, such as those required, for example, to delimit patterns in the code table, are often omitted. This is sometimes done deliberately, putting forth, in particular, the use of a pre-defined framework to be filled with the relevant values, which is common to all models and can therefore be ignored (see for instance Vreeken et al. (2011) in Sect. 3, and van Leeuwen and Galbrun (2015) in Sect. 3.4), but is sometimes left unexplained and might seem accidental. In the approach proposed by Lam et al. (2014b) (cf. Sect. 6.5), it is unclear how the receiver knows where the codewords end when decoding the dictionary. On the other hand, Tanaka and Uehara (2003) (cf. Sect. 6.6) use a fixed-length code to encode values from a set, but the number of distinct values, which varies across models, is not transmitted, so that the receiver cannot deduce the codeword length, and hence cannot decode the message.

More substantial pieces might also be missing. For instance, the encodings proposed by Lam et al. (2014a) (cf. Sect. 6.2) and by Hu et al. (2011) (cf. Sect. 6.3) do not account for the transmission of the assignment of symbols to subsets and of the mapping of offset values to codewords for the corrections, respectively, which are needed to reconstruct the data.

Explanations about the encoding are sometimes kept at the level of intuitions, and the details provided can be insufficient to properly understand how it works (see for instance Khan et al. (2015b) in Sect. 5.1, Matsubara et al. (2014) in Sect. 6.3, Heierman et al. (2004); Rashidi and Cook (2013) in Sect. 6.7, and Phan et al. (2013) in Sect. 6.8).

Arguably, ensuring decodability would in some cases require only minor modifications of the encoding scheme, and would likely have no major impact on the results. Furthermore, how much effort should be spent ensuring that the proposed encoding works is debatable, since it will never be used in practice.

The choice of encoding can sometimes seem sub-optimal or ill-suited, or introduce undesirable bias. For instance, Navlakha et al. (2008) (cf. Sect. 5.1) list edge corrections that should be applied to the reconstructed graph, indicating for each one the sign of the correction. It seems, however, that this information is unnecessary, as it can be inferred from the reconstructed graph, by checking whether the edge is present (must be deleted) or absent (must be added). Hu et al. (2011) (cf. Sect. 6.3) encode a list of value corrections using Huffman coding, meaning that having few distinct but recurrent error values is rewarded, not necessarily small ones. Using a universal code for the corrections would instead encourage small error values, which might be more intuitive. In any case, it is advisable to lay bare and motivate the potential biases introduced by the choice of encoding, whenever possible.

Considering a sequence, Lam et al. (2013) (cf. Sect. 6.5) encode the occurrence of an event or subsequence by pointing back to the position of the first occurrence. Pointing back, instead, to the position of the last encountered occurrence would require encoding smaller values and might lead to savings. Keeping track of the order in which the patterns were last encountered and referring to the position in that list, so that repetitions of the same pattern do not fill up the list, is another alternative. There are often different ways to achieve the same purpose, not necessarily with a clear overall best choice. In addition to pointing back to previous occurrences, Lam et al. (2013) maintain a dictionary of patterns. It is unclear whether the dictionary is actually needed for the encoding, or is primarily used to recover the encountered patterns.

What is part of the encoding of the model and what is part of the encoding of the data given the model is sometimes not entirely obvious. For example, the algorithm of Lam et al. (2014a) (cf. Sect. 6.2) encodes a sequence by partitioning the alphabet and considering separately the subsequences over each subset of symbols. The authors present the term that corresponds to the assignment of positions to subsets as part of the encoding of the model. Debatably, it can be considered instead as part of the encoding of the data given the model, while the assignment of symbols to subsets, which is ignored, would belong to the encoding of the model. Besides, encodings often actually consist of three terms: (i) a description of the set of patterns (the model), (ii) information to reconstruct the data using these patterns, and (iii) a list of corrections to apply to the reconstructed data in order to recover the original data, with the latter two together representing the data given the model.

7.2 Code of choice

Prequential plug-in codes, and refined codes more generally, provide means to avoid unwanted bias arising from arbitrary choices in the encoding (cf. Sect. 2).
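
For instance, a prequential plug-in code encodes each symbol with the probability estimated from the symbols transmitted so far, so that sender and receiver need not agree on a distribution in advance; a minimal sketch, where a pseudocount of 0.5 corresponds to the Krichevsky–Trofimov estimator:

```python
import math

def prequential_code_length(sequence, alphabet, pseudocount=0.5):
    """Plug-in code length: each symbol is charged -log2 of its probability as
    estimated from the counts of previously encoded symbols."""
    counts = {a: pseudocount for a in alphabet}
    total = pseudocount * len(alphabet)
    bits = 0.0
    for symbol in sequence:
        bits += -math.log2(counts[symbol] / total)
        counts[symbol] += 1
        total += 1
    return bits

print(prequential_code_length("aababba", alphabet="ab"))
```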

For instance, Budhathoki and Vreeken (2015) use prequential coding for the itemset occurrences in the DiffNorm algorithm (cf. Sect. 3.3). The choice is especially relevant in this scenario where the goal is to contrast the itemset make-up of different datasets, and not to inspect the usage of itemsets in a particular dataset. Bhattacharyya and Vreeken (2017) as well as Wiegand et al. (2021) use prequential coding for the streams that contain information about pattern occurrences (cf. Sect. 6.5). Other recent works (cf. Sects. 4.2, 4.6 and 5.5, respectively Faas and van Leeuwen 2020; Makhalova et al. 2020; Bloem and de Rooij 2020) also use prequential coding, while Bertens et al. (2016); Hinrichs and Vreeken (2017) (cf. Sect. 6.5) both explicitly suggest upgrading the current encoding with a prequential code, as a direction for future work. Going further, Proença et al. (2021a) improved on their earlier work (Proença and van Leeuwen 2020a) (cf. Sect. 3.4) by replacing prequential coding, which is only asymptotically optimal, with normalised maximum likelihood (NML), which is optimal for fixed sample sizes, employing similar techniques as Kontkanen and Myllymäki (2007) (cf. Sect. 4.6).

However, modern Bayesian and NML codes can be challenging to compute, or even downright infeasible. Furthermore, one-part codes can be less intuitive than two-part codes, and do not provide as direct an access to information about pattern usage. For instance, Mampaey and Vreeken (2010) (cf. Sect. 4.1) compare two encodings, with and without prequential coding, and, obtaining similar results, choose to proceed with the latter as it is more intuitive. All in all, modern refined codes have improved theoretical properties, but using them to build better methods comes with some challenges.

7.3 The letter or the spirit

Some approaches use the MDL principle to score and compare individual candidate patterns, rather than evaluating them in combination (see for instance Cook and Holder (1994) in Sect. 5.5, Shokoohi-Yekta et al. (2015) in Sect. 6.6, as well as Heierman et al. (2004); Rashidi and Cook (2013) in Sect. 6.7).

Considering a two-view dataset, i.e. a dataset consisting of two tables, the approach proposed by van Leeuwen and Galbrun (2015) assumes knowledge of one table to encode the other, and vice versa. Arguably, this approach, like other MDL-based approaches, does not correspond to a practical encoding, but here not even to a realistic compression scenario; it nevertheless serves as a reasonable motivation for the proposed score.

The proposed score might actually be entirely ad-hoc, in the sense that it does not correspond to the length of an encoding that could be used to represent the data (see for instance Makhalova et al. (2019a) in Sect. 4.6). One might reasonably devise and justify an evaluation measure suited to the problem at hand, but labelling it as following the MDL principle is arbitrary and inappropriate, short of an explanation of how this corresponds to encoding, and can only lead to confusion.

Authors sometimes approach the topic with caution and include disclaimers stating that their proposed methods are inspired by or in the spirit of the MDL principle (see for instance Shokoohi-Yekta et al. (2015) in Sect. 6.6). This can be seen as a way to allow oneself to take some liberties with the principle, indeed considering it as a source of inspiration rather than as law, but also as a way to preventively fend off criticism and accusations of heresy. There is indeed a range of opinions about how closely one must conform to the MDL principle and to information theory.

7.4 Making comparisons

How to make meaningful comparisons between compression-based scores and between corresponding results requires careful consideration. For instance, one might ponder whether the compression achieved for a dataset is an indication of how much structure is present in it, or at least how much could be detected, and to what extent it can serve as a measure of the performance of the algorithm.

Does it make sense to compare the length of a dataset encoded with the proposed scheme to the length of the original unencoded data? And is it a problem if the latter is shorter? Keeping in mind that compression is used as a tool for comparing models, rather than for practical purposes, we answer both questions in the negative. The compression achieved with the simplest model, be it the code table containing only elementary patterns such as the singleton itemsets, known as the standard code table in Krimp (Vreeken et al. (2011) in Sect. 3), for dictionary-based approaches (cf. Fig. 3(i)) or the single-block model for block-based approaches (cf. Fig. 4(i)), is often taken as a basis for comparison. The ratio of the description length achieved with a considered model to that achieved with the elementary model, known as the compression ratio, is then computed and used to compare different models, with lower compression ratios corresponding to better models. This is a way to normalise the scores and allow more meaningful comparisons and evaluations.
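
With \(L(M, D)\) loosely denoting the total description length of the data \(D\) encoded with model \(M\), and \(M_0\) denoting the elementary model, the compression ratio described above is simply

\[ L\% = \frac{L(M, D)}{L(M_0, D)}, \]

with lower values indicating better models.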

A direct comparison of the raw description lengths, in terms of numbers of bits, of the same data encoded with different methods is typically not meaningful. For instance, it does not make sense to compare the description lengths reported in Fig. 3 to those reported in Fig. 4. Comparing compression ratios across different methods is not really meaningful either, in general. Indeed, an easy way to win this contest would be to design an artificial encoding that penalises the use of elementary patterns very heavily. If the different methods handle compatible pattern languages, it can be of interest to compare the compression ratios achieved when taking as model, in turn, the set of patterns selected by each method and applying either encoding, as this might shed some light on the respective biases of the methods. If the pattern languages are not compatible, then no quantitative comparison can be devised easily and great care must be taken to choose suitable encodings. Qualitative evaluations of the obtained patterns are valuable, despite being subjective and domain dependent. In the end, finding a good set of interesting and interpretable patterns is what matters.

7.5 Beyond mining patterns with MDL

Finally, we highlight two directions of research that do not fall strictly within the realm of MDL-based pattern mining methods, yet are clearly related, constitute recently active and fruitful research topics, and might therefore be of interest to the reader.

The first direction consists of studies of correlation and causality that build on algorithmic information theory in general and, for a few of them, on MDL-based pattern mining techniques in particular. For instance, Budhathoki and Vreeken (2017a) propose two algorithmic correlation measures and present practical instantiations based on the MDL principle, using the Slim and Pack algorithms (cf. Sects. 3.2 and 4.1, respectively). They also propose to use an MDL score to infer the direction of causality between pairs of discrete variables (Budhathoki and Vreeken 2017b).

The second direction consists in the development of pattern mining methods relying on a different modeling approach, also grounded in information theory, namely on maximum entropy modeling. In particular, De Bie et al. (2010) introduced a framework for data mining based on maximum entropy modeling, sometimes referred to as the FORSIED framework, for Formalising Subjective Interestingness, from which models are derived for different types of assumptions, data and patterns (including for instance Kontonasios and De Bie 2012; Kontonasios et al. 2013; van Leeuwen et al. 2016; Adriaens et al. 2019; Puolamäki et al. 2020).

8 Conclusion

After giving an outline of relevant concepts from information theory and coding, we reviewed MDL-based methods for mining various types of data and patterns. In particular, we focused on aspects related to the design of an encoding scheme, rather than, for instance, on algorithmic issues, since the former constitutes the most distinctive ingredient of MDL methodologies, but also a major stumbling block and source of contention. We pointed out two main strategies that underpin the majority of approaches and that can be used to categorise them. Namely, we distinguished dictionary-based approaches from block-based approaches. Then, we considered some discussion points pertaining to the use of MDL in pattern mining, and highlighted related problems that constitute promising directions for future research. Indeed, there is still room for further development in mining patterns with MDL-inspired methods, and beyond.