The Minimum Description Length Principle for Pattern Mining: A Survey

This is about the Minimum Description Length (MDL) principle applied to pattern mining. The length of this description is kept to the minimum. Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The MDL principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim of obtaining compact, high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, as well as of work on the theory behind the MDL and similar principles, we review MDL-based methods for mining various types of data and patterns. Finally, we open a discussion on some issues regarding these methods, and highlight currently active related data analysis problems.


Introduction
Our aim is to review the development of pattern mining methods based on or inspired by the Minimum Description Length (MDL) principle. We strive for completeness in our coverage of these methods, although this is admittedly an unrealistic goal.
Before we go any further, let us explain more precisely what we consider, for the present purpose, to constitute patterns. Patterns are about repetitions. We adopt the point of view that patterns express the repeated presence in the data of particular items, attribute values or other discrete properties. We divide them into two main categories.
Itemsets are strict conjunctive patterns that require the occurrence of all involved items for the pattern to be considered as occurring in a transaction. For a given dataset, it is thus straightforward to determine where an itemset occurs and where it does not, that is, to compute the pattern's support. Vice versa, when it is known that an itemset occurs in a given transaction, no further information is necessary, as it implies that all items must be present. These concepts naturally extend beyond transactional data. However, such occurrence requirements are fairly strict, and it can be useful to consider more relaxed patterns. In particular, one might use disjunctions, allowing patterns to express a choice between involved items or attributes. Given a dataset, one can then still straightforwardly determine where a pattern holds. On the other hand, knowing that the pattern occurs no longer unambiguously provides information about which item or attribute is present. More or less additional information is needed, depending for instance on whether it is an inclusive or exclusive disjunction. We refer to such patterns that express the presence of a specific substructure in the data as substructure patterns.
As an alternative way to relax occurrence requirements, patterns might express that selected items or properties are typically present but not all need always occur, for instance by means of density thresholds. In that case, which data instances belong to the support of the pattern, or vice-versa where each item or property holds, needs to be specified explicitly as it is not directly implied. Rather than the occurrence of a specific substructure, it is then the homogeneity of repetitions within the area delimited by the selected instances and attributes that is of interest. Because such patterns can be seen as delineating homogeneous rectangles in the data, we refer to them as block patterns.
Furthermore, one is typically looking for a collection of patterns and might impose constraints on the overlap between them, that is, require that the patterns involve disjoint sets of attributes, characterise disjoint sets of instances, or both. In particular, the patterns might be required to form a partition of the data, dividing all instances and all attributes into disjoint subsets. Such a partitioning requirement is incompatible with a strict occurrence requirement in practice, in the sense that it is not in general possible to identify a collection of substructure patterns that forms a partition of the data. On the other hand, block patterns might be required to form a partition of the data, corresponding roughly to biclustering approaches, or they might be allowed to overlap, as in tiling approaches.
In summary, we adopt a rather broad definition of what constitutes patterns, from itemsets to biclusters over discrete data. However, we stop short of considering clusters in general as patterns. Clustering constitutes another important field of data mining, beside and partially overlapping with pattern mining. There, the goal is to organise data instances into groups such that instances within the same group are similar to each other and dissimilar to instances in other groups. Clustering often handles continuous data, typically relying on a concept of distance. Here, we focus on formalisms and methods that are by nature more discrete.
The reader is expected to be familiar with common pattern mining tasks and techniques, but not necessarily with concepts from information theory and coding, of which we therefore give an overview in Section 2, before presenting the two main encoding strategies that use respectively substructure and block patterns. Background work is covered in Section 3. We start with the theory behind the MDL principle and similar principles. Then, we go over a few examples of uses of the principle in the adjacent fields of machine learning and natural language processing. We end with an overview of methods that involve practical compression as a data mining tool or that consider the problem of selecting itemsets. In Sections 4-7, we turn to the review of MDL-based methods for pattern mining proper. The methods are grouped first by the type of data, then by the type of patterns considered, as outlined in Figure 1. Our focus is on the various encodings designed for these different types of data and patterns, rather than on algorithmic issues related to searching the patterns. In Section 4, we start with one of the thickest branches, stemming from the Krimp algorithm for mining itemsets from transactional data. We continue with itemsets and with other types of patterns for tabular data in Section 5, followed by graphs and temporal data in Sections 6 and 7, respectively. Finally, we consider some discussion points, in Section 8, before highlighting related problems that have recently attracted research interest, in Section 9.
Sections contain lists of references ordered by publication year, to provide a better overview of the development in the corresponding sub-field. To keep things simple, even though sometimes related to different sub-fields, each publication is assigned to a single sub-field, that to which it is considered most relevant. For ease of reference, a complete bibliography ordered alphabetically is included at the end. A version of this survey has been published in a peer-reviewed journal. In addition, the main characteristics and bibliographic details of publications from Sections 4-7 have been collected into a searchable online resource.

Figure 1: Organisation of Sections 4-7. MDL-based methods for pattern mining are grouped first by the type of data (tabular, graph or temporal), then by the type of patterns and strategies considered (blocks or substructures). Numbers refer to sections. The three main data types and their subtypes are represented by coloured shapes. Simple unlabelled graphs can be represented as binary matrices. Thus, some methods designed for binary data can be applied to them, and vice versa, some graph-mining methods in effect process binary data. The corresponding sections are therefore represented as lying at the intersection between binary and graph data. Dashed and dotted lines are used to group methods associated to the two main strategies (cf. Section 2.3). Block-based strategies are used to mine block patterns (also tiles and segments) that group together elements that are similarly distributed. On the other hand, dictionary-based strategies are used to mine substructure patterns (also motifs and episodes) that capture specific arrangements and co-occurrences between elements.
Imagine that some object needs to be stored on a device or transmitted over a communication channel. The system that maps the object to its description, i.e. the representation that is actually stored or transmitted, and back is called a code. The processes of mapping the object to its description and of reconstructing it are called encoding and decoding, respectively. The considered storage or channel is typically binary, meaning that the object is mapped to a binary sequence, i.e. a sequence over the alphabet {0, 1}, whose length is hence measured in bits. There have been many studies of communication through noisy channels, that is, when errors might be introduced into the transmitted sequence, and of how to recover from them, but this is not of interest to us. Instead of noise-resistant codes, we focus purely on data compression, on obtaining compact descriptions. In general, data compression can be either lossless or lossy, depending on whether the source object can be reconstructed exactly or only approximately.
Typically, encoding an object means mapping its elementary parts to binary codewords and concatenating them. Care must be taken to ensure that the resulting bit-string is decodable, that is, that it can be broken down back into the original codewords from which the parts can be recovered. For instance, imagine the data consist of the outcomes of five rolls of a twelve-sided die, i.e. a list of five integers in the interval [1, 12]. If we simply turn the values into their binary representations and concatenate them, we might not be able to reconstruct the list of integers. For example, there is no way to tell whether 1110110110 stands for 11 10 1 10 110, i.e. 3, 2, 1, 2, 6, or for 1 110 1 101 10, i.e. 1, 6, 1, 5, 2. To avoid this kind of confusion, we want our code to be such that there is a single unambiguous way to split an encoded string into codewords. One strategy is to use separators. For instance, we might represent each integer by as many 1s followed by a terminating 0, so that 3, 2, 1, 2, 6 becomes 1110110101101111110.
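As a concrete illustration, here is a small Python sketch of this terminated unary code (the function names are ours, for illustration only); it reproduces the 19-bit string above and decodes it back.

```python
def unary(x):
    # Represent a positive integer x as x ones followed by a terminating zero,
    # so that a concatenation of such codewords can be split unambiguously.
    return "1" * x + "0"

def decode_unary(bits):
    # Split on the terminating zeros and read off the run lengths.
    return [len(run) for run in bits.split("0")[:-1]]

encoded = "".join(unary(x) for x in [3, 2, 1, 2, 6])
print(encoded)                 # 1110110101101111110, as in the example above
print(decode_unary(encoded))   # [3, 2, 1, 2, 6]
```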
More generally, this is where the prefix-free property becomes very useful. A prefix-free code (also confusingly known as a prefix code or an instantaneous code) is such that no codeword is a prefix of another codeword.
Fixed-length codes (a.k.a. uniform codes), that assign distinct codewords of the same length to every symbol in the input alphabet, clearly satisfy the prefix-free property. For an input alphabet consisting of n distinct symbols, the codewords must be of length ⌈log 2 (n)⌉. Such a code minimises the worst-case codeword length, that is, the longest codeword is as short as possible. With such a code, every symbol is worth the same. Therefore it is a good option for pointing out an element among canonically ordered options under a uniform distribution, without a priori bias. For example, the sequence baeaecdaeeccbc over the five-letter alphabet a, b, c, d, e might be encoded in 42 bits, as 001 000 100 000 100 010 011 000 100 100 010 010 001 010 .
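For illustration, here is a minimal sketch of such a fixed-length encoding (the variable names are ours); with five symbols, every codeword takes ⌈log2(5)⌉ = 3 bits, which gives the 42-bit encoding quoted above.

```python
from math import ceil, log2

# Fixed-length (uniform) code for the example: with a five-letter alphabet,
# every symbol gets a distinct codeword of ceil(log2(5)) = 3 bits.
alphabet = "abcde"
width = ceil(log2(len(alphabet)))
code = {s: format(i, f"0{width}b") for i, s in enumerate(alphabet)}
encoded = "".join(code[s] for s in "baeaecdaeeccbc")
print(code)           # {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100'}
print(len(encoded))   # 42 bits, as in the example above
```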
When symbols are not uniformly distributed, using codewords of different lengths can result in codes that are more efficient on average. There are only so many short codewords available, so one needs to choose wisely what they are used for. Intuitively, symbols that are more frequent should be assigned shorter codewords. The Kraft-McMillan inequality (also known simply as the Kraft inequality) gives a necessary and sufficient condition for the existence of a prefix-free code with given codeword lengths. Specifically, it states that, for a finite input alphabet A, the codeword lengths of any prefix-free code C must satisfy

Σ_{x ∈ A} 2^(−L_C(x)) ≤ 1 ,

where L_C(x) denotes the length of the codeword assigned by C to symbol x. Vice versa, given codeword lengths satisfying this inequality, there exists a prefix-free code with these codeword lengths. Furthermore, if P is a probability distribution over a discrete input alphabet A, there exists a prefix-free code C such that for all x ∈ A, L_C(x) = ⌈− log2 P(x)⌉. A pair of related techniques to construct such a variable-length prefix-free code is commonly referred to as Shannon-Fano coding. Moreover, given an input alphabet A where each symbol x_i has an associated weight w_i that might represent, in particular, its frequency of occurrence, a code C is optimal if for any other code C′ we have

Σ_i w_i · L_C(x_i) ≤ Σ_i w_i · L_{C′}(x_i) .

Huffman's algorithm is an ingeniously simple algorithm for constructing an optimal prefix-free code. Considering the example sequence baeaecdaeeccbc again, Huffman's algorithm would assign shorter codewords to the more frequent letters a, c and e, while ensuring that the prefix-free property is satisfied, allowing for instance to encode the sequence in just 31 bits, as 011 00 11 00 11 10 010 00 11 11 10 10 011 10.
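The construction can be sketched in a few lines of Python (an illustrative implementation, not an optimised one). With the occurrence counts of the example sequence it yields an optimal prefix-free code totalling 31 bits, matching the encoding above; the exact codewords may differ under other tie-breaking rules.

```python
import heapq
from collections import Counter

def huffman_code(weights):
    """Build an optimal prefix-free code from symbol weights (e.g. counts).
    Returns a dict mapping each symbol to its binary codeword."""
    # Heap entries: (total weight, tie-breaker, {symbol: codeword-so-far})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                   # degenerate single-symbol case
        return {s: "0" for s in weights}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # merge the two lightest subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + cw for s, cw in c1.items()}
        merged.update({s: "1" + cw for s, cw in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

sequence = "baeaecdaeeccbc"
code = huffman_code(Counter(sequence))
encoded = "".join(code[s] for s in sequence)
print(code)           # {'a': '00', 'd': '010', 'b': '011', 'c': '10', 'e': '11'}
print(len(encoded))   # 31 bits, as in the example above
```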
Note that to use such a code, the alphabet and the associated probability distribution must be shared by the sender and the receiver. Indeed, it is not enough that the receiver is able to recover the transmitted codewords, the receiver must also know which symbol is represented by each codeword. In order to transmit a sequence of symbols, a simple way to proceed is therefore to first transmit the information about the distribution, then to transmit the actual sequence using Shannon-Fano coding, resulting in a two-part code. For example, we would first need to transmit the occurrence counts of the five letters, or equivalently their assigned codewords, according to some agreed protocol, before sending the actual encoded 31-bit sequence.
Relying on a sequential prediction, or prequential, strategy provides an alternative for coding that does not require the probability distribution to be known a priori. Simply put, the idea is to start with some initial probability distribution over the alphabet, e.g. uniform, and at each step encode the next element using a prefix-free code based on the current distribution, then update the probability distribution to account for this occurrence. These predictive plug-in codes have useful properties. In particular, the total code length does not depend on the order in which the elements are encoded, and is within a constant factor of the optimal.
Let us consider the sequence baeaecdaeeccbc as an example once more. We might start with occurrence counts initialised to one for all five letters, and run Huffman's algorithm to assign them codewords accordingly. Having agreed on this protocol, the sender and the receiver obtain the same initial codewords, without the sender first having to explicitly share the information about the distribution. The sender first transmits the codeword for letter b, which the receiver can correctly decode using the initial code. Both sides increment their occurrence count of b by one, and update their codewords by running Huffman's algorithm again. Note that at this point, b has the highest occurrence count and will thus be assigned a short codeword (2 bits). Next, the sender transmits the new codeword for a, which the receiver is able to correctly decode. The occurrence counts are incremented and the codewords updated, on both sides in parallel. And so on, until the entire sequence has been encoded. Crucially, as they apply the same updating protocol in parallel, relying only on information shared so far, sender and receiver maintain identical codes at every step. Later in the process, the occurrence counts of the letters approach their overall frequencies and the lengths of the assigned codewords converge towards those obtained using Shannon-Fano coding with prior knowledge of the distribution.
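Below is a minimal sketch of the resulting code length (function name and defaults are ours). It uses idealised, non-integer codeword lengths of − log2 of the current estimated probability rather than actual Huffman codewords; the initial counts of one per symbol correspond to the protocol described above, and reordering the sequence leaves the total unchanged.

```python
from math import log2

def prequential_length(sequence, alphabet, init=1.0):
    """Total code length (in bits) of a prequential plug-in code: each symbol
    is charged -log2 of its currently estimated probability, after which the
    counts are updated (on both the sender's and the receiver's side)."""
    counts = {s: init for s in alphabet}
    total = init * len(alphabet)
    bits = 0.0
    for s in sequence:
        bits += -log2(counts[s] / total)
        counts[s] += 1
        total += 1
    return bits

seq = "baeaecdaeeccbc"
print(round(prequential_length(seq, "abcde"), 3))
print(round(prequential_length(seq[::-1], "abcde"), 3))  # same total: the code
# length does not depend on the order in which the symbols are encoded
```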
More recent advances in MDL and model selection theory introduced Bayesian and normalised maximum likelihood (NML) codes. Like prequential codes, both are one-part codes. In contrast to crude codes, such refined codes remove the need to explicitly encode the parameters of the distribution, thereby avoiding the associated bias, and have useful properties, including optimality guarantees. In particular, the term universal is commonly used to refer to a code that, to put it simply, performs essentially as well as the best-fitting code for the input, for any possible input. This use of universal should not be confused with another common use, referring to codes for representing integers, which we present next. On the downside, refined one-part codes are not easily explained in terms of practical encoding and are not always computationally feasible.
Finally, a universal code for integers is a prefix-free code that maps positive integers to binary codewords. Such a code can be used to encode integers when no upper bound on their value can be determined a priori. Elias codes, which come in three variants, namely the Elias gamma, Elias delta and Elias omega codes, are universal codes commonly used for this purpose. For instance, an integer x ≥ 1 is encoded using the Elias gamma code as ⌊log2(x)⌋ zeros followed by the binary representation of x. These codes penalise large values, since larger integers are assigned longer codewords.
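A small sketch of the Elias gamma code, following the description above (the helper name is ours):

```python
def elias_gamma(x):
    """Elias gamma code for a positive integer x: floor(log2(x)) zeros
    followed by the binary representation of x."""
    if x < 1:
        raise ValueError("Elias gamma codes are defined for integers x >= 1")
    binary = bin(x)[2:]                      # binary representation of x
    return "0" * (len(binary) - 1) + binary  # len(binary) - 1 == floor(log2(x))

for x in (1, 2, 3, 9, 17):
    print(x, elias_gamma(x))
# 1 -> '1', 2 -> '010', 3 -> '011', 9 -> '0001001', 17 -> '000010001'
```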

Applying the MDL principle in pattern mining
It is important to note that when applying the MDL principle, compression is used as a tool to compare models, rather than as an end in itself. In other words, we do not care about actual descriptions, only about their lengths. Furthermore, we care less about the absolute description length achieved with a particular model than about how it compares to the lengths achieved with other candidate models. This has a few important consequences.
First, as the code does not need to be usable in practice, the requirement of integer code lengths can be lifted, allowing for finer comparisons. In particular, this means that for a discrete input alphabet A with probability distribution P , the most reasonable choice for assigning codewords is a prefix-free code C such that for all x ∈ A, L C (x) = − log 2 P (x) . This is often referred to as Shannon-Fano coding, although no underlying practical code is actually required, or even considered. Note that the entropy of P corresponds to the expected number of bits needed to encode in this way an outcome generated by P .
Second, elements that are necessary in order to make the encoding system work but are common to all candidate models might be left out, since they play no role in the comparison. Third, to ensure a fair comparison, only lossless codes are considered. Indeed, comparing models on the basis of their associated description lengths would be meaningless if we are unable to disentangle the savings resulting from a better ability to exploit the structure of the data, versus from increased distortion in the reconstruction. In some cases, this simply means that corrections that must be applied to the decoded data in order to achieve perfect reconstruction are considered as part of the description.
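Returning to the first consequence above, here is a tiny illustration with a made-up distribution: the idealised codeword length of an outcome x is − log2 P(x), and the expected length under P is the entropy of P.

```python
from math import log2

# Idealised (real-valued) codeword lengths -log2 P(x) for a toy distribution,
# and the entropy of P, i.e. the expected code length for outcomes drawn from P.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = {x: -log2(p) for x, p in P.items()}
entropy = sum(p * l for p, l in zip(P.values(), lengths.values()))
print(lengths)   # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}
print(entropy)   # 1.75 bits
```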
Though one-part codes might be used for components, at the highest level most proposed approaches use the crude two-part MDL, which requires to first explicitly describe the model M , then describe the data using that model, rather than refined one-part MDL, which corresponds to using the entire model class M to describe the data. That is because the aim is not just to know how much the data can be compressed, but how that compression is achieved, by identifying the associated model. The overall description length is the sum of the lengths of the two parts, so the best model M ∈ M to explain the data D is the one that minimises L(M, D) = L(M ) + L(D | M ), where L denotes description lengths in bits, as above. The two parts can be understood as capturing respectively the complexity of the model and the fit of the model to the data, and the MDL principle as a way to strike a balance between the two.
One can also view this from the perspective of probabilistic modeling. Consider a family of probability distributions parameterised by some set Θ, that is, M = {p_θ : θ ∈ Θ}, where each p_θ assigns probabilities to datasets. In addition, consider a distribution π over the elements of M. In this context, in accordance with the Bayesian framework, the best explanation for a dataset D is the element of M minimising

− log2 π(p_θ) − log2 p_θ(D) ,

where the two terms are the (negative logarithm of the) prior and the negative log likelihood of the data, respectively. When using a uniform prior, this means selecting the maximum likelihood estimator. Since codes can be associated to probability distributions, we can see the connection to the two-part MDL formulation above, where the term representing the description length of the model, L(M), corresponds to the prior, and the term representing the description length of the data given the model, L(D | M), corresponds to the likelihood.
The Bayesian Information Criterion (BIC) is a closely related model selection method that aims to find a balance between the fit of the model, measured in terms of the likelihood function, and the complexity of the model, measured in terms of the number of parameters k. Denoting as n the number of data points in D, it can be written as k log n − 2 log p θ (D).
When applying the MDL principle to a pattern mining task, the models considered consist of patterns, capturing structure and regularities in the data. Depending on the type of data and the application at hand, one must decide what kind of patterns are relevant, i.e. (i) a pattern language must be chosen. The model class M then consists of all possible subsets of patterns from the chosen pattern language. Note that we generally use the term model to refer to a specific collection of patterns and the single corresponding probability distribution over the data, whereas from the statistical modeling perspective, model typically refers to a family of probability distributions. Next, (ii) an encoding scheme must be designed, i.e. a mechanism must be engineered to encode patterns of the chosen type and to encode the data by means of such patterns. Finally, (iii) a search algorithm must be devised to explore the space of patterns and identify the best set of patterns with respect to the encoding, i.e. the set of patterns that results in the shortest description length.

Dictionary-based vs. block-based strategies
The crude two-part MDL requires the model (the L(M) part) to be specified explicitly. This means designing an ad-hoc encoding scheme to describe the patterns. This is both a blessing and a curse: it gives leeway to introduce some bias, i.e. to penalise properties of patterns that are deemed less interesting or useful than others by making them more costly in terms of code length, but it can also involve difficult design choices and lead to accusations of "putting your fingers on the scale".
When considering substructure patterns, encoding the data using the model (the L(D | M) part) can be fairly straightforward, simply replacing occurrences of the patterns in the data by their associated codewords. Knowing how many times each pattern X is used in the description of the data D with model M, denoted as usage_{D,M}(X), X can be assigned a codeword of length

− log2( usage_{D,M}(X) / Σ_{Y ∈ M} usage_{D,M}(Y) )

using Shannon-Fano coding. The design choice of how to cover the data with a given set of patterns, dealing with possible overlaps between patterns in M, determines where each pattern is used, defining usage_{D,M}(X). The requirement that the encoding be lossless means that elementary patterns, e.g. singleton itemsets, must be included in the model, to ensure complete coverage. In this case, encoding the model (the L(M) part) consists in providing the mapping between patterns and codewords, typically referred to as the code table. That is, for each pattern in turn, describing it and indicating its associated codeword. Such a code table or dictionary-based strategy, which corresponds to frequent pattern mining approaches, is a common way to proceed.
An alternative strategy is to divide the data into blocks, that might or might not be allowed to overlap, each of which is associated with a specific probability distribution over the possible values and should be as homogeneous as possible. The values within each block are then encoded using a code optimised for the corresponding distribution. In that case, encoding the model consists in indicating the limits of each block and the associated probability distribution. Such a block-based strategy corresponds to segmentation, biclustering and tiling approaches.
These two main strategies, dictionary-based and block-based, which use respectively substructure and block patterns, are an important distinguishing factor that we use to categorise approaches, as depicted in Figure 1 with dotted and dashed lines, respectively. We further illustrate and contrast these strategies on a toy binary dataset shown in Figure 2. The dataset has six columns (i.e. items) denoted A-F, nine rows (i.e. transactions) denoted (1)-(9), and contains twenty-four ones (i.e. positive entries or item occurrences) represented as black squares. In particular, the approaches delineated here to illustrate the two strategies are based on the work of Siebes, Vreeken, and van Leeuwen (2006) and of Smets and Vreeken (2012) (cf. Section 4), on the one hand, and of Chakrabarti, Papadimitriou, et al. (2004) (cf. Section 5.2), on the other hand.
Examples of modeling the toy binary dataset using a dictionary-based approach are provided in Figure 3. The simplest model is considered first (i), which contains as its patterns all singleton itemsets from the dataset and is often referred to as the standard code table (ST). A non-trivial code table is considered next (ii). In both cases, the model is represented as a code table associating patterns, here itemsets, to codewords. Encoding the model means encoding each such pair. On one hand, itemsets are specified by listing the items they contain. This uses Shannon-Fano coding over the alphabet of items associated to their frequency of occurrence in the data, which is assumed to be shared knowledge. On the other hand, codewords are assigned using Shannon-Fano coding over the alphabet of itemsets included in the code table, associated to their usage.

Figure 2: A toy binary dataset, with six columns (i.e. items) denoted A-F, nine rows (i.e. transactions) denoted (1)-(9), containing twenty-four ones (i.e. positive entries or item occurrences) represented as black squares.
In Figure 3, the prefix-free codewords assigned to items and to itemsets are represented by coloured blocks in shades of green and blue, respectively. The width of a block is proportional to the length of the represented codeword. For instance, the first row of the non-trivial code table (bottom-left quadrant of Figure 3) specifies the first itemset in the code table, in this case ADE, listing the three items that constitute it, A, D and E, using the items' codewords (green blocks), then provides the codeword assigned to ADE by the code based on the usage of itemsets selected in this model (blue block).
Note that for the standard code table, these two prefix-free codes are the same because the standard code table consists of all singleton items and their usage is precisely their frequency of occurrence in the data. Therefore, all codewords in the top half of Figure 3 are represented as green blocks. For the same reason, the length of the codeword associated to item x by the first code, which does not depend on the code table, is denoted as L ST (x), whereas the length of the codeword associated to itemset X by the second code, which is different for different code tables, is denoted as L CT (X).
Then, encoding the data using the model simply means replacing itemset occurrences by the corresponding codewords.
In summary, the overall description length can be computed as

L(M, D) = Σ_{X ∈ CT} ( Σ_{x ∈ X} L_ST(x) + L_CT(X) ) + Σ_{X ∈ CT} usage_{D,M}(X) · L_CT(X) ,    (1)

where the first sum corresponds to encoding the model, listing for each itemset in the code table its items and its associated codeword, and the second sum corresponds to encoding the data, replacing each itemset occurrence by its codeword. In this example, the standard code table and the non-trivial code table yield an overall description length of 92.530 bits and 63.333 bits, respectively. By identifying items that occur frequently together, the latter results in a shorter description length, and one can therefore conclude that it constitutes a better model for the dataset, according to the MDL criterion.
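The computation can be sketched as follows. This is a simplified illustration, not the exact encoding of Krimp: the cover is built greedily in the order in which patterns are given, only patterns that are actually used are counted in the model, and the toy database is made up (it is not the dataset of Figure 2), so the resulting numbers differ from those quoted above.

```python
from math import log2
from collections import Counter

def code_table_dl(transactions, patterns):
    """Two-part description length for a dictionary-based (code table) model:
    L(M) lists, for each used pattern, its items plus its codeword; L(D | M)
    replaces every pattern occurrence in the cover by its codeword."""
    items = Counter(i for t in transactions for i in t)
    total_items = sum(items.values())
    # Cover each transaction greedily: given patterns first, then singletons,
    # so that the cover is always complete (lossless encoding).
    cover_order = list(patterns) + [frozenset([i]) for i in sorted(items)]
    usage = Counter()
    for t in transactions:
        left = set(t)
        for X in cover_order:
            if X <= left:
                usage[X] += 1
                left -= X
    total_usage = sum(usage.values())
    data_bits = sum(u * -log2(u / total_usage) for u in usage.values())
    model_bits = 0.0
    for X, u in usage.items():
        model_bits += sum(-log2(items[i] / total_items) for i in X)  # list items
        model_bits += -log2(u / total_usage)                         # codeword
    return model_bits + data_bits

# Made-up toy database, not the dataset of Figure 2.
db = [set("ADE"), set("ADE"), set("ABC"), set("BC"), set("ADEF"), set("F")]
print(round(code_table_dl(db, []), 3))                  # standard code table
print(round(code_table_dl(db, [frozenset("ADE")]), 3))  # shorter, with ADE added
```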
A similar approach can be built by replacing static Shannon-Fano coding with dynamic prequential coding. In that case, encoding the model only requires listing the itemsets, i.e. only the first column of the code table is needed. Then, the data is encoded by enumerating the itemset occurrences using prequential coding (cf. Section 2.1). In practice, the process of updating the codewords does not actually need to be carried through: the description length of the data can be calculated with a (not so simple) formula that does not depend on the order in which the itemsets are transmitted (see Budhathoki and Vreeken, 2015).
Figure 3: Modeling the toy binary dataset with a dictionary-based approach: (i) the simplest model, with all singleton itemsets, a.k.a. standard code table (ST); (ii) a non-trivial code table.

Examples of modeling the toy binary dataset using a block-based approach are provided in Figure 4. The simplest model is considered first (i), which consists of a single block. A non-trivial model dividing the dataset into six blocks is considered next (ii). The approach illustrated here requires the patterns to form a partition of the data. Therefore, any considered model consists of a set of non-overlapping blocks covering the entire dataset and, more specifically, is such that the rows and the columns are divided into disjoint groups. This requirement means that each row and each column belongs to exactly one group, a fact that can be exploited when designing the encoding, as explained below. Each block is associated to a specific probability distribution over the values occurring within it. In this case, the values are zero and one, corresponding to a Bernoulli distribution. Encoding the model then consists in specifying how the rows and columns are grouped and which probability distribution is associated to each block.
Next, we delve deeper into the details of the encoding scheme, to illustrate the choices that are involved in its design. We let
• m and n denote respectively the number of rows and columns in the dataset,
• k and l denote respectively the number of row and column groups,
• m_i and n_j denote respectively the number of rows and columns in block B_{i,j},
• γ_v(B_{i,j}) denote the number of entries in block B_{i,j} equal to v, and
• L_N(x) denote the MDL optimal universal code length for integer x.
The formula for computing the overall description length is provided in Equation 2. The number of rows and the number of columns are transmitted using universal coding, since these values are not bounded a priori, and the order of rows (resp. of columns) is then specified by listing row (resp. column) identifiers in turn, using a fixed-length code (cf. part (a) of Equation 2). In fact, this part of the encoding is independent of the model and has a constant length for a given dataset. Therefore, it does not impact the comparison and can be ignored. The number of row groups k (resp. column groups l) could be transmitted using a fixed-length code of log2(m) (resp. log2(n)) bits, since there cannot be more groups than there are rows (resp. columns). However, universal coding is used instead, to favour partitions with fewer groups. Then, assuming that the numbers of rows and of columns in the groups are sorted and transmitted in decreasing order, upper bounds m*_i and n*_j on m_i and n_j, respectively, can be derived from shared knowledge, since already transmitted values constrain the remaining ones: m*_i = (Σ_{t=i..k} m_t) − k + i, for i = 1, ..., k − 1, and n*_j = (Σ_{t=j..l} n_t) − l + j, for j = 1, ..., l − 1. These upper bounds are used to transmit the numbers of rows and of columns in the groups with fixed-length codes (cf. part (b) of Equation 2). The number of ones in each block can also be transmitted using a fixed-length code, since it takes a value between zero and the number of entries in the block (cf. part (c) of Equation 2).
The data is then encoded using the model. This is done by listing the entries in each block with a prefix-free code adjusted to the probability distribution within the block (cf. part (d) of Equation 2).
In summary, the overall description length can be computed as

L(M, D) = L_N(m) + L_N(n) + m·log2(m) + n·log2(n)    [(a) nb. and order of rows and columns]
    + L_N(k) + L_N(l) + Σ_{i=1..k−1} log2(m*_i) + Σ_{j=1..l−1} log2(n*_j)    [(b) nb. of groups, nb. rows and cols in each group]
    + Σ_{i=1..k} Σ_{j=1..l} log2(m_i·n_j + 1)    [(c) nb. ones in each block]
    − Σ_{i=1..k} Σ_{j=1..l} Σ_{v∈{0,1}} γ_v(B_{i,j})·log2(γ_v(B_{i,j})/(m_i·n_j)) .    [(d) listing entries in each block]    (2)

In Figure 4, shades of orange are used to represent probability distributions within the blocks, with more intense shades representing higher probabilities of ones. In this example, the single-block model and the six-block model yield an overall description length of 118.204 bits and 113.433 bits, respectively. By partitioning the dataset into blocks that are particularly dense or particularly sparse, the latter results in a shorter description length, and one can therefore conclude that it constitutes a better model for the dataset, according to the MDL criterion.
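A sketch of Equation 2 in Python is given below. The function names and the stand-in matrix are ours: the matrix simply has the same dimensions and number of ones as the toy dataset (the single-block description length only depends on these counts), and the variant of the universal integer code L_N is one common choice among several. Under these assumptions, the single-block model comes out at roughly 118.2 bits, in line with the figure quoted above.

```python
import math

def univ_int(x, c0=2.865064):
    # One variant of the universal code length L_N(x) for an integer x >= 1:
    # a constant plus iterated logarithms (summed while they exceed one bit).
    bits = math.log2(c0)
    term = math.log2(x)
    while term > 1:
        bits += term
        term = math.log2(term)
    return bits

def block_dl(matrix, row_groups, col_groups):
    """Description length of a binary matrix under a block model (Equation 2):
    disjoint row and column groups define blocks, each encoded with a code
    adjusted to its own density of ones."""
    m, n = len(matrix), len(matrix[0])
    k, l = len(row_groups), len(col_groups)
    # (a) number and order of rows and columns (independent of the model)
    bits = univ_int(m) + univ_int(n) + m * math.log2(m) + n * math.log2(n)
    # (b) number of groups, and number of rows / columns in each group
    bits += univ_int(k) + univ_int(l)
    ms = sorted((len(g) for g in row_groups), reverse=True)
    ns = sorted((len(g) for g in col_groups), reverse=True)
    bits += sum(math.log2(sum(ms[i:]) - (k - 1 - i)) for i in range(k - 1))
    bits += sum(math.log2(sum(ns[j:]) - (l - 1 - j)) for j in range(l - 1))
    # (c) number of ones in each block, and (d) listing the entries of each block
    for rows in row_groups:
        for cols in col_groups:
            size = len(rows) * len(cols)
            ones = sum(matrix[r][c] for r in rows for c in cols)
            bits += math.log2(size + 1)
            for count in (ones, size - ones):
                if count > 0:
                    bits += count * math.log2(size / count)
    return bits

# A stand-in 9x6 matrix with twenty-four ones (any such matrix gives the same
# single-block description length, since only the counts matter).
toy = [[1 if (r * 6 + c) % 9 < 4 else 0 for c in range(6)] for r in range(9)]
print(round(block_dl(toy, [list(range(9))], [list(range(6))]), 3))  # ~118.2 bits
```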
Removing the partition constraint and allowing overlaps between the block patterns would require to modify the encoding since we could no longer assume that each row and each column belongs to a single group.

Algorithms
The MDL principle provides a basis for designing a score for patterns, but no way to actually find the best collection of patterns with respect to that score. The space of candidate patterns, and even more so the space of candidate pattern sets, is typically extremely large, if not infinite, and rarely possesses any useful structure. Hence, exhaustive search is generally infeasible, heuristic search algorithms are employed, and one must be satisfied with finding a good set of patterns rather than an optimal one. Mainly, algorithms can be divided between (i) approaches that generate a large collection of patterns then (iteratively) select a small set out of it, and (ii) approaches that generate candidates on-the-fly, typically in a levelwise manner, possibly in an anytime fashion. The former approaches are typically less efficient, since they generate many more candidates than necessary, but constitute a useful basis for building a proof-of-concept. Because recomputing costs following the addition or removal of a candidate pattern is often prohibitively expensive, an important component of the heuristic search algorithms consists in efficiently and accurately bounding these costs.
With a dictionary-based strategy, the simplest model consists of all separate basic elements. The search for candidates can thus start from this model and progressively combine elements. Vice versa, with a block-based strategy, the simplest model consists of a single block covering the entire dataset. The search for candidates can thus start from this model and progressively split elements. Intuitively, the two main strategies lend themselves respectively to bottom-up and top-down iterative exploration algorithms.
In summary, to apply the MDL principle to a pattern mining task, one might proceed in three main steps. First, define the pattern language, deciding what constitutes a structure of interest given the data and application considered. Second, define a suitable encoding scheme, designing a system to encode the patterns and to encode the data using such patterns. Third, design a search algorithm, allowing to identify in the data a collection of patterns that yields a good compression under the chosen encoding scheme.
Our main focus here is on the second step, the design of an encoding scheme for the different types of patterns, which is the core of the MDL methodology and its distinctive ingredient. For the most part, we do not discuss search algorithms and are not concerned by issues of performance, which are more generic aspects of pattern mining methodologies.

Theoretical and conceptual background
The MDL principle is maybe the most popular among several similar principles rooted in information theory. The 1948 article by Shannon is widely seen as the birth certificate of information theory. The textbooks by Stone (2013) and by Cover and Thomas (2012) provide respectively an accessible introduction and a more detailed account of information theory and related concepts. The textbook of M. Li and Vitányi (2019), on the other hand, focuses on Kolmogorov complexity. The tutorial by Csiszár and Shields (2004) covers applications of information theory in statistics and discusses the MDL principle in its last chapter.
Shannon, Claude E. (1948). "A Mathematical Theory of Communication". In: Bell System Technical Journal 27.
Li, Ming and Paul Vitányi (2019). An Introduction to Kolmogorov Complexity and Its Applications. 4th ed.

The Minimum Description Length (MDL) principle
The introduction of the Minimum Description Length principle can be dated back to the seminal paper by Rissanen (1978). Works collected in a dedicated volume present the theoretical foundations of the principle, as well as later advances and applications. The textbook by Grünwald (2007) is often regarded as the major reference about the MDL principle.
More recently, Grünwald and Roos (2019) present the MDL principle from the perspective of probabilistic modeling, without resorting to information theory. They review recent theoretical developments, which allow to see MDL as a generalisation of both penalised likelihood and Bayesian approaches. Vitányi and M. Li (2000) (also M. Vitányi and M. Li, 1999) also draw parallels between the Bayesian framework and the MDL principle.
The MDL principle was also partly inspired from the Minimum Message Length (MML) principle, introduced by Wallace and Boulton in 1968 (also Wallace, 2005). The two principles have some important conceptual differences but often lead to similar results. They are discussed and compared by Lanterman (2001), together with other related approaches. Several applications and discussions of the principles are presented as contributions to the 1996 conference on Information, Statistics and Induction in Science (Dowe, Korb, and Oliver, 1996).

General considerations on simplicity, parsimony, and modeling
Several authors have contributed to the discussion on conceptual issues related to complexity in modeling.
Among them, Davis (1996), and later Robinet, Lemaire, and Gordon (2011), examine issues of model selection and parsimony, in relation to human cognition. Rathmanner and Hutter (2011) propose a rather broad but not overly technical discussion of induction in general and Solomonoff's induction in particular, as well as several associated topics. Lattimore and Hutter (2013) discuss algorithmic information theory in the context of the no free lunch theorem, which, simply put, posits that an algorithm that performs well on particular problems must pay for it with reduced performance on other problems, and Occam's razor, a general rule favouring simplicity. Domingos (1999) exposes misconceptions and misuses of Occam's razor in model selection. Bloem, Rooij, and Adriaans (2015) (also Bloem, 2016) suggest to use sophistication as an umbrella term for various concepts related to the amount of information contained in data and discuss different approaches that have been proposed for measuring it.

Compression and Data Mining (DM)
Various approaches using practical data compression as a tool for data mining have been proposed and discussed.
For instance, Keogh, Lonardi, and Ratanamahatana (2004) (also Keogh, Lonardi, Ratanamahatana, et al., 2007) present the Compression-based Dissimilarity Measure (CDM), which evaluates the relative gain obtained when compressing two strings concatenated rather than separately. Inspired by Kolmogorov complexity, the measure is used for mining time series, with the aim of avoiding the large numbers of parameters that many algorithms require. Similarly, Cilibrasi and Vitányi (2005) define a Normalized Compression Distance, which they then use for clustering.
Simovici (2013) (also Simovici et al., 2015) proposes to evaluate the presence of patterns using practical data compression. The aim is to detect whether patterns are present, not to find them.
From a more conjectural perspective, Faloutsos and Megalooikonomou (2007) argue that the strong connection between data mining, compression and Kolmogorov complexity means there is little hope for a unifying theory of data mining.
The term compression as used by Chandola and Kumar (2007) is a metaphor rather than a practical tool. The proposed approach for mining patterns from tables of categorical attributes relies on a pair of ad-hoc scores that intuitively measure how much the considered collection of patterns allows to compact the data, and how much information is lost, respectively.
Keogh, Eamonn, Stefano Lonardi, and Chotirat Ann Ratanamahatana (2004). "Towards parameter-free data mining". In: Proceedings of the 10th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD'04.

Some early uses of MDL in Machine Learning (ML)
The MDL principle has been applied early on in machine learning. Examples include evaluating hypotheses in inductive logic programming, learning Bayesian networks, decision trees (Robnik-Sikonja and Kononenko, 1998), rules, or other related models (Kilpeläinen, Mannila, and E. Ukkonen, 1995), as well as feature engineering (Derthick, 1991; Pfahringer).

Mining and selecting itemsets
Mining frequent patterns is a core task in data mining, and itemsets are probably the most elementary and best studied type of pattern. Soon after the introduction of the frequent itemset mining task (Agrawal, Imieliński, and Swami, 1993), it became obvious that beyond issues of efficiency (Agrawal and Srikant, 1994;Mannila, Toivonen, and Verkamo, 1994), the problem of selecting patterns constituted a major challenge to tackle, lest the analysts drown under the deluge of extracted patterns.
Various properties and measures have been introduced to select itemsets. They include identifying representative itemsets (Bastide et al., 2000), using user-defined constraints to filter itemsets (Guns, Nijssen, and De Raedt, 2013), considering dependencies between itemsets (Jaroszewicz and Simovici, 2004; X. Yan et al., 2005) and trying to evaluate the statistical significance of itemsets (Tatti, 2010), also looking into alternative approaches to explore the search space. Initially focused on individual itemsets, approaches were later introduced to evaluate itemsets collectively, trying for instance to identify and remove redundancy, also in an iterative manner. The goal hence moved from mining collections of good patterns to mining good collections of patterns, which is also the main objective when applying the MDL principle.
Agrawal, Rakesh, Tomasz Imieliński, and Arun Swami (1993). "Mining association rules between sets of items in large databases". In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD'93.

Mining itemsets with Krimp & Co.
A transactional database consists of a collection of sets, called transactions, over a universe of items. The prototypical example for this type of data comes from market-basket analysis, which is also where some of the terminology is borrowed from. Alternatively, a transactional database can be represented as a binary matrix. Frequent itemset mining, that is, finding items that frequently co-occur in a transactional database, is a central task in data mining (cf. Section 3.7).

Krimp
The introduction by Siebes, Vreeken, and van Leeuwen in 2006 of a MDL-based algorithm for mining and selecting small but high-quality collections of itemsets sparked a productive line of research, including algorithmic improvements, adaptations to different tasks, and various applications of the original algorithm. The algorithm, soon dubbed Krimp ( van Leeuwen, Vreeken, and Siebes, 2006), Dutch for "to shrink", is a prime example of a dictionary-based strategy (cf. Section 2.3), illustrated in Figure 3.
Through an evaluation on a classification task, van Leeuwen, Vreeken, and Siebes (2006) show that the selected itemsets are representative. Specifically, considering a labelled training dataset, Krimp is applied separately on the transactions associated to each class to mine a code table. A given test transaction can then be encoded using each of the code tables, and assigned to the class that corresponds to the shortest code length.

Algorithmic improvements
Works building on Krimp include several algorithmic improvements. In particular, the Slim algorithm of Smets and Vreeken (2012) generates candidate itemsets on the fly, by combining itemsets from the current code table, instead of selecting them from a pre-mined collection of frequent itemsets.

Finding differences and anomalies
The analysis of differences between databases and the detection of anomalies are derivative tasks that have attracted particular attention. Vreeken, van Leeuwen, and Siebes (2007a) use Krimp to measure differences between two databases, by comparing the length of the description of a database obtained with a code table induced on that database versus one induced on the other database. The coverage of individual transactions by the selected itemsets, and the specific code tables obtained are also compared. The DiffNorm algorithm introduced by Budhathoki and Vreeken (2015) aims to encode multiple databases at once without redundancy, and allows to investigate the differences and similarities between the databases by inspecting the obtained code table. As a major contribution, Budhathoki and Vreeken (2015) improve the encoding by replacing the Shannon-Fano code used in the original Krimp by a prequential plug-in code (cf. Section 2.1). Smets and Vreeken (2011) use Krimp for outlier detection by looking at how many bits are needed to encode a transaction. If this number is much larger than expected, the transaction is declared anomalous. In addition, the encodings of transactions can be scrutinised to obtain further insight into how they depart from the rest. Akoglu, Tong, Vreeken, et al. (2012) design an algorithm that detects as anomalies transactions having a high encoding cost. Their proposed algorithm mines multiple code tables, rather than a single one as in Krimp, and handles categorical data. Bertens, Vreeken, and Siebes (2015) (also Bertens, Vreeken, and Siebes, 2017) propose a method to detect anomalous co-occurrences based on the Krimp/Slim code tables.

Mining rule sets
Going beyond itemsets, a closely related task is to mine rules. van Leeuwen and Galbrun (2015) propose to mine association rules across a two-view transactional dataset, such that one view can be reconstructed from the other, and vice versa. Fischer and Vreeken (2019) instead consider a unique set of items and aim to mine association rules that allow to reconstruct the dataset, enabling corrections in order to increase the robustness of the results. They then apply the approach to learn rules about activation patterns in neural networks (Fischer, Oláh, and Vreeken, 2021). Aoga et al. (2018) present a method to encode a binary label associated to each transaction, using the original transactions and a list of rules, each associated to a probability that the target variable holds true. Proença and van Leeuwen (2020b) (also Proença and van Leeuwen, 2020a) consider a similar task, but with multiple classes and targeted towards predictive rather than descriptive rules, then looking for rules that capture deviating groups of transactions, i.e. dealing with the subgroup discovery task (Proença, Grünwald, et al., 2020; Proença, Grünwald, et al., 2021). Beside binary attributes, i.e. items, Proença, Grünwald, et al. (2020) also consider nominal and numerical attributes, and aim to predict a single numerical target, modeled using normal distributions, instead of a binary target. Proença, Bäck, and van Leeuwen further extend the approach to more complex nominal and numerical targets by resorting to normalised maximum likelihood (NML) and Bayesian codes, respectively.
Whereas the former two output a set of rules, these latter methods return a list of rules, such that at most one rule, the first valid rule encountered in the list, applies to any given transaction. In all cases, the dataset, or part of it, is assumed to be given and the aim is to reconstruct a target variable, or the rest of the dataset.
van Leeuwen, Matthijs and Esther Galbrun (2015). "Association Discovery in Two-View Data".
Fischer, Jonas, Anna Oláh, and Jilles Vreeken (2021). "What's in the Box? Explaining Neural Networks with Robust Rules". In: Proceedings of the 38th International Conference on Machine Learning, ICML'21.
Proença, Hugo M., Thomas Bäck, and Matthijs van Leeuwen. Robust subgroup discovery. arXiv: 2103.13686.
Proença, Hugo M., Peter D. Grünwald, et al. (2021). Discovering outstanding subgroup lists for numeric targets using MDL. arXiv: 2006.09186.

Other adaptations
Further work also includes extending the Krimp approach to more expressive patterns, such as patterns mined from relational databases (Koopman and Siebes, 2008;Koopman and Siebes, 2009), and using Krimp for derivative tasks. van Leeuwen and Siebes (2008) use Krimp to detect changes in the distribution of items in a streaming setting. If a code table induced from an earlier part of the stream no longer provides good compression as compared to a code table induced from a more recent part of the stream, it is a signal that the distribution has changed. Bonchi, van Leeuwen, and A. Ukkonen (2011) extend the approach to the probabilistic setting, where the occurrence of an item in a transaction is associated to a probability, aiming to find a collection of itemsets that compress the data well in expectation.
Given an original database, Vreeken, van Leeuwen, and Siebes (2007b) use Krimp to generate a synthetic database similar to the original one, for use in the context of privacy-preserving data mining. The code table is induced on the original database, itemsets are then sampled from it and combined to generate synthetic transactions. Vreeken and Siebes (2008) use Krimp for data completion. In a way, this turns the MDL principle on its head. Starting from an incomplete database, rather than looking for the patterns that compress the data best, the proposed approach looks for the data that is best compressed by the patterns and considers that to be the best completion. Instead of a single code table, Siebes and Kersten (2011) look for a collection of code tables, that capture the structure of the dataset at different levels of granularity. Representing a categorical dataset as transactional data by mapping each value of an attribute to a distinct item implies that each transaction contains exactly one item for each categorical attribute, and hence that all transactions have the same length. Siebes and Kersten (2012) consider the problem of smoothing out the small scale structure from such datasets. That is, they aim to replace entries in the data so that its large scale structure is maintained but it can be compressed better.

Tabular data (continued)
Transactional data can be seen as a binary matrix or table, and is hence a form of tabular data. Data tables might also contain categorical or real-valued attributes, which can be either turned into binary attributes as a pre-processing or handled directly with dedicated methods.

More itemsets
Beside Krimp and algorithms inspired from it, different approaches have been proposed to mine itemsets from binary matrices using the MDL principle. Heikinheimo et al. (2009) focus on finding collections of low-entropy itemsets, typically even more compact than those obtained with Krimp. On the other hand, the main objective of Mampaey and Vreeken (2010) is to provide a summary of the dataset in the form of a partitioning of the items. For each part, codewords are associated to the different combinations of the items it contains, and used to encode the corresponding subset of the data. In the Pack algorithm, proposed by Tatti and Vreeken (2008), the code representing an item is made dependent on the presence/absence of other items in the same transaction. This can be represented as decision trees whose intermediate nodes are items and whose leaves contain code tables for other items. Fischer and Vreeken (2020) introduce a rich pattern language that can capture both co-occurrences and mutual exclusivity of items. Mampaey, Vreeken, and Tatti (2012) present a variant of their MTV itemset mining algorithm (cf. Section 9.2) where the MDL principle is used to choose the collection of itemsets that yields the best model, as an alternative to the Bayesian Information Criterion (BIC).
Tatti, Nikolaj and Jilles Vreeken (2008). "Finding Good Itemsets by Packing Data". In: Proceedings of the 8th IEEE International Conference on Data Mining, ICDM'08.

Blocks in binary data
A different type of patterns that can be mined from binary matrices consists of blocks defined by a collection of rows and columns with a homogeneous density of ones, sometimes known as tiles or as biclusters, constituting the basis of block-based strategies (cf. Section 2.3). Chakrabarti, Papadimitriou, et al. (2004) propose a method to partition the rows and columns of the matrix into subsets that define homogeneous blocks. The example presented in Figure 4 follows this approach. Tatti and Vreeken (2012a) introduce the Stijl algorithm to mine a hierarchy of tiles from a binary matrix. The adjectives geometric or spatial typically refer to cases where the order of the rows and columns of the data matrix is meaningful (e.g. Papadimitriou, Gionis, et al., 2005; Faas and van Leeuwen, 2020).

Formal Concept Analysis (FCA)
The field of Formal Concept Analysis focuses on basically the same problem as itemset mining, but from a slightly different perspective and with its own formalism. Some steps have been taken towards applying and further developing MDL-based methods within this framework (Otaki and Yamamoto, 2015; Yurov and Ignatov, 2017). In particular, as a topic in her doctoral dissertation, Makhalova (2021) extensively studied MDL-based itemset mining methods from the perspective of FCA (also Makhalova, Kuznetsov, and Napoli, 2018a; Makhalova, Kuznetsov, and Napoli, 2018b; Makhalova, Kuznetsov, and Napoli, 2019b; Makhalova, Kuznetsov, and Napoli, 2021).

Boolean Matrix Factorisation (BMF)
Factorising a matrix consists in identifying two matrices, the factors, so that the original matrix can be reconstructed as their product. One challenge is to find the right balance between the complexity of the decomposition (generally measured by the size of the factors) and the accuracy of the reconstruction, which is where the MDL principle comes in handy. Whereas other MDL-based factorisation approaches permit reconstruction errors on both values (ones and zeros), Makhalova and Trnecka (2019) do not allow factors to cover zero entries, considering so-called "from-below" factorisations (also Makhalova and Trnecka, 2021). Lucchese, Orlando, and Perego (2014) (also Lucchese, Orlando, and Perego, 2010b; Lucchese, Orlando, and Perego, 2010a) propose an MDL-based score to compare factors of a fixed size.
An example of a factorisation of the toy binary dataset from Figure 2 is provided in Figure 5. The model consists of a pair of factor matrices whereas the data is encoded as a mask of corrections. To reconstruct the original data, the Boolean matrix product of the factors is computed and the mask of corrections is applied to the resulting matrix.
When applied to a Boolean matrix, factorisation shares some similarities with itemset mining, as it aims to identify items that occur (or do not occur) frequently together. That is, the two factor matrices can be interpreted as specifying itemsets and indicators of occurrence, respectively. On the other hand, the factors can also be interpreted as specifying possibly overlapping fully dense blocks. A mask applied to the reconstructed data provides a global error correction mechanism. Thus, Boolean matrix factorisation can be seen as a hybrid approach.

Categorical data
One way to mine datasets involving categorical attributes is to binarise them and then apply, for instance, an itemset mining algorithm. However, binarisation entails a loss of information and dedicated methods can hence offer a better alternative. Mampaey and Vreeken (2013) introduce an approach to detect correlated attributes, i.e. such that the different categories occur in a coordinated manner, whereas X. He, Feng, Konte, et al. (2014) present a subspace clustering method, i.e. look not only for groups of attributes, but also for corresponding groups of rows where coordinated behaviour occurs.
Mampaey, Michael and Jilles Vreeken (2013). "Summarizing categorical data by clustering attributes". In: Data Mining and Knowledge Discovery.

Numerical data
A numerical data table containing m columns can be seen as a collection of points in an m-dimensional space. In pattern mining, numerical data is often handled by applying discretisation, which partitions the data into coherent blocks. While it can be seen as a data exploration task in its own right, providing an overview of the dataset and the distribution of values, discretisation often constitutes a pre-processing step to allow applying algorithms that can handle only discrete input data. Yet, choosing good parameters for the discretisation can be difficult, and its quality can have a major impact on later processing. Unsupervised discretisation, where no side information is available, is in contrast to supervised discretisation, which takes into account class labels and often precedes a machine learning task. Here we focus on the former. Kontkanen and Myllymäki (2007) propose a histogram density estimation method that relies on the MDL principle, formalised using the normalised maximum likelihood (NML) distribution. This method is employed by Kameya (2011) to discretise a time series, seen as a collection of two-dimensional time-measurement data points, and extended by Yang, Baratchi, and van Leeuwen to two-dimensional numerical data more generally. Along similar lines, Nguyen, Müller, et al. (2014) aim to automatically identify a high-quality discretisation that preserves the interactions between attributes. Witteveen et al. (2014) extend the Kraft inequality (cf. Section 2.1) to numerical data and introduce an approach to find hyperintervals, i.e. multidimensional blocks. Makhalova, Kuznetsov, and Napoli (2019a) consider the problem of mining interesting hyperrectangles from discretised numerical data, and aim to design an encoding that accommodates overlaps between patterns (Makhalova, Kuznetsov, and Napoli, 2020; Makhalova, Kuznetsov, and Napoli, 2022). Lakshmanan et al. (2002) formalise mining OLAP data, i.e. multidimensional datasets, as a problem of finding a cover in a multidimensional array containing positive, negative and neutral cells. The aim is then to find the most compact set of hyperrectangles that covers all positive cells, none of the negative cells, and no more than a chosen number of the neutral cells. The score is presented as a generalised MDL due to the tolerance on neutral cells. However, coverings are evaluated by simply counting cells, which does not actually adhere to the principle, generalised or otherwise.

Graphs
In this section, we consider approaches for mining graphs. At their simplest, graphs are undirected and unlabelled, but they can also come with directed edges, with node or edge labels, or be dynamic, that is, time-evolving. The main tasks consist in identifying nodes that have similar connection patterns to group them into homogeneous blocks and in finding recurrent connection substructures. These correspond respectively to block-based and dictionary-based strategies (cf. Section 2.3).
For illustrative purposes, we consider a toy graph, shown in Figure 6, and delineate approaches that follow either strategy. The example shown in Figure 7 illustrates the block-based strategy and follows the work of Chakrabarti (2004) (cf. Section 6.1), whereas the example shown in Figure 8 illustrates the dictionary-based strategy and follows the work of Bariatti, Cellier, and Ferré (2020b) (cf. Section 6.5).
Through its adjacency matrix, a simple unlabelled graph can be represented as a binary table. Approaches from Sections 4 and 5 can thus readily be used for mining graphs. On one hand, the problem of grouping nodes into blocks that constitute particularly dense subgraphs, or communities, is closely related to identifying particularly dense tiles in a binary matrix. On the other hand, approaches that follow a dictionary-based strategy and aim to identify substructures in the graphs share similarities with their counterparts for binary tabular data. However, it is not enough to simply replace the subgraph patterns by their assigned codewords. The information about how the subgraphs are connected also needs to be encoded, requiring more complex encoding schemes.
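To make the connection to block-based encodings concrete, below is a minimal sketch (not any specific published scheme) of a two-part cost for an adjacency matrix under a node partition: the group labels and one density per block form the model, and every cell is coded with its block's binary entropy. The toy graph and the exact parameter costs are purely illustrative.

```python
import numpy as np

def bits_per_cell(p):
    """Binary entropy: optimal code length per cell of a block with density p."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def blockmodel_cost(A, groups):
    """Crude two-part description length of the adjacency matrix A (0/1 array)
    under the node partition `groups` (one group label per node).
    For simplicity the full matrix is encoded, ignoring symmetry."""
    n = A.shape[0]
    labels = np.unique(groups)
    k = len(labels)
    # Model: the group of each node, plus one density parameter per block.
    model_cost = (n * np.log2(k) if k > 1 else 0.0) + k * k * np.log2(n * n + 1)
    # Data given the model: each cell coded with its block's empirical density.
    data_cost = 0.0
    for gi in labels:
        for gj in labels:
            block = A[np.ix_(groups == gi, groups == gj)]
            data_cost += block.size * bits_per_cell(float(block.mean()))
    return model_cost + data_cost

# Toy graph: a dense triangle (nodes 0-2) loosely attached to a pair (3-4).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 1],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 1, 0]])
# Note: on such a tiny graph the single-block model may well be cheaper;
# the point is only the shape of the score.
print(blockmodel_cost(A, np.array([0, 0, 0, 1, 1])))
print(blockmodel_cost(A, np.zeros(5, dtype=int)))
```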

Grouping nodes into blocks
One line of work presents an information-theoretic framework to identify community structure in networks by grouping nodes and proposes to use the MDL principle to automatically select the number of groups in which to arrange the nodes. Chakrabarti (2004) proposes to compress the adjacency matrix of a graph by grouping nodes into homogeneous blocks (see Figure 7), with a top-down procedure to search for a good partition. Navlakha, Rastogi, and Shrivastava (2008) similarly propose to build graph summaries by grouping nodes into supernodes, but with a bottom-up search procedure. A superedge linking two supernodes represents edges between all pairs of elementary nodes from the two supernodes (hence a supernode with a loop represents a clique). When reconstructing the original graph, after expanding the supernodes and superedges, some corrections must be performed, to add and remove spurious edges. The cost of encoding a graph is set equal to the number of superedges and edge corrections, ignoring the cost of the assignment of nodes to supernodes.
Khan, Nawaz, and Y.-K. Lee (2015b) (also Khan, Nawaz, and Y.-K. Lee, 2015a) work with essentially the same encoding, using locality-sensitive hashing (LSH) to identify candidates for merger, additionally considering node labels. Other approaches also aim at grouping nodes while taking into account node attributes; in some, the blocks are arranged into a hierarchy, while others consider k-partite graphs.
Figure 7: Block-based strategy, example on the toy graph of Figure 6. Ignoring labels and considering its adjacency matrix, the graph can be encoded in a very similar way as a binary tabular dataset (see Section 2.3 and Figure 4). Furthermore, since the graph is undirected, and its matrix hence symmetric, it is enough to encode the lower triangular part of the matrix (depicted with solid lines and colour fill), from which the upper triangular part (depicted with dotted lines and hatch pattern) can be reconstructed. More intense shades of orange represent higher probabilities of ones within the corresponding block.
Figure 8: Dictionary-based strategy, example on the toy graph of Figure 6. The idea is similar to the one for binary tabular datasets (see Section 2.3 and Figure 3). However, it is not enough to simply replace the subgraph patterns by their assigned codewords (depicted as blue blocks), the information about how the subgraphs are connected also needs to be encoded. Here, this is done through ports associated to the patterns and their assigned codewords (depicted as tan blocks).
In order to compress the adjacency matrix of an input graph more efficiently, one approach looks for nodes with similar connection patterns, corresponding to similar rows in the matrix, and encodes the differences, possibly in a recursive manner. The approach is used to spot nodes with unusual connection patterns, which do not lend themselves to grouping. Another approach proposes a supernode summary involving superedge weights that represent the probability that an edge exists for each pair of nodes in the incident supernodes. In one variant of the problem, the MDL principle is used to choose the number k of supernodes that strikes the best balance between model complexity (k) and fit to the data (reconstruction error). Other work also considers summarising graphs by grouping nodes together, but fixes a maximum length for the description of the model, i.e. the hypernodes, and looks for the summary that minimises the reconstruction error, measured as the length of the description of edge corrections. A MDL-inspired score has also been used to learn graph embeddings. That is, the aim is to project the nodes into a multi-dimensional space, so that the structure of the graph is preserved as much as possible and, more specifically, such that connected nodes are placed close to each other. The distance between any pair of nodes in the embedding is thus used to compute a probability that the nodes are connected, which, in turn, is used to encode the presence or absence of the corresponding edge. The quality of an embedding can then be measured by how well it allows to compress the adjacency matrix.
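The last idea can be sketched as follows; the logistic link between distance and edge probability, and the parameter values, are illustrative assumptions rather than any particular published formulation.

```python
import numpy as np

def embedding_code_length(A, X, a=1.0, b=0.0):
    """Bits needed to encode the upper triangle of the adjacency matrix A
    when the probability of an edge decreases with the distance between the
    endpoints' positions in the embedding X (one row per node)."""
    n = A.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])
            p = 1.0 / (1.0 + np.exp(a * d + b))   # closer nodes -> higher p
            p = min(max(p, 1e-12), 1 - 1e-12)     # avoid log(0)
            total += -np.log2(p) if A[i, j] else -np.log2(1 - p)
    return total

# A good embedding is then one that makes this quantity small
# (plus the cost of transmitting the embedding itself).
```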

Finding hyperbolic communities
Instead of looking for blocks of uniform density, and motivated by the observation that node degrees in real-world networks often follow a power-law distribution, Araujo, Günnemann, Mateos, et al. (2014) propose the model of hyperbolic communities. The name refers to the fact that when nodes in such communities are ordered by degree, edges in the adjacency matrix mostly end up below a hyperbola. Kang and Faloutsos (2011) (also Lim, Kang, and Faloutsos) decompose the input graph into hubs and spokes, with superhubs connecting the hubs recursively, and introduce a cost to evaluate how well the decomposition allows to compress the graph. This type of decomposition is proposed as an alternative to a decomposition into cliques, referred to as "cavemen communities".

The Map Equation
Rosvall and Bergstrom (2008) propose a method to reveal the important connectivity structure of weighted directed graphs. The approach assigns codes to nodes in such a way that random walks over the graph can be described succinctly. Furthermore, nodes are partitioned into modules so that the codes for nodes are unique within each module but can be reused between modules. A walk over the graph can then be described using a combination of codes indicating transitions between modules and lists of the successive nodes encountered within each module. The resulting two-level summary of the graph maps its main structures and the connections between and within them, and the approach is therefore referred to as the Map Equation (Rosvall, Axelsson, and Bergstrom, 2009) or the Infomap algorithm.
Later, refinements and extensions of the method were proposed, to study changes in the connectivity structure over time (Rosvall and Bergstrom, 2010), reveal multi-level hierarchical connectivity structure (Rosvall and Bergstrom, 2011), support overlaps between modules (Viamontes Esquivel and Rosvall, 2011), among others (Bohlin et al., 2014; De Domenico et al., 2015; Edler, Bohlin, and Rosvall, 2017; Emmons and Mucha, 2019; Calatayud et al., 2019). In particular, the method has found application in the analysis of ecological communities (Edler, Guedes, et al., 2017; Blanco et al., 2021; Rojas et al., 2021).

Identifying substructures
Cook and Holder (1994) (also Ketkar, Holder, and Cook, 2005) propose the Subdue algorithm to mine substructures from graphs, possibly with labels, using the MDL principle. A substructure of the graph can be encoded and its occurrences in the graph replaced by a single node. This can be done recursively, generating a hierarchical summary of the original graph. There are two shortcomings to the approach. First, replacing the substructure by a single node does not preserve the complete information about the connections to neighbours. Second, the matching of substructures is done in an approximate way, with an arbitrary fixed cost, rather than a proper encoding of the reconstruction errors (using the MDL principle for this evaluation is left for future work). In addition, substructures are scored individually rather than in combination. The Subdue algorithm is used by Jonyer, Holder, and Cook (2004) for the induction of context-free grammars, and by Bloem (2013) in comparative experiments against practical data compression with the GZIP algorithm. Bloem and Rooij (2018) (also Bloem and Rooij, 2020) propose to use the MDL principle when evaluating the statistical significance of the presence of substructures in a graph.
The VoG algorithm presented by Koutra et al. (2014) allows to decompose the graph into basic primitives such as cliques, stars, and chains, which can overlap on nodes (but not on edges). Error corrections are then applied, to add and remove spurious edges. This can be seen as a global use of primitives. Liu, Shah, and Koutra (2015) (also Liu, Safavi, and Shah, 2016) use the MDL principle to compare the ability of VoG and graph clustering methods to generate graph summaries. Liu, Safavi, Shah, and Koutra (2018) build on VoG and address some of its shortcomings, such as the bias towards star structures, the inability to exploit edge overlaps, and the dependency on candidate order. Goebl et al. (2016) introduce a similar approach, with some of the same primitives, but prohibiting overlaps with the aim to make visualisation and interpretation easier. The approach presented by Bariatti, Cellier, and Ferré (2020b) removes the limitation to a predefined set of primitives and considers labelled graphs (see Figure 8). The authors later upgraded the approach by generating candidates on-the-fly, thereby providing an anytime mining algorithm (Bariatti, Cellier, and Ferré, 2021), and proposed a visualisation tool for the obtained graph patterns (Bariatti, Cellier, and Ferré, 2020a). This work forms the basis of a doctoral dissertation (Bariatti, 2021).
The approach proposed by Coupette and Vreeken (2021) aims to highlight similarities and differences between graphs, and is akin in spirit to the DiffNorm algorithm for transactional data (cf. Section 4.3). It looks for a common model consisting of basic primitives, like those used in VoG, such that each graph can be reconstructed based on these primitives adjusted through parameters specific to the graph, as well as additional structures, where necessary.

Identifying substructures in dynamic graphs
Shah, Koutra, Zou, et al. (2015) (also Shah, Koutra, Jin, et al., 2017) extend the VoG approach to dynamic graphs. More specifically, they incorporate the temporal aspect of substructures appearing only at given time steps, across a range of contiguous time steps, periodically, or in a flickering fashion. Therefore, in addition to decomposing the graph into basic structures, one needs to indicate when these structures appear. The Mango algorithm by Saran and Vreeken (2019) also looks for predefined structures in a dynamic graph, aiming more specifically at tracking their evolution through time.

Finding pathways between nodes
Given a large graph, Akoglu, Chau, et al. (2013) consider the problem of describing a given set of marked nodes. This can be done by listing the node identifiers or by navigating between nodes. The latter strategy requires choosing among the limited number of neighbours of each traversed node, rather than among all possible nodes in the graph, potentially leading to shorter descriptions. In particular, the problem formulated by Akoglu, Chau, et al. (2013) consists in finding the best collection of trees spanning the marked nodes in the graph. The graph as such is not encoded; it is regarded as shared knowledge. Prakash, Vreeken, and Faloutsos (2014) similarly assume shared knowledge of the graph. The aim is then to transmit the starting points and spread of an epidemic through the graph over a sequence of time steps, assuming a "susceptible-infected" (SI) epidemic model.
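The trade-off between listing identifiers and navigating the graph can be illustrated with a back-of-the-envelope comparison; all numbers below are hypothetical.

```python
from math import comb, log2

n = 10_000          # nodes in the (shared) graph
k = 20              # marked nodes
avg_degree = 12     # average number of neighbours at each step

# Option 1: list the identifiers of the marked nodes directly.
list_cost = log2(comb(n, k))

# Option 2: walk over the marked nodes; each hop only needs to choose
# among the neighbours of the current node (plus one start node).
path_length = 30    # hypothetical number of traversed edges spanning the marks
navigate_cost = log2(n) + path_length * log2(avg_degree)

print(f"listing: {list_cost:.0f} bits, navigating: {navigate_cost:.0f} bits")
```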

Temporal data
In this section, we look at data where the attribute values come as a sequence, i.e. in a specific order. In particular, this order might correspond to time, in which case the data is called temporal. In some cases only the order matters, whereas in other cases absolute positions are associated to the values, such as timestamps in temporal data. In addition to time, spatial dimension(s) might be associated to the values, resulting in spatio-temporal data. The terms sequential data and sequence are sometimes used to refer more narrowly to sequences of discrete attributes or items, which are typically called events. On the other hand, the term timeseries is generally used to refer to real-valued attributes sampled at regular or irregular time intervals. Text and genetic data (such as DNA or RNA sequences) fall into the former category. More specifically, such data generally comes in the form of strings, that is, as sequences of characters that represent occurrences of single items where the order is meaningful, not the positions. The data might consist of a single long sequence or of a database of multiple, typically shorter, sequences.
As with other types of data, most of the work on mining sequential data can be divided into two main tasks, namely segmentation and frequent pattern mining, corresponding to block-based and dictionary-based strategies, respectively (cf. Section 2.3). In segmentation problems (a.k.a. change point detection), the aim is to divide the input data into homogeneous blocks or segments, each associated to specific occurrence probabilities of the different events. On the other hand, in frequent pattern mining, the aim is to find recurrent substructures, which are commonly referred to as episodes and motifs when considering sequences and timeseries, respectively.
For illustrative purposes, we consider a toy sequence, shown in Figure 9, and delineate approaches that follow either strategy. The example shown in Figure 10 illustrates the block-based strategy and follows the work of Kiernan and Terzi (2008) (cf. Section 7.2), whereas the example shown in Figure 11 illustrates the dictionary-based strategy and follows the work of Tatti and Vreeken (2012b) (cf. Section 7.5). While similar to their counterparts for binary tabular data, approaches for temporal data must account for the order that the special dimension of time imposes on the occurrences.

Finding haplotype blocks
A haplotype, or haploid genotype, is a group of alleles of different genes on a single chromosome, which are closely linked and typically inherited as a unit. Several works have been dedicated to the problem of finding haplotype block boundaries, i.e. identifying block structure in genetic sequences (see for instance Koivisto et al., 2002). This requires jointly partitioning multiple aligned strings.

Segmenting sequences
Several approaches have also been developed for segmenting event sequences more in general.
The method introduced by Kiernan and Terzi (2008) partitions a sequence into time segments, then partitions the events of each segment into groups (see Figure 10). The proposed algorithm is then extended to allow overlaps and gaps between segments (Kiernan and Terzi, 2009a) and a tool to visualise the obtained segmentation is proposed (Kiernan and Terzi, 2009b). Later work further aims to model dependencies between segments.
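The flavour of such segmentation scores can be illustrated with a simple two-part cost, computed here on the toy sequence of Figure 9; this is a generic sketch, not the exact encoding of Kiernan and Terzi (2008).

```python
from math import log2

def segmentation_cost(seq, boundaries, alphabet):
    """Two-part cost of an event sequence split at the given boundary positions;
    each segment encodes its events with its own empirical distribution."""
    n, m = len(seq), len(alphabet)
    cuts = [0] + sorted(boundaries) + [n]
    cost = log2(n + 1) * (len(cuts) - 1)          # transmit the boundaries
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        segment = seq[lo:hi]
        cost += m * log2(len(segment) + 1)        # per-segment event counts
        for e in alphabet:
            c = segment.count(e)
            if c:
                cost += c * -log2(c / len(segment))
    return cost

seq = list("xyyxzxyxyzzzzxzyxzz")                 # the toy sequence of Figure 9
print(segmentation_cost(seq, [9], "xyz"))          # two segments
print(segmentation_cost(seq, [], "xyz"))           # a single segment
```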
Another algorithm partitions the alphabet into subsets, then separately encodes the sequence projected on each subset of symbols, as well as a sequence that maps each position to the corresponding subset. Chen, Amiri, and Prakash (2018) adopt a generic point of view on sequence segmentation, considering that the input data can be either univariate or multivariate and consist of categorical or real-valued variables, with no assumption on the underlying distribution. Yet another approach aims to segment a sequence in such a way that each segment contains a collection of recurrent "signature" events. In particular, the authors consider retail data and apply the approach to the sequences of transactions of individual customers, in order to analyse their shopping behaviour.
Figure 9: A toy sequence, x y y x z x y x y z z z z x z y x z z, with nineteen consecutive occurrences over three distinct events x, y and z (top), also represented as a binary matrix (bottom) indicating which event (row) occurs at a given time step (column).
Figure 10: Block-based strategy, example on the toy sequence of Figure 9. The sequence can be encoded in a very similar way as a binary tabular dataset (see Section 2.3 and Figure 4). In this case, however, the order of the columns corresponds to time and is therefore fixed, but the events (i.e. rows) can be arranged into different groups in the different time segments (i.e. column groups). More intense shades of orange represent higher probabilities of ones within the corresponding block.
Figure 11: Dictionary-based strategy, example on the toy sequence of Figure 9. The idea is similar to the one for binary tabular datasets (see Section 2.3 and Figure 3). Episode patterns are assigned codewords (depicted as blue blocks). The original sequence is reconstructed by reading codewords from the pattern stream (P) and inserting the events of the corresponding episode into the sequence, in order. In addition, each multi-event episode in the code table is associated to an additional pair of codewords (depicted as bronze blocks). These codewords, read from the gap stream (G), indicate whether to insert the next element from the current episode (1) or to leave a gap where to insert the next episode (0).

Mining substrings
The algorithm devised by Evans, Markham, et al. (2006) (also Markham et al., 2009) searches for the best set of substrings to encode an input string according to the proposed Optimal Symbol Compression Ratio (OSCR) (Evans, Saulnier, and Bush, 2003). The algorithm, which has been applied primarily to analyse genetic sequences (Evans, Kourtidis, et al., 2007), is iterative, at each step picking the substring that compresses most and replacing it by a temporary code. Selected substrings can be recursive, in the sense that they contain previously selected substrings. In the end, the selected substrings are assigned codes using Huffman coding.
Lam, Mörchen, et al. (2012) propose to encode timestamped sequences with absolute positioning. That is, the positions of covered occurrences are listed separately for each singleton event or selected subsequence. A fixed-length code is used, so all elements (event or position) cost the same and, in particular, occurrences can appear arbitrarily far apart with no penalty. Follow-up work (Lam, Mörchen, et al., 2014) focuses on strings. The proposed algorithms have the same names (SeqKrimp and GoKrimp), but use a different encoding mechanism. Specifically, having constructed a dictionary mapping subsequences to codewords, each match of a selected subsequence is replaced by its associated codeword, followed by Elias codes indicating the gaps between occurrences of the successive events of the subsequence. For a given subsequence, a subroutine is proposed to find the matches with minimum gap cost. A similar problem has also been considered in a streaming setting, with an encoding that points back to the previous occurrence of the subsequence, together with a flag to indicate when an extended subsequence should be recorded as new, that is, added to the dictionary.

Mining episodes from sequences
The Sqs ("squeeze") algorithm of Tatti and Vreeken (2012b) follows a dictionary-based strategy and is similar to Krimp but for sequences (see Figure 11). Each selected subsequence, or episode, is assigned a codeword representing it, as well as a pair of codewords representing gap (move to next position) and fill (insert event) operations. Gaps are allowed but not interleaving. In other words, gaps must be filled by singletons. This work is then extended in multiple ways, to take into account an ontology over the events, resulting in algorithm Nemo  and by adding support for rich interleaving and choice of events in patterns, resulting in algorithm Squish .
After focusing on the analysis of seismic data, aiming to cluster and compare seismograms represented as multiple aligned sequences, Bertens, Vreeken, and Siebes (2016a) consider multivariate event sequences more in general and propose algorithm Ditto, which can be seen as an extension of Sqs to handle multivariate patterns. The work constitutes the basis of a dissertation focused on detecting anomalies and mining multivariate event sequences, also in combination, i.e. employing Ditto to detect anomalies in such sequences (Bertens, 2017, Chapter 7). Compression and the Sqs algorithm have also been used to analyse the similarities between sequence databases in terms of occurring sequential patterns, focusing mostly on text data. The resulting algorithm, called SqsNorm, provides for sequential data the type of analysis that DiffNorm allows for transactional data (cf. Section 4.3).
The approach proposed by Wiegand, Klakow, and Vreeken (2021) is clearly related to Sqs and Squish but aims to summarise entire complex event sequences, rather than capturing fragmentary behaviour. The models considered resemble Petri nets or finite state machines and specify conditional transitions between events. The data is then represented as a succession of instructions that, when fed through the model, allow to reconstruct the original event log. The authors later present a similar model called event-flow graph (Wiegand, Klakow, and Vreeken, 2022). Instead of pattern nodes, this model involves rules defined over attribute vectors associated to the sequences. Other work aims to identify sequential patterns that reliably predict the impending occurrence of an event of interest, in other words, looking for a set of rules, but in sequential rather than transactional data (cf. Section 4.4). Along similar lines, Bourrand et al. (2021a) (also Bourrand et al., 2021b) aim to discover a compact set of sequential rules from a single long event sequence. A generative probabilistic model of sequence databases has also been proposed; its authors discuss the connection between probabilistic modeling and description length, and compare their algorithm, which is not based on the MDL principle, to Sqs and GoKrimp.

Mining motifs from timeseries
Tanaka and Uehara (2003) (also Tanaka, Iwamoto, and Uehara, 2005) look for motifs in timeseries, representing discretised values as integers using a fixed-length code. Shokoohi-Yekta et al. (2015) consider the problem of extracting rules from timeseries, aiming to match shapes rather than the precise values. The proposed score is used to evaluate the consequent of candidate rules, allowing to compare consequents of different lengths. The score evaluates the compression gain resulting from specifying the motif once and then listing the errors for each occurrence, instead of listing the actual values for each occurrence. This is applied to evaluate candidates individually, not as a set.

Mining periodic patterns
Exploiting regularities not only about what happens, that is, finding coordinated event occurrences, but also about when it happens, that is, finding consistent inter-occurrence time intervals, can allow to further compress the data.
In the context of "smart homes" and health monitoring, Heierman, Youngblood, and Cook (2004) look for periodically repeating events or sets of events in a sequence, with a MDL criterion to identify interesting candidates, which are then used to automatically construct a Markov model (HPOMDP) (also Heierman and Cook, 2003;Das and Cook, 2004;Youngblood et al., 2005). The work of Rashidi and Cook (2013) shares the same context and goal, further accounting for discontinuities in the repetitions and variations in the order of the events. Galbrun et al. (2018) introduce patterns involving nested periodic recurrences of different events and a method for constructing them by combining simple cycles into increasingly complex and expressive patterns.

Trajectories
Phan et al. (2013) consider data that represent a collection of moving objects, each associated at each timestamp with a geospatial position. As a pre-processing step, the objects must be clustered based on proximity, separately for each timestamp. An object is allowed to belong to several clusters at any given timestamp. The goal is then to find a sequence of clusters (at most one per timestamp) having objects in common. The result is called a swarm and is intended to represent objects moving together. When encoding the trajectory of an object (as a sequence of clusters), patterns can be used whole, or from/to an intermediate position.
Also considering spatio-temporal data but in a different scenario, Zhao et al. (2019) aim to mine frequent patterns in trajectories over a road network. Mapping the trajectories to the corresponding road segments effectively turns the problem into a frequent sequence mining problem. The MDL principle is used to formulate a problem of trajectory spatial compression, addressed using a dictionary-based strategy.

Discussion
Our goal here is to open a discussion on issues relevant to MDL-based methods for pattern mining. In particular, the design of the encoding is a crucial ingredient when developing such a method. We consider different questions that might be raised by the involved choices, regarding, in particular, conformity and suitability with respect to the MDL principle. To illustrate the discussion, we point to various works listed in the previous sections.

Encoding in question
Alleged infractions of the MDL principle can be of different kinds and degrees of severity. They include cases where (i) the assignment of codewords ignores information theory, (ii) the proposed encoding is not functional because some information is missing, and (iii) the proposed encoding clearly cannot achieve a good compression due to the presence of unnecessary, unjustified terms.
Several methods assign the same, typically unit, cost to all encoded elements (which might be items, nodes, edges, events, timestamps, etc. depending on the case, or even entire patterns), so that the description length is simply the number of encoded elements (see for instance some of the approaches in Section 6.1, and Phan et al., 2013; Zhao et al., 2019 in Section 7.8). Some authors motivate this choice by the need to avoid penalising large values, or to circumvent other encoding issues (see for instance Lam, Mörchen, et al., 2012 in Section 7.5). In one method (cf. Section 7.5), for instance, the cost of a pattern is first defined to be the number of characters used to represent it, that is, the number of events plus the number of timestamps. In a second version of the method, this cost is then defined to be equal to one for all patterns, reportedly to avoid bias against patterns involving more events.
Transmitting the dataset through a binary communication channel as efficiently as possible is a thought experiment of sorts that motivates the score. If short codewords are assigned to specific elements because they are deemed more valuable and useful, then other elements will have to be assigned longer codewords, because not everything can be transmitted cheaply. One can think of it as the fundamental limits of information theory, through this compression scenario, forcing the designer of the method to make choices as to what they consider important and interesting. One might argue that using unit costs corresponds to using a fixed-length code and rescaling everything for convenience. This indeed simplifies the design of the encoding, but it avoids making such decisions, and in this sense short-circuits the principle.
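The contrast is easy to see on a toy alphabet with hypothetical frequencies: under a fixed-length code every element costs the same, whereas code lengths that follow the frequencies make common elements cheap and rare ones expensive.

```python
from math import log2

counts = {"a": 70, "b": 20, "c": 9, "d": 1}     # hypothetical element frequencies
total = sum(counts.values())

fixed = {e: log2(len(counts)) for e in counts}              # 2 bits for every element
shannon = {e: -log2(c / total) for e, c in counts.items()}  # frequent -> short codeword

for e in counts:
    print(e, f"fixed: {fixed[e]:.2f} bits, frequency-based: {shannon[e]:.2f} bits")
```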
Small coding elements, such as those required to delimit patterns in the code table, for example, are often omitted. This is sometimes done deliberately, putting forth, in particular, the use of a pre-defined framework to be filled with the relevant values, which is common to all models and can therefore be ignored (see for instance Vreeken, van Leeuwen, and Siebes, 2011 in Section 4, and van Leeuwen and Galbrun, 2015 in Section 4.4), but it is sometimes left unexplained and might seem accidental. In the approach proposed by Lam, Mörchen, et al. (2014) (cf. Section 7.5), it is unclear how the receiver knows where the codewords end when decoding the dictionary. Similarly, Tanaka and Uehara (2003) (cf. Section 7.6) use a fixed-length code to encode values from a set, but the number of distinct values, which varies for different models, is not transmitted, so that the receiver cannot deduce the codeword length, and hence cannot decode the message.
More substantial pieces might also be missing. For instance, some proposed encodings (cf. Sections 7.2 and 7.3) do not account for the transmission of the assignment of symbols to subsets and of the mapping of offset values to codewords for the corrections, respectively, even though these are needed to reconstruct the data.
Explanations about the encoding are sometimes kept at the level of intuitions, and the details provided can be insufficient to properly understand how it works (see for instance Khan, Nawaz, and Y.-K. Lee, 2015b in Section 6.1, Faloutsos, 2014 in Section 7.3, Heierman, Youngblood, and Cook, 2004; Rashidi and Cook, 2013 in Section 7.7, and Phan et al., 2013 in Section 7.8).
Arguably, ensuring decodability would in some cases require only minor modifications of the encoding scheme, and would likely have no major impact on the results. Furthermore, how much effort should be spent ensuring that the proposed encoding works is debatable, since it will never be used in practice.
The choice of encoding can sometimes seem sub-optimal or ill-suited, or introduce undesirable bias. For instance, Navlakha, Rastogi, and Shrivastava (2008) (cf. Section 6.1) list edge corrections that should be applied to the reconstructed graph, indicating for each one the sign of the correction. It seems, however, that this information is unnecessary, as it can be inferred from the reconstructed graph, by checking whether the edge is present (and must be deleted) or absent (and must be added). Another approach (cf. Section 7.3) encodes a list of value corrections using Huffman coding, meaning that having few distinct but recurrent error values is rewarded, not necessarily small ones. Using a universal code for the corrections would instead encourage small error values, which might be more intuitive. In any case, it is advisable to lay bare and motivate the potential biases introduced by the choice of encoding, whenever possible.
Considering a sequence, one approach (cf. Section 7.5) encodes the occurrence of an event or subsequence by pointing back to the position of its first occurrence. Pointing back, instead, to the position of the last encountered occurrence would require encoding smaller values and might lead to savings. Keeping track of the order in which the patterns were last encountered and referring to the position in that list, so that repetitions of the same pattern do not fill up the list, is another alternative. There are often different ways to achieve the same purpose, not necessarily with a clear overall best choice. In addition to pointing back to previous occurrences, the method maintains a dictionary of patterns. It is unclear whether the dictionary is actually needed for the encoding, or is primarily used to recover the encountered patterns.
What is part of the encoding of the model and what is part of the encoding of the data given the model is sometimes not entirely obvious. For example, one algorithm (cf. Section 7.2) encodes a sequence by partitioning the alphabet and considering separately the subsequences over each subset of symbols. The authors present the term that corresponds to the assignment of positions to subsets as part of the encoding of the model. Debatably, it can be considered instead as part of the encoding of the data given the model, while the assignment of symbols to subsets, which is ignored, would belong to the encoding of the model. Besides, encodings often actually consist of three terms: (i) a description of the set of patterns (the model), (ii) information to reconstruct the data using these patterns, and (iii) a list of corrections to apply to the reconstructed data in order to recover the original data, with the latter two together representing the data given the model.

Code of choice
Prequential plug-in codes, and refined codes more in general, provide means to avoid unwanted bias arising from arbitrary choices in the encoding (cf. Section 2).
For instance, Budhathoki and Vreeken (2015) use prequential coding for the itemset occurrences in the DiffNorm algorithm (cf. Section 4.3). The choice is especially relevant in this scenario, where the goal is to contrast the itemset make-up of different datasets, and not to inspect the usage of itemsets in a particular dataset. Wiegand, Klakow, and Vreeken (2021), among others, use prequential coding for the streams that contain information about pattern occurrences (cf. Section 7.5). Other recent works (Faas and van Leeuwen, 2020; Makhalova, Kuznetsov, and Napoli, 2020; Bloem and Rooij, 2020, cf. Sections 5.2, 5.6 and 6.5, respectively) also use prequential coding, while Bertens, Vreeken, and Siebes (2016a) and Hinrichs and Vreeken (2017) (cf. Section 7.5) both explicitly suggest upgrading their current encoding with a prequential code, as a direction for future work. Going further, Proença, Bäck, et al. improved on their earlier work (Proença and van Leeuwen, 2020a) (cf. Section 4.4) by replacing prequential coding, which is only asymptotically optimal, with normalised maximum likelihood (NML), which is optimal for fixed sample sizes, employing similar techniques as Kontkanen and Myllymäki (2007) (cf. Section 5.6).
However, modern Bayesian and NML codes can be challenging to compute, or even downright infeasible. Furthermore, one-part codes can be less intuitive than two-part codes, and do not provide as direct an access to information about pattern usage. For instance, Mampaey and Vreeken (2010) (cf. Section 5.1) compare two encodings, with and without prequential coding, and, obtaining similar results, choose to proceed with the latter as it is more intuitive. All in all, modern refined codes have improved theoretical properties, but using them to build better methods comes with some challenges.

The letter or the spirit
Some approaches use the MDL principle to score and compare individual candidate patterns, rather than evaluating them in combination (see for instance Cook and Holder, 1994 in Section 6.5, Shokoohi-Yekta et al., 2015 in Section 7.6, as well as Heierman, Youngblood, and Cook, 2004;Rashidi and Cook, 2013 in Section 7.7).
Considering a two-view dataset, i.e. a dataset consisting of two tables, the approach proposed by van Leeuwen and Galbrun (2015) assumes knowledge of one table to encode the other, and vice versa. Arguably, this approach, like other MDL-based approaches, does not correspond to a practical encoding, but here it does not correspond to a realistic compression scenario either; still, it serves as a reasonable motivation for the proposed score.
The proposed score might actually be entirely ad-hoc, in the sense that it does not correspond to the length of an encoding that could be used to represent the data (see for instance Makhalova, Kuznetsov, and Napoli, 2019a in Section 5.6). One might reasonably devise and justify an evaluation measure suited to the problem at hand, but labelling it as following the MDL principle is arbitrary and inappropriate, short of an explanation of how this corresponds to encoding, and can only lead to confusion.
Authors sometimes approach the topic with caution and include disclaimers stating that their proposed methods are inspired by or in the spirit of the MDL principle (see for instance Shokoohi-Yekta et al., 2015 in Section 7.6). This can be seen as a way to allow oneself to take some liberties with the principle, indeed considering it as a source of inspiration rather than as law, but also as a way to preventively fend off criticism and accusations of heresy. There is indeed a range of opinions about how closely one must conform to the MDL principle and to information theory.

Making comparisons
How to make meaningful comparisons between compression-based scores and between corresponding results requires careful consideration. For instance, one might ponder whether the compression achieved for a dataset is an indication of how much structure is present in it, or at least how much could be detected, and to what extent it can serve as a measure of the performance of the algorithm.
Does it make sense to compare the length of a dataset encoded with the proposed scheme to the length of the original, unencoded data? And is it a problem if the latter is shorter? Keeping in mind that compression is used as a tool for comparing models, rather than for practical purposes, we answer both questions in the negative. The compression achieved with the simplest model, be it the code table containing only elementary patterns such as the singleton itemsets, known as the standard code table in Krimp (Vreeken, van Leeuwen, and Siebes, 2011 in Section 4), for dictionary-based approaches (cf. Figure 3(i)), or the single-block model for block-based approaches (cf. Figure 4(i)), is often taken as a basis for comparison. The ratio of the compression achieved with a considered model to the compression achieved with the elementary model, known as the compression ratio, is then computed and used to compare different models, with lower compression ratios corresponding to better models. This is a way to normalise the scores and allow more meaningful comparisons and evaluations.
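In other words, with hypothetical description lengths, the quantity typically reported is the following:

```python
L_model = 1_850.0        # bits: L(M) + L(D | M) for the model under consideration
L_elementary = 2_600.0   # bits: same encoding with the elementary (singleton / one-block) model
compression_ratio = L_model / L_elementary
print(f"compression ratio: {compression_ratio:.2f}")   # lower is better
```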
A direct comparison of the raw description lengths, in terms of numbers of bits, of the same data encoded with different methods is typically not meaningful. For instance, it does not make sense to compare the description lengths reported in Figure 3 to those reported in Figure 4. Comparing compression ratios across different methods is not really meaningful either in general. Indeed, an easy way to win this contest would be to design an artificial encoding that penalises very heavily the use of elementary patterns. If the different methods handle compatible pattern languages, comparing the compression ratios achieved when considering as model, in turn, the set of patterns selected by each method and applying either encoding can be of interest, and might shed some light on the respective biases of the methods. If the pattern languages are not compatible, then no quantitative comparison can be devised easily and great care must be taken to choose suitable encodings. Qualitative evaluations of obtained patterns are valuable, despite being subjective and domain dependent. In the end, finding a good set of interesting and interpretable patterns is what matters.

Beyond mining patterns with MDL
In this penultimate section, we highlight approaches that do not fall strictly within the category of MDL-based pattern mining methods, yet are clearly related, constitute recently active and fruitful research topics, and might therefore be of interest to the reader. First, we highlight studies of correlation and causality that build on algorithmic information theory in general and, for a few of them, on MDL-based pattern mining techniques more in particular. Second, we outline a framework for pattern mining that relies on a different modeling approach, namely on maximum entropy modeling.

Correlation and causality
A core data analysis problem consists in detecting the presence, measuring the strength, and inferring the direction of dependencies between variables in an observational dataset. Various methods have been proposed to discover correlated variables and infer the causal structure of a dataset (Pearl, 2009). In particular, efforts have focused on applying the tools of algorithmic information theory (cf. Section 3.2) to these questions, aiming to increase the scalability of the developed methods and reduce their reliance on assumptions about the underlying probabilities and the shape of the relationship linking the variables.
Simply put, looking at how much can be saved by compressing two objects together rather than separately can be used to measure the strength of their correlation. Furthermore, given a pair of objects, comparing how well the first can be compressed given the second, and vice versa, provides an indication about the direction of causality between the objects. More formally, a central principle in causal inference states that if x causes y, it is easier to describe y using x than the other way around (Pearl, 2009). This principle can be formalised in terms of the Kolmogorov complexity (cf. Section 3). Specifically, the conditional Kolmogorov complexity of object x given object y, denoted K(x | y), is the length of the shortest program that generates x and halts, having access to the information in y. Then, if x causes y, we expect that there exists a shorter algorithm to describe y given x than the other way around, and hence K(y | x) < K(x | y). However, the Kolmogorov complexity is not computable, and various practical instantiations have been proposed, based for instance on the cumulative and Shannon entropies or on the MDL principle.
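The intuition can be sketched with an off-the-shelf compressor standing in, very crudely, for the incomputable Kolmogorov complexity; the data, the concatenation-based conditional proxy and the absence of any significance test are all simplifying assumptions made for illustration.

```python
import zlib

def C(b: bytes) -> int:
    """Length of the zlib-compressed representation, a crude proxy for K(.)."""
    return len(zlib.compress(b, 9))

x = bytes(i % 17 for i in range(2000))            # hypothetical sequence
y = bytes((v * v + 3) % 251 for v in x)            # y is a function of x

# Correlation: how much is saved by compressing the objects together?
saving = C(x) + C(y) - C(x + y)
# Direction: conditional complexity approximated by C(concatenation) - C(condition).
x_given_y = C(y + x) - C(y)
y_given_x = C(x + y) - C(x)
print("saving:", saving, "| K(y|x) proxy:", y_given_x, "| K(x|y) proxy:", x_given_y)
```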
In particular, Budhathoki and Vreeken (2017a) propose two algorithmic correlation measures and present practical instantiations based on the MDL principle, using the Slim and Pack algorithms (cf. Sections 4.2 and 5.1, respectively). Budhathoki and Vreeken (2018c) introduce the Origo algorithm to infer the direction of causality between binary variables, also relying on the Pack algorithm (cf. Section 5.1) to instantiate the MDL score.
Several methods have been proposed to infer the direction of causality between pairs of variables using a MDL score, for discrete variables with refined MDL (Budhathoki and Vreeken, 2017b), as well as using classification and regression trees or global and local regression functions (Marx and Vreeken, 2019c). Given a collection of variables X_1, ..., X_m, and Y, other work aims to tell whether the X variables jointly cause Y, or whether there is an unobserved confounding variable, the real parent, using probabilistic principal component analysis (PCA) and a MDL score. A related method learns causal graphs where all edges are directed, using multivariate regression and a MDL score. Marx and Vreeken (2022) aim to elucidate the link between MDL-based estimators and the postulate of algorithmic independence of conditionals that underpins this line of approaches.
Other methods have been presented that rely not on MDL but on other information-theoretic scores, such as the Shannon entropy or mutual information, and aim to infer the direction of causality between pairs of variables (see for instance Budhathoki and Vreeken, 2018a), to detect functional dependencies between variables (see for instance Mandros, Boley, and Vreeken, 2020), as well as to detect correlations between subspaces, to discover causal rules from observational data, or to detect correlations between categorical variables without assumptions on the distribution. Considering temporal data and Granger causality, Budhathoki and Vreeken (2018b) aim to infer the direction of causality between two event sequences using a sequential normalised maximum likelihood (NML) score, while Hlaváčková-Schindler et al. aim to detect causality between timeseries that follow a Poisson distribution, using graphical models and a minimum message length (MML) criterion.

Maximum entropy modeling
Considering the core task of mining itemsets and association rules, it quickly became obvious that finding items that frequently co-occur is not enough, and that one needs to consider statistical dependencies between the items (see for instance Silverstein, Brin, and Motwani, 1998 in Section 3.7). In particular, more insight can be obtained by estimating the expected frequency of co-occurrence of items and comparing it to the observed frequency. Various probabilistic models can be used for the estimation (Pavlov, Mannila, and Smyth, 2003), including the maximum entropy distribution.
Several approaches rely on the maximum entropy distribution to estimate the occurrence frequency of itemsets based on their subsets, and contrast this estimate with the observed frequency to filter the itemsets (see for instance C. Wang and Parthasarathy, 2006). Statistics of the dataset, such as marginal counts, can also be used to constrain the model.
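In its simplest form, with only singleton frequencies as constraints, the maximum entropy estimate reduces to an independence model, as in the hypothetical example below.

```python
from math import log2

n = 1_000                                    # number of transactions (hypothetical)
support = {"a": 400, "b": 350, "ab": 260}    # observed supports

# Maximum entropy estimate given only the singleton frequencies:
# the items are modelled as independent.
expected = (support["a"] / n) * (support["b"] / n)
observed = support["ab"] / n

# A pattern is the more interesting the more its observed frequency
# deviates from the estimate (here measured per transaction, in bits).
lift_bits = log2(observed / expected)
print(f"expected {expected:.3f}, observed {observed:.3f}, lift: {lift_bits:.2f} bits")
```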
Going beyond local models, other work defines maximum entropy models for the dataset as a whole, which allows iteratively selecting a collection of itemsets that summarises the data well. The approach is implemented as the MTV algorithm (see also Section 5.1). A related line of work looks for a collection of patterns together with a partition of the transactions into components, such that patterns might be relevant only to a subset of the components. The actual mining is done by alternating between two algorithms: DISC refines the assignment of transactions to components given a collection of patterns, whereas DESC discovers patterns given a partitioning of the data. The latter is essentially an improved variant of the MTV algorithm, as it optimises the same score but can additionally deal with different data components.
Simply put, maximum entropy modeling for pattern mining works as follows. Given some properties of the dataset, a probability distribution is computed over datasets possessing these properties in expectation. The maximum entropy distribution is chosen because this distribution makes no additional assumptions beyond the considered properties and is therefore the least biased. The probability of observing each of the different candidate patterns under this distribution, that is, the probability that the pattern occurs in a dataset with the considered properties, is then evaluated. The lower this probability, the more unexpected and surprising the pattern is considered to be, and hence the more interesting it is deemed. Selected patterns can be seen as discovered properties of the dataset. They can be incorporated as constraints and the probability distribution updated, thereby supporting an iterative, potentially interactive, mining process.
Whereas when following the MDL principle we aim to describe the whole dataset as compactly as possible, the goal when using maximum entropy models is to select the most informative patterns. Typically, selecting all non-redundant patterns that convey information would still produce a large output. Therefore, a criterion must be used to decide when to stop, putting in balance the information content of the patterns and the model complexity.
The constraints imposed on the distribution might capture measured properties of the dataset at hand, but might also reflect the expertise and (possibly incorrect) assumptions of the analyst with respect to the data. The evaluation of the patterns is thus designed to take into account the current experience and understanding of the analyst, albeit in a limited manner. For this reason, the resulting interestingness measure is often called "subjective".
De Bie introduces a framework for data mining based on maximum entropy modeling, sometimes referred to as the FORSIED framework, for Formalising Subjective Interestingness. The framework was first instantiated for the task of mining tiles from a binary database, considering assumptions on the row and column marginals; models were then derived for different types of assumptions, data and patterns, such as real-valued tabular data and various kinds of subgraphs, also in a visual interactive exploratory setting.
An intuitive difference between the two families of approaches is that, following the MDL principle, what is most frequent, most expected, results in the most efficient compression. Instead, in maximum entropy modeling, what is most unexpected, deviates most from assumptions, is generally considered most interesting. However, going too far in either direction can be dangerous. Conforming too much to expectations can lead to rather boring results, while very unexpected results can be startling and difficult to interpret.
Choosing the type of patterns of interest and designing the encoding allows to incorporate background information by favouring some patterns over others, yet this is somewhat implicit, indirect and static. Maximum entropy approaches instead require modelling assumptions about the data more explicitly. They tend to be fairly computationally intensive, though much less so than randomisation approaches (cf. Section 3.7), which need to explicitly generate, and possibly mine, a large number of randomised copies of the dataset to achieve a comparably precise evaluation. Yet, as with randomisation approaches, formulating anything but simple assumptions about the distribution can be difficult. On the other hand, unlike MDL-based approaches, most methods relying on the maximum entropy distribution naturally allow for updates, incorporating feedback, and support interactive analysis. That is, in theory, maximum entropy models allow to incorporate diverse background constraints, in a flexible and potentially interactive manner. However, in practice, this is limited by the fact that constraints can quickly render the optimisation unfeasible. As a step towards alleviating this limitation, Dalleiger and Vreeken (2020b) propose an algorithm that dynamically factorises the joint distribution in order to effectively and efficiently approximate the maximum entropy distribution.
After giving an outline of relevant concepts from information theory and coding, and an aperçu of related theoretical and conceptual contributions, we reviewed MDL-based methods for mining various types of data and patterns. In particular, we focused on aspects related to the design of an encoding scheme, rather than on algorithmic issues for instance, since the former constitutes the most distinctive ingredient of MDL methodologies, but also a major stumbling block and source of contention. We pointed out two main strategies that underpin the majority of approaches and that can be used to categorise them. Namely, we distinguished dictionary-based approaches from block-based approaches. Then, we considered some discussion points pertaining to the use of MDL in pattern mining, and highlighted related problems that constitute promising directions for future research. Indeed, there