Gibbs Sampling Subjectively Interesting Tiles

The local pattern mining literature has long struggled with the so-called pattern explosion problem: the size of the set of patterns found exceeds the size of the original data. This causes computational problems (enumerating a large set of patterns will inevitably take a substantial amount of time) as well as problems for interpretation and usability (trawling through a large set of patterns is often impractical). Two complementary research lines aim to address this problem. The first aims to develop better measures of interestingness, in order to reduce the number of uninteresting patterns that are returned [6,10]. The second aims to avoid an exhaustive enumeration of all ‘interesting’ patterns (where interestingness is quantified in a more traditional way, e.g. frequency), by directly sampling from this set in a way that more ‘interesting’ patterns are sampled with higher probability [2]. Unfortunately, the first research line does not reduce computational cost, while the second may miss out on the most interesting patterns. In this paper, we combine the best of both worlds for mining interesting tiles [8] from binary databases. Specifically, we propose a new pattern sampling approach based on Gibbs sampling, where the probability of sampling a pattern is proportional to its subjective interestingness [6], an interestingness measure reported to better represent true interestingness. The experimental evaluation confirms the theory, but also reveals an important weakness of the proposed approach which we speculate is shared with any other pattern sampling approach. We thus conclude with a broader discussion of this issue, and a forward look.


Introduction
Pattern mining methods aim to select elements from a given language that bring to the user "implicit, previously unknown, and potentially useful information from data" [7]. To meet the challenge of selecting the appropriate patterns for a user, several lines of work have been explored: (1) Many constraints on measures that assess the quality of a pattern using exclusively the data have been designed [4,12,13]; (2) Preference measures have been considered to only retrieve patterns that are non-dominated in the dataset; (3) Active learning systems have been proposed that interact with the user to elicit her interest in the patterns and guide the exploration toward those she is interested in; (4) Subjective interestingness measures [6,10] have been introduced that aim to take into account the implicit knowledge of a user by modeling her prior knowledge and retrieving the patterns that are unlikely according to the background model.
The shift from threshold constraints on objective measures toward the use of subjective measures provides an elegant solution to the so-called pattern explosion problem by considerably reducing the output to only truly interesting patterns. Unfortunately, the discovery of subjectively interesting patterns with exact algorithms remains computationally challenging.
In this paper we explore another strategy, namely pattern sampling. The aim is to reduce the computational cost while identifying the most important patterns, and allowing for distributed computations. There are two families of local pattern sampling techniques.
The first family uses Metropolis-Hastings [9], a Markov Chain Monte Carlo (MCMC) method. It performs a random walk over a transition graph representing the probability of reaching a pattern given the current one. This can be done with the guarantee that the distribution of the considered quality measure is proportional on the sample set to the one of the whole pattern set [1]. However, each iteration of the random walk is accepted only with a probability equal to the acceptance rate α. This can be very small, which may result in a prohibitively slow convergence rate. Moreover, in each iteration the part of the transition graph representing the probability of reaching patterns given the current one has to be materialized in both directions, further raising the computational cost. Other approaches [5,11] relax this constraint but lose the guarantee.
Methods in the second family are referred to as direct pattern sampling approaches [2,3]. A notable example is [2], where a two-step procedure is proposed that samples frequent itemsets without simulating stochastic processes. In a first step, it randomly selects a row according to a first distribution, and from this row, draws a subset of items according to another distribution. The combination of both steps follows the desired distribution. Generalizing this approach to other pattern domains and quality measures appeared to be difficult.
In this paper, we propose a new pattern sampling approach based on Gibbs sampling, where the probability of sampling a pattern is proportional to its Subjective Interestingness (SI) [6]. Gibbs sampling, described in Sect. 3, is a special case of Metropolis-Hastings where the acceptance rate α is always equal to 1. In Sect. 4, we show how the random walk can be simulated without materializing any part of the transition graph, except the currently sampled pattern. While we present this approach particularly for mining tiles in rectangular databases, applying it to other pattern languages can be achieved relatively easily. The experimental evaluation (Sect. 5) confirms the theory, but also reveals a weakness of the proposed approach which we speculate is shared by other direct pattern sampling approaches. We thus conclude with a broader discussion of this issue (Sect. 6), and a forward look (Sect. 7).

Notation
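Input Dataset. A dataset D is a Boolean matrix with m rows and n columns. For i ∈ {1, ..., m} and j ∈ {1, ..., n}, D(i, j) ∈ {0, 1} denotes the value of the cell corresponding to the i-th row and the j-th column. For a given set of rows I ⊆ {1, ..., m}, we define the support function supp_C(I) that gives all the columns having a value of 1 in all the rows of I, i.e.,

$$\mathrm{supp}_C(I) = \{\, j \in \{1, \ldots, n\} \mid \forall i \in I : D(i, j) = 1 \,\}.$$

The row support supp_R(J) of a set of columns J ⊆ {1, ..., n} is defined symmetrically as the set of rows having a value of 1 in all the columns of J. Table 1 shows a toy example of a Boolean matrix, where for I = {4, 5, 6} we have that supp_C(I) = {2, 3, 4}.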
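As an illustration of the two support functions, the following minimal Python sketch may help. The matrix below is hypothetical (it is not the paper's Table 1, which is not reproduced here) and uses 0-based indices, so the row set {3, 4, 5} plays the role of {4, 5, 6} in the text. The helper `is_tile` encodes our assumption that a tile (I, J) is present in D when every cell it covers equals 1.

```python
# Hypothetical 6x5 Boolean matrix (assumption: not the paper's Table 1), 0-based indices.
D = [
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
]

def supp_C(D, rows):
    """Columns that contain a 1 in every row of `rows`."""
    return {j for j in range(len(D[0])) if all(D[i][j] for i in rows)}

def supp_R(D, cols):
    """Rows that contain a 1 in every column of `cols`."""
    return {i for i in range(len(D)) if all(D[i][j] for j in cols)}

def is_tile(D, rows, cols):
    """Assumed presence test: every cell covered by (rows, cols) equals 1."""
    return all(D[i][j] for i in rows for j in cols)

common_columns = supp_C(D, {3, 4, 5})
common_rows = supp_R(D, common_columns)
```

Note that any row set I together with any subset of supp_C(I) passes `is_tile`, which is the property the sampling procedure relies on later.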

The Interestingness of a Tile
In order to assess the quality of a tile τ, we use the framework of subjective interestingness SI proposed in [6]. We briefly recapitulate the definition of this measure for tiles, denoted SI(τ) for a tile τ, and refer the reader to [6] for more details. SI(τ) measures the quality of a tile τ as the ratio of its subjective information content IC(τ) and its description length DL(τ):

$$SI(\tau) = \frac{IC(\tau)}{DL(\tau)}.$$

Tiles with large SI(τ) thus compress subjective information in a short description. Before introducing IC and DL, we first describe the background model, an important component required to define the subjective information content IC.
Background Model. The SI is subjective in the sense that it accounts for the prior knowledge of the current data miner. A tile τ is informative for a particular user if this tile is somehow surprising for her; otherwise, it does not bring new information. The most natural way of formalizing this is to use a background distribution representing the data miner's prior expectations, and to compute the probability Pr(τ ∈ D) of this tile under this distribution. The smaller Pr(τ ∈ D), the more information this pattern contains. Concretely, the background model consists of a value Pr(D(i, j) = 1) associated to each cell D(i, j) of the dataset, and denoted p_ij. More precisely, p_ij is the probability that D(i, j) = 1 under the user's prior beliefs. In [6], it is shown how to compute the background model and derive all the values p_ij corresponding to a given set of considered user priors.
Based on this model, the probability of having a tile τ = (I, J) in D is:

$$\Pr(\tau \in D) = \prod_{i \in I} \prod_{j \in J} p_{ij}.$$

Information Content IC. This measure aims to quantify the amount of information conveyed to a data miner when she is told about the presence of a tile in the dataset. It is defined for a tile τ = (I, J) as follows:

$$IC(\tau) = -\log \Pr(\tau \in D) = -\sum_{i \in I} \sum_{j \in J} \log p_{ij}.$$

Thus, the smaller Pr(τ ∈ D), the higher IC(τ), and the more informative τ. Note that for two tiles τ_1 ⊆ τ_2 we have IC(τ_1) ≤ IC(τ_2), since every term −log p_ij is non-negative.

Description Length DL. This function should quantify how difficult it is for a user to assimilate the pattern. The description length of a tile τ = (I, J) should thus depend on how many rows and columns it refers to: the larger |I| and |J| are, the larger the description length is. Thus, DL(τ) can be defined as:

$$DL(\tau) = a \cdot |I| + b \cdot |J|,$$

where a and b are two constants that can be tuned to give more or less importance to the contributions of |I| and |J| in the description length.
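To make the three quantities concrete, here is a small Python sketch that evaluates IC, DL and SI for one tile. The uniform background model (every p_ij = 0.5), the use of the natural logarithm, and the default constants a = b = 1 are assumptions made for the example only; the background model of [6] would normally supply the p_ij values.

```python
import math

def information_content(p, rows, cols):
    """IC(tau) = -sum over covered cells of log p[i][j] (natural log assumed)."""
    return -sum(math.log(p[i][j]) for i in rows for j in cols)

def description_length(rows, cols, a=1.0, b=1.0):
    """DL(tau) = a * |I| + b * |J|; a and b are illustrative defaults."""
    return a * len(rows) + b * len(cols)

def subjective_interestingness(p, rows, cols, a=1.0, b=1.0):
    """SI(tau) = IC(tau) / DL(tau)."""
    return information_content(p, rows, cols) / description_length(rows, cols, a, b)

# Hypothetical background model for a 3x4 dataset: each cell is 1 with probability 0.5.
p = [[0.5] * 4 for _ in range(3)]
example_si = subjective_interestingness(p, rows={0, 1}, cols={1, 2, 3})
```

In this example the tile covers six cells, so IC is six times log 2 and DL is 5. Enlarging the tile raises IC but also DL, which is the trade-off SI captures.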

Problem Statement
Given a Boolean dataset D, the goal is to sample a tile τ from the set T of all the tiles present in D, with a sampling probability P_S proportional to SI(τ), that is:

$$P_S(\tau) = \frac{SI(\tau)}{\sum_{\tau' \in T} SI(\tau')}.$$

A naïve approach to sample a tile pattern according to this distribution is to generate the list {τ_1, ..., τ_N} of all the tiles present in D, sample x ∈ [0, 1] uniformly at random, and return the tile τ_k such that $\sum_{i=1}^{k-1} P_S(\tau_i) \le x < \sum_{i=1}^{k} P_S(\tau_i)$. However, the goal behind using sampling approaches is to avoid materializing the pattern space, which is generally huge. We want to sample without exhaustively enumerating the set of tiles. In [2], an efficient procedure is proposed to directly sample patterns according to some measures such as the frequency and the area. However, this procedure is limited to only some specific measures. Furthermore, it is proposed for pattern languages defined on only the column dimension, for example, itemset patterns. In such a language, the rows related to an itemset pattern F ⊆ {1, ..., n} are uniquely identified: they are all the rows containing the itemset, that is, supp_R(F). In our work, we are interested in tiles, which are defined by both column and row indices. In this case, it is not clear how the direct procedure proposed in [2] can be applied.
For more complex pattern languages, a generic procedure based on the Metropolis-Hastings algorithm has been proposed in [9], and illustrated for subgraph patterns with some quality measures. While this approach is generic and can be extended relatively easily to different mining tasks, a major drawback of using the Metropolis-Hastings algorithm is that the random walk procedure contains an acceptance test that needs to be processed in each iteration, and the acceptance rate α can be very small, which makes convergence extremely slow in practice. Furthermore, Metropolis-Hastings can be computationally expensive, as the part of the transition graph representing the probability of reaching patterns given the current one has to be materialized.
Interestingly, a very useful MCMC technique is Gibbs sampling, which is a special case of the Metropolis-Hastings algorithm. A significant benefit of this approach is that the acceptance rate α is always equal to 1, i.e., the proposal of each sampling iteration is always accepted. In this work, we use Gibbs sampling to draw patterns with a probability distribution that converges to P_S. In what follows, we first present the Gibbs sampling approach generically, and then show how we efficiently exploit it for our problem. Unlike Metropolis-Hastings, the proposed procedure performs a random walk by materializing in each iteration only the currently sampled pattern.
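The naïve enumerate-then-sample procedure described above can be sketched as follows, purely for illustration on a tiny matrix. The matrix `D`, the uniform background probabilities `p`, the choice a = b = 1, and the restriction to tiles with non-empty row and column sets are assumptions of this sketch, not details taken from the paper.

```python
import itertools
import math
import random

def nonempty_subsets(items):
    """All non-empty subsets of `items`, as sorted tuples."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        yield from itertools.combinations(items, k)

def enumerate_tiles(D):
    """Every (I, J) with non-empty I and J whose covered cells are all 1."""
    m, n = len(D), len(D[0])
    for J in nonempty_subsets(range(n)):
        support = [i for i in range(m) if all(D[i][j] for j in J)]
        for I in nonempty_subsets(support):
            yield (I, J)

def si(p, I, J, a=1.0, b=1.0):
    ic = -sum(math.log(p[i][j]) for i in I for j in J)
    return ic / (a * len(I) + b * len(J))

def naive_sample(D, p, rng):
    """Inverse-CDF sampling over the fully materialised tile list."""
    tiles = list(enumerate_tiles(D))
    weights = [si(p, I, J) for I, J in tiles]
    total = sum(weights)
    x = rng.random()
    cumulative = 0.0
    for tile, w in zip(tiles, weights):
        cumulative += w / total
        if x < cumulative:
            return tile
    return tiles[-1]  # guard against floating-point round-off

# Hypothetical toy data.
D = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
p = [[0.5] * 3 for _ in range(3)]
sampled_tile = naive_sample(D, p, random.Random(0))
```

Even on this 3 × 3 matrix the tile list is several times larger than the number of cells, which illustrates why materialising T does not scale.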

Gibbs Sampling
Suppose we have a random variable X = (X_1, X_2, ..., X_l) taking values in some domain Dom. We want to sample a value x ∈ Dom following the joint distribution P(X = x). Gibbs sampling is suitable when it is hard to sample directly from P but it is known how to sample just one dimension x_k (k ∈ {1, ..., l}) from the conditional probability $P(X_k = x_k \mid X_1 = x_1, \ldots, X_{k-1} = x_{k-1}, X_{k+1} = x_{k+1}, \ldots, X_l = x_l)$. The idea of Gibbs sampling is to generate samples by sweeping through each variable (or block of variables) to sample from its conditional distribution with the remaining variables fixed to their current values. Algorithm 1 depicts a generic Gibbs sampler. At the beginning, x is set to its initial values (often values sampled from a prior distribution q). Then, the algorithm performs a random walk of p iterations. In each iteration, we sample x_1 from $P(X_1 = x_1 \mid X_2 = x_2, \ldots, X_l = x_l)$ (while fixing the other dimensions), then we follow the same procedure to sample x_2, ..., until x_l.
Algorithm 1: Gibbs sampler

The random walk needs to satisfy some constraints to guarantee that the Gibbs sampling procedure converges to the stationary distribution P. In the case of a finite number of states (a finite space Dom in which X takes values), sufficient conditions for the convergence are irreducibility and aperiodicity:

Irreducibility. A random walk is irreducible if, for any two states x, y ∈ Dom s.t. P(x) > 0 and P(y) > 0, we can get from x to y with a probability > 0 in a finite number of steps, i.e., the entire state space is reachable.

Aperiodicity. A random walk is aperiodic if we can return to any state x ∈ Dom at any time, i.e., revisiting x is not conditioned to some periodicity constraint.
One can also use blocked Gibbs sampling. This consists in grouping several variables together and sampling from their joint distribution conditioned on the remaining variables, rather than sampling each variable x_i individually. Blocked Gibbs sampling can reduce the problem of slow mixing caused by a high number of dimensions to sample from.
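A generic sweep-based Gibbs sampler in the spirit of the description above can be sketched in Python as follows. This is not the paper's Algorithm 1, whose listing is not reproduced here; it is a minimal version for discrete variables with small finite domains, where each conditional is obtained by evaluating an unnormalised joint `weight` over the values of one coordinate. The names `weight`, `domains` and the toy joint distribution are assumptions of the example.

```python
import random

def gibbs_sampler(weight, domains, x0, iterations, rng):
    """Sweep-based Gibbs sampler for a discrete, unnormalised joint `weight`.

    `domains[k]` lists the values coordinate k may take; `x0` must have weight > 0.
    """
    x = list(x0)
    samples = []
    for _ in range(iterations):
        for k, domain in enumerate(domains):
            # Conditional of X_k given the other coordinates, up to normalisation.
            conditional = []
            for value in domain:
                x[k] = value
                conditional.append(weight(tuple(x)))
            x[k] = rng.choices(domain, weights=conditional, k=1)[0]
        samples.append(tuple(x))
    return samples

# Toy joint over {0, 1}^2 with weights 1, 2, 3, 4 for (0,0), (1,0), (0,1), (1,1).
def weight(x):
    return 1 + x[0] + 2 * x[1]

domains = [[0, 1], [0, 1]]
samples = gibbs_sampler(weight, domains, (0, 0), 20000, random.Random(1))
```

Every proposed value is accepted, which is the acceptance rate of 1 mentioned earlier; the price is that each conditional must be computable, which is cheap here only because the domains are tiny.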

Gibbs Sampling of Tiles with Respect to SI
In order to sample a tile τ = (I, J) with a probability proportional to SI(τ), we propose to use Gibbs sampling. The simplest solution is to consider a tile τ as m + n binary random variables (x_1, ..., x_m, ..., x_{m+n}), each corresponding to a row or a column, and then apply the procedure described in Algorithm 1. In this case, an iteration of Gibbs sampling requires sampling each column and row separately while fixing all the remaining rows and columns. The drawback of this approach is the high number of variables (m + n), which may lead to a slow mixing time. In order to reduce the number of variables, we propose to split τ = (I, J) into only two separate blocks of random variables I and J; we then directly sample from each block while fixing the value of the other block. This means that an iteration of the random walk contains only two sampling operations instead of m + n. We will explain in more detail how this blocked Gibbs sampling approach can be applied, and how to compute the distributions used to directly sample a block of rows or columns.

Algorithm 2: Gibbs
Algorithm 2 depicts the main steps of blocked Gibbs sampling for tiles. We start by initializing (I, J)^(0) with a distribution q proportional to the area (|I| × |J|) following the approach proposed in [2]. This choice is mainly motivated by its linear time complexity of sampling. Then, we need to efficiently sample from $P(\mathbf{I} = I \mid \mathbf{J} = J)$ and $P(\mathbf{J} = J \mid \mathbf{I} = I)$. In the following, we will explain how to sample I with $P(\mathbf{I} = I \mid \mathbf{J} = J)$, and since the SI is symmetric w.r.t. rows and columns, the same strategy can be used symmetrically to sample a set of columns with $P(\mathbf{J} = J \mid \mathbf{I} = I)$.
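The alternation between the two blocks can be sketched in Python as below. This is an illustrative stand-in, not the paper's Algorithm 2: each conditional is sampled by brute-force enumeration of the candidate subsets (exponential, so usable only on tiny matrices), the walk starts from a hand-picked single cell instead of the area-proportional initialization of [2], and the data, the uniform background model and a = b = 1 are all assumptions.

```python
import itertools
import math
import random

def si(p, I, J, a=1.0, b=1.0):
    ic = -sum(math.log(p[i][j]) for i in I for j in J)
    return ic / (a * len(I) + b * len(J))

def sample_block(candidates, score, rng):
    """Draw a non-empty subset of `candidates` with probability proportional to its score."""
    subsets = [c for k in range(1, len(candidates) + 1)
               for c in itertools.combinations(candidates, k)]
    return rng.choices(subsets, weights=[score(s) for s in subsets], k=1)[0]

def blocked_gibbs_tiles(D, p, start, iterations, rng):
    """Alternate between sampling rows given columns and columns given rows."""
    m, n = len(D), len(D[0])
    I, J = start
    visited = []
    for _ in range(iterations):
        rows = [i for i in range(m) if all(D[i][j] for j in J)]  # supp_R(J)
        I = sample_block(rows, lambda S: si(p, S, J), rng)
        cols = [j for j in range(n) if all(D[i][j] for i in I)]  # supp_C(I)
        J = sample_block(cols, lambda S: si(p, I, S), rng)
        visited.append((I, J))
    return visited

# Hypothetical toy data; the start tile is the single cell (row 1, column 1).
D = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
p = [[0.5] * 3 for _ in range(3)]
visited = blocked_gibbs_tiles(D, p, ((1,), (1,)), 20000, random.Random(0))
```

Restricting the candidates to supp_R(J) and supp_C(I) guarantees that every visited state is a tile present in D, so only the current tile is ever materialised.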

Sampling a Set of Rows I Conditioned to Columns J.
For a specific J ⊆ {1, ..., n}, the number of tiles (I, J) present in the dataset can be huge, and can go up to 2^m. This means that naïvely generating all these candidate tiles and then sampling from them is not a solution. Thus, to sample a set of rows I conditioned to a fixed set of columns J, we propose an iterative algorithm that builds the sampled I by drawing each i ∈ I separately, while ensuring that the joint distribution of all the drawings is equal to $P(\mathbf{I} = I \mid \mathbf{J} = J)$. I is built using two variables: R_1 ⊆ {1, ..., m}, made of rows that belong to I, and R_2 ⊆ {1, ..., m} \ R_1, which contains candidate rows that can possibly be sampled and added to R_1. Initially, we have R_1 = ∅ and R_2 = supp_R(J). At each step, we take i ∈ R_2, do a random draw to determine whether i is added to R_1 or not, and remove it from R_2. When R_2 = ∅, the sampled set of rows I is set equal to R_1. To apply this strategy, all we need is to compute the probability of sampling i given the current sets R_1, R_2 and J.

Complexity. Let us compute the complexity of sampling I with a probability $P(\mathbf{I} = I \mid \mathbf{J} = J)$. Before starting the sampling of rows from R_2, we first compute the value of IC({i}, J) for each i ∈ R_2 (in O(n · m)). This allows computing in O(1) the values of IC that appear in this probability. In addition, sampling each element i ∈ R_2 requires computing the corresponding values of f(x, y). These values are computed once for the first sampled row i ∈ R_2 with a cost of O(m), and then they can be updated directly when sampling the next rows, using a recurrence relation. This means that the overall cost of sampling the whole set of rows I with a probability $P(\mathbf{I} = I \mid \mathbf{J} = J)$ is O(n · m). Following the same approach, sampling J conditioned to I is done in O(n · m). As we have p sampling iterations, the worst case complexity of the whole Gibbs sampling procedure of a tile τ is O(p · n · m).

Convergence Guarantee. In order to guarantee the convergence to the stationary distribution proportional to the SI measure, the Gibbs sampling procedure needs to satisfy some constraints. In our case, the sampling space is finite, as the number of tiles is limited to at most 2^{m+n}. Then, the sampling procedure converges if it satisfies the aperiodicity and the irreducibility constraints. The Gibbs sampling for tiles is indeed aperiodic, as in each iteration it is possible to remain in exactly the same state. We only have to verify whether the irreducibility property is satisfied. We can show that, in some cases, the random walk is reducible; we will then show how to make Gibbs sampling irreducible in those cases.
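Returning to the row-sampling step of this section, the per-row inclusion probability can be written in closed form, and the sketch below shows one way to do so. It is our own reconstruction from the definitions of IC and DL, not the paper's formula: since IC is additive over rows and DL depends only on set sizes, summing SI over all I with R_1 ⊆ I ⊆ R_1 ∪ R_2 gives $IC(R_1, J)\, f(x, y) + IC(R_2, J)\, f(x+1, y-1)$ with x = |R_1|, y = |R_2| and the assumed helper $f(x, y) = \sum_{k=0}^{y} \binom{y}{k} / (a(x+k) + b|J|)$, which satisfies f(x, y) = f(x, y−1) + f(x+1, y−1) by Pascal's rule. The paper's own f(x, y) may be defined differently. For brevity, f is recomputed at each call rather than updated incrementally, so this version does not reach the stated O(n · m) cost, and it is not numerically robust for large supports.

```python
import math
import random
from itertools import combinations

def f(x, y, a, b, n_cols):
    """Assumed helper: sum over k of C(y, k) / (a * (x + k) + b * n_cols); empty sum if y < 0."""
    return sum(math.comb(y, k) / (a * (x + k) + b * n_cols) for k in range(y + 1))

def inclusion_probability(ic_r1, ic_r2, ic_i, x, y, a, b, n_cols):
    """P(i in I | R1 within I within R1 + R2) for a row i of R2, x = |R1|, y = |R2|."""
    den = ic_r1 * f(x, y, a, b, n_cols) + ic_r2 * f(x + 1, y - 1, a, b, n_cols)
    num = ((ic_r1 + ic_i) * f(x + 1, y - 1, a, b, n_cols)
           + (ic_r2 - ic_i) * f(x + 2, y - 2, a, b, n_cols))
    return num / den

def sample_rows(row_ic, a, b, n_cols, rng):
    """row_ic maps each row of supp_R(J) to IC({i}, J) > 0; returns the sampled rows."""
    r1, ic_r1 = [], 0.0
    remaining = list(row_ic)
    while remaining:
        ic_r2 = sum(row_ic[r] for r in remaining)
        i = remaining.pop()
        prob = inclusion_probability(ic_r1, ic_r2, row_ic[i],
                                     len(r1), len(remaining) + 1, a, b, n_cols)
        if rng.random() < prob:
            r1.append(i)
            ic_r1 += row_ic[i]
    return tuple(sorted(r1))

def exact_distribution(row_ic, a, b, n_cols):
    """Brute-force reference: probability of every non-empty row subset (tiny inputs only)."""
    subsets = [s for k in range(1, len(row_ic) + 1)
               for s in combinations(sorted(row_ic), k)]
    weights = [sum(row_ic[r] for r in s) / (a * len(s) + b * n_cols) for s in subsets]
    total = sum(weights)
    return {s: w / total for s, w in zip(subsets, weights)}

# Hypothetical per-row information contents for a fixed J with two columns.
row_ic = {0: 0.7, 1: 1.3, 2: 2.1}
one_draw = sample_rows(row_ic, 1.0, 1.0, 2, random.Random(0))
```

The helper `exact_distribution` is included only so the closed form can be checked against exhaustive enumeration on small inputs.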
Theorem 1. Let us consider the bipartite graph G = (U, V, E) derived from the dataset D, s.t. U = {1, ..., m}, V = {1, ..., n}, and E = {(i, j) | i ∈ U ∧ j ∈ V ∧ D(i, j) = 1}. If the bipartite graph G is connected, then the Gibbs sampling procedure on tiles of D is irreducible.
Proof. We need to prove that for every pair of tiles τ_1 = (I_1, J_1), τ_2 = (I_2, J_2) present in D, the Gibbs sampling procedure can go from τ_1 to τ_2. Let G_τ1, G_τ2 be the complete bipartite graphs corresponding to τ_1 and τ_2. As G is connected, there is a path from any vertex of G_τ1 to any vertex of G_τ2. The probability that the sampling procedure walks through one of these paths is not 0, as each step of these paths constitutes a tile present in D. After walking on one of these paths, the procedure will find itself on a tile τ ⊆ τ_2. Reaching τ_2 from τ is then possible with non-zero probability in one iteration, by sampling the right rows and then the right columns. Thus, if the bipartite graph G is connected, the Gibbs sampling procedure converges to a stationary distribution.

To make the random walk converge when G is not connected, we can compute the connected components of G, and then apply Gibbs sampling separately in each corresponding subset of the dataset.
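The decomposition into connected components mentioned above can be computed with a standard breadth-first search, as in the following Python sketch. The function name, the representation of D as a list of lists, and the toy matrix are assumptions made for illustration. Columns that contain no 1 belong to no tile and are left out; a row without any 1 yields a component with an empty column set.

```python
from collections import deque

def connected_components(D):
    """Components of the bipartite graph with an edge (row i, column j) iff D[i][j] == 1."""
    m, n = len(D), len(D[0])
    seen_rows, seen_cols = set(), set()
    components = []
    for start in range(m):
        if start in seen_rows:
            continue
        rows, cols = {start}, set()
        seen_rows.add(start)
        queue = deque([("row", start)])
        while queue:
            kind, v = queue.popleft()
            if kind == "row":
                # Visit every not-yet-seen column adjacent to row v.
                for j in range(n):
                    if D[v][j] and j not in seen_cols:
                        seen_cols.add(j)
                        cols.add(j)
                        queue.append(("col", j))
            else:
                # Visit every not-yet-seen row adjacent to column v.
                for i in range(m):
                    if D[i][v] and i not in seen_rows:
                        seen_rows.add(i)
                        rows.add(i)
                        queue.append(("row", i))
        components.append((sorted(rows), sorted(cols)))
    return components

# Hypothetical matrix with two blocks that share no row or column.
D = [[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
components = connected_components(D)
```

Each returned pair identifies a sub-matrix on which the sampler can be run independently, since no tile present in D spans two components.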

Experiments
We report our experimental study to evaluate the effectiveness of Gibbs-SI. Java source code is made available. We consider three datasets whose characteristics are given in Table 2. mushrooms and chess, from the UCI repository, are commonly used for evaluation purposes. kdd contains a set of SIGKDD paper abstracts between 2001 and 2008 downloaded from the ACM website. Each abstract is represented by a row and words correspond to columns, after stop word removal and stemming. For each dataset, the user priors that we represent in the SI background model are the row and column margins. In other terms, we consider that the user knows (or is already informed about) the following statistics: Σ_j D(i, j) for all i ∈ I, and Σ_i D(i, j) for all j ∈ J.

Empirical Sampling Distribution. First, we want to experimentally evaluate how well the Gibbs sampling distribution matches the desired distribution. We need to run Gibbs-SI on small datasets where the size of T is not huge. Then, we take a sufficiently large number of samples so that the sampling distribution can be estimated. To this aim, we have synthetically generated a dataset containing 10 rows, 10 columns, and 855 tiles. We run Gibbs-SI with three different numbers of iterations p: 1k, 10k, and 100k. For each case, we keep all the visited tiles, and we study their distribution w.r.t. their SI values. Figure 1 reports the results. For 1k sampled patterns, the proportionality between the number of times a tile is sampled and its SI is not clearly established yet. For higher numbers of sampled patterns, a linear relation between the two axes is evident, especially for the case of 100k sampled patterns, which represents around 100 times the total number of tiles in the dataset. The two tiles with the highest SI are sampled the most, and the number of times a tile is sampled clearly decreases with its SI value.

Characteristics of Sampled Tiles. To investigate which kind of patterns are sampled by Gibbs-SI, we show in Fig. 2 the distribution of sampled tiles w.r.t. their number of rows, columns, and their SI, for each of the three datasets given in Table 2. For mushrooms and chess, Gibbs-SI is able to return patterns with a diverse number of rows and columns. It samples many more patterns with low SI than patterns with high SI values. In fact, even if we are sampling proportionally to SI, the number of tiles in T with poor quality is significantly higher than the number with high quality values. Thus, the probability of sampling one of the low quality patterns is higher than that of sampling one of the few high quality patterns. For kdd, although the number of columns in sampled tiles varies, all the sampled tiles unfortunately cover only one row. In fact, the particularity of this dataset is the existence of some very large transactions (max = 180).

Quality of the Sampled Tiles.
In this part of the experiment, we want to study whether the quality of the top sampled tiles is sufficient. As exhaustively mining the best tiles w.r.t. SI is not feasible, we need to find some strategy that identifies high quality tiles. We propose to use LCM [14] to retrieve the closed tiles corresponding to the top 10k frequent closed itemsets. A closed tile τ = (I, J) is a tile that is present in D and whose I and J cannot be extended anymore. Although closed tiles are not necessarily the ones with the highest SI, we make the hypothesis that at least some of them have high SI values, as they maximize the value of the IC function. For each of the three real world datasets, we compare the SI of the top closed tiles identified with LCM with that of the ones identified with Gibbs-SI. In Table 3, we show the SI of the top-1 tile, and the average SI of the top-10 tiles, for each of LCM and Gibbs-SI.
Unfortunately, the scores of tiles retrieved with LCM are substantially larger than the ones of Gibbs-SI, especially for mushrooms and chess. Importantly, there may exist tiles that are even better than the ones found by LCM. This means that Gibbs-SI fails to identify the top tiles in the dataset. We believe that this is due to the very large number of low quality tiles, which trumps the number of high quality tiles. The probability of sampling a high-quality tile is exceedingly small, necessitating a sample that is too large in practice to identify any.

Discussion
Our results show that efficiently sampling from the set of tiles with a sampling probability proportional to the tiles' subjective interestingness is possible. Yet, they also show that if the purpose is to identify some of the most interesting patterns, direct pattern sampling may not be a good strategy. The reason is that the number of tiles with low subjective interestingness is vastly larger than the number of those with high subjective interestingness. This imbalance is not sufficiently offset by the relative differences in their interestingness and thus in their sampling probability. As a result, the number of tiles that need to be sampled in order to sample one of the few top interesting ones is of the same order as the total number of tiles.
To mitigate this, one could attempt to sample from alternative distributions that attribute an even higher probability to the most interesting patterns, e.g. with probabilities proportional to the square or other high powers of the subjective interestingness. We speculate, however, that the computational cost of sampling from such more highly peaked distributions will also be larger, undoing the benefit of needing to sample fewer of them. This intuition is supported by the fact that direct sampling schemes according to itemset support are computationally cheaper than according to the square of their support [2].
That said, the use of sampled patterns as features for downstream machine learning tasks, even if these samples do not include the most interesting ones, may still be effective as an alternative to exhaustive pattern mining.

Conclusions
Pattern sampling has been proposed as a computationally efficient alternative to exhaustive pattern mining. Yet, existing techniques have been limited in terms of which interestingness measures they could handle efficiently.
In this paper, we introduced an approach based on Gibbs sampling, which is capable of sampling from the set of tiles proportionally to their subjective interestingness. Although we present this approach for a specific type of pattern language and quality measure, the same scheme can be followed relatively easily to apply Gibbs sampling to other pattern mining settings. The empirical evaluation demonstrates effectiveness, yet it also reveals a potential weakness inherent to pattern sampling: when the interesting patterns are vastly outnumbered by the non-interesting ones, a large number of samples may be required, even if the samples are drawn with a probability proportional to the interestingness. Investigating our conjecture that this problem affects all approaches for sampling interesting patterns (for sensible measures of interestingness) seems a fruitful avenue for further research.
Open Access. This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Fig. 1. Distribution of sampled patterns in synthetic data with 10 rows and 10 columns.

Table 1. Example of a binary dataset D.

Table 3. The SI of the top-1 tile, and the average SI of the top-10 tiles, found by LCM and Gibbs-SI in the studied datasets.