Efficient mixture model for clustering of sparse high dimensional binary data

In this paper we propose a mixture model, SparseMix, for clustering of sparse high dimensional binary data, which connects model-based with centroid-based clustering. Every group is described by a representative and a probability distribution modeling dispersion from this representative. In contrast to classical mixture models based on EM algorithm, SparseMix: -is especially designed for the processing of sparse data, -can be efficiently realized by an on-line Hartigan optimization algorithm, -is able to automatically reduce unnecessary clusters. We perform extensive experimental studies on various types of data, which confirm that SparseMix builds partitions with higher compatibility with reference grouping than related methods. Moreover, constructed representatives often better reveal the internal structure of data.


Introduction
Sparse high-dimensional binary representations became very popular in various domains including text mining, cheminformatics, biology, etc. [23,25,17]. Binary features are usually used to indicate whether an object contains a predefined pattern or satisfies a given rule, e.g. one can represent a sentence by its words (set-ofwords) [3] or identify a chemical compound by its chemical structures (fingerprint) [16]. Therefore, efficient processing and learning from such data is of great practical importance. In this paper we introduce a version of model-based clustering, SparseMix, which efficiently processes high-dimensional sparse binary data 1 . Our model splits the data into groups which can be efficiently compressed by a collection of coding algorithms; each algorithm is designed for encoding elements within a single cluster (Section 3). Since the code-lengths directly depend on the associated probability distribution, the elements generated from similar distributions are grouped together and consequently we obtain the effect of model-based clustering. In contrast to classical mixture models using Bernoulli variables or latent trait models, our model is designed for sparse data and can be optimized by an on-line Hartigan algorithm, which converges faster and finds better solutions than batch procedures like EM (Section 4).
SparseMix builds a bridge between mixture models and centroid-based clustering, and describes every cluster by its representative (a single vector characterizing the most popular cluster patters) and probability distribution modeling dispersion from this representative. The relationship between the form of the representative and the associated cluster distribution is controlled by an additional parameter of the model. By placing a parameter selection problem on solid mathematical ground, we show that we can move from a model providing the best compression rate of data to the one obtaining high speed performance (Section 4 and Theorem 3.1). Our method can automatically discover the number of groups by introducing a well-justified cost of cluster identification. We present a theoretical and experimental analysis how the number of clusters depends on the main characteristics of the data set (Example 3.1).
The paper contains extensive experiments performed on various types of data, including text corpora, chemical and biological data sets, as well as the MNIST image database. The results show that SparseMix gives higher compatibility with reference partition than existing methods based on mixture models and similarity measures. Most of clusters representatives obtained by SparseMix on MNIST data set correspond to distinct digits, which confirms high quality of its results, see Figure  1. Its running time is significantly lower than in related model-based algorithms and comparable to methods implemented in the Cluto package, which are optimized for processing large data sets.
The paper is organized as follows. The next section contains a brief description of related clustering methods. The SparseMix model is introduced in Section 3. Section 4 presents a practical implementation of our method. Experiments are included in Section 5 and finally the conclusion is given.

Related work
A lot of clustering methods are expressed in a geometric framework, where similarity between objects is defined with the use of the Euclidean metric, e.g. k-means [31]. Although a geometric description of clustering can be insightful for continuous data, it becomes less informative in the case of high dimensional binary (or discrete) vectors, where the Euclidean distance is not natural. To adapt these approaches to binary data sets, the authors consider, for instance, k-medoids or k-modes [21,10] with dissimilarity measure designed for this special type of data, such as Hamming, Jaccard or Tanimoto measures [28]. Evaluation of possible dissimilarity metrics for categorical data can be found in [14,2]. To obtain a more flexible structure of clusters one can also use hierarchical methods [43] or density-based clustering [40].
Model-based clustering [32], where data is modeled as a sample from a parametric mixture of probability distributions, is commonly used for grouping continuous data using Gaussian models, but has also been adapted for processing binary data. In the simplest case, the probability model of each cluster is composed of a sequence of independent Bernoulli variables (or multinomial distributions) describing the probabilities on subsequent attributes [9,23]. Since many attributes usually are statistically irrelevant and independent of true categories, they may be removed or associated with small weights [19,6]. This partially links mixture models with subspace clustering of discrete data [41,11]. Since the use of multinomial distributions formally requires independence of attributes, different smoothing techniques were proposed, such as applying Dirichlet distributions as a prior to the multinomial [7]. Another version of mixture models for binary variables tries to maximize the probability that data points are generated around cluster centers with the smallest possible dispersion [9]. This technique is closely related to our approach, however, our model allows for using any clusters representatives (not only cluster centers), is significantly faster due to the use of sparse coders and can automatically deduce the number of clusters.
A mixture of latent trait analyzers is a specialized type of mixture model for categorical data, where a continuous univariate of a multivariate latent variable is used to describe the dependence in categorical attributes [37,18]. Although this technique recently received high interest in the literature [26,8], it is potentially difficult to fit the model, because the likelihood function involves an integral that cannot be evaluated analytically. Moreover, its use is computationally expensive for large high dimensional data sets [35].
Information-theoretic clustering relies on minimizing the entropy of a partition or maximizing the mutual information between data and its discretized form [29,36,13]. Although both approaches are similar and can be explained as a minimization of coding cost, the first creates "hard clusters", where an instance is classified to exactly one category, while the second allows for soft assignments [34]. Informationtheoretic clustering was combined with model selection criteria as MDLP (minimum description length principle) [33] or MML (minimum message length) [38] to add a model complexity to the clustering objective function [4]. The authors of [9] established a close connection between entropy-based techniques for discrete data and model-based clustering using Bernoulli variables. In particular, entropy criterion can be formally derived using a classification maximum likelihood.
To the best authors knowledge, neither model-based clustering nor informationtheoretic methods have been optimized for processing sparse high-dimensional binary data. Our method can be seen as a combination of k-medoids with modelbased clustering (in the sense that it describes a cluster by a single representative and a multivariate probability model), which is efficiently realized for sparse highdimensional binary data. It is worth to mention the Cluto package [24], which is a practical clustering software designed especially for processing sparse high dimensional data. The Cluto package is built on a sophisticated multi-level graph partitioning engine and offers many different criteria that can be used to derive both partitional, hierarchical and spectral clustering algorithms.

Clustering model
The goal of clustering is to split the data into groups that contain elements characterized by similar patters. In our approach the elements are similar if they can be efficiently compressed by the same algorithm. We begin this section with presenting a model for compressing elements within a single cluster. Next, we combine these partial encoders and define a final clustering objective function.

Compression model for a single cluster
Let us assume that X ⊂ {0, 1} D is a data set (cluster) containing D-dimensional binary vectors. We implicitly assume that X represents sparse data, i.e. the number of positions occupied by non-zero bits is relatively low. A typical way for encoding such data is to directly remember the values at each coordinate [4,29]. Since, in practice, D is often large, this straightforward technique might be computationally inefficient. Moreover, due to the sparsity of data, positions occupied by zeros are not very informative while the less frequent non-zero bits carry substantial knowledge. Therefore, instead of remembering all the bits of every vector, it might be more convenient to encode positions occupied by non-zero values. It occurs that this strategy can be efficiently implemented by on-line algorithms.
To realize the aforementioned goal, we first select a representative (prototype) m ∈ {0, 1} D of a cluster X. Next, for each data point x = (x 1 , . . . , x D ) ∈ X we construct a corresponding vector xor(x, m) = (|x 1 − m 1 |, . . . , |x D − m D |) ∈ {0, 1} D of differences with m. If a representative is chosen as the most probable point of a cluster (mean of a cluster), then the data set of differences will be sparser on average than the original data set X. An efficient way for storing such sparse data relies on encoding the numbers of coordinates with non-zero bits. Concluding, the original data X is compressed by remembering a representative and encoding resulting vectors of differences in an efficient manner, see Figure 2.
We now precisely follow the above idea and calculate the cost of coding in this scheme, which will be the basis of our clustering criterion function. Let the distribution at the i-th coordinate of x ∈ X be described by a Bernoulli random variable taking value 1 with a probability p i ∈ [0, 1] and 0 with a probability (1 − p i ), i.e. p i = P (x i = 1). For a fixed T ∈ [0, 1], we consider a representative m = m(T ) = (m 1 , . . . , m D ) defined by Although a representative m(T ) depends on T , we usually discard this parameter and simply write m, when T is known from the context. The i-th coordinate of X is more likely to attain value 1, if p i > 1 2 and, in consequence, for T = 1 2 the representative m coincides with the average vector (most probable point) of X.
Given a representative m, we consider the differences xor(x, m), for x ∈ X, and denote such a data set by The probability q i = q i (T ) of bit 1 at the i-th position in D m (X) equals Let us notice that q i ≤ p i , for T ≥ 1 2 , which makes D m (X) sparser than X, see Figure 3.
To design an efficient coder for sparse data, let us consider a probability distribution Q = Q(T ) = {Q 1 , . . . , Q D } on the set of coordinates {1, . . . , D} describing a distribution of positions with non-zero bits in D m (X). A number Q i is a conditional probability that the i-th position holds value 1 if at least one coordinate in a vector is non-zero, i.e.
where Z = Z(T ) = D j=1 q j is a normalization factor. In practice, it is convenient to restrict the attention to the support of Q (non-zero probabilities), because we usually have Q i = 0 for most i.
The Shannon entropy theory states that the code-lengths in an optimal prefixfree coding depend strictly on the associated probability distribution [12]. Given a distribution Q of positions with bit 1 it is possible to construct at most D codes, each with the length 2 − log Q i (we generate codes only for Q i > 0). The short codes correspond to the most frequent symbols, while the longest ones are related with rare objects. Given an arbitrary element d = (d 1 , . . . , d D ) ∈ D m (X) we encode its non-zero coordinates and obtain that its compression requires This leads to the SparseMix objective function for a single cluster, which gives the average cost of compressing a single element of X by our sparse coder: Definition 3.1. (one cluster cost function) Let X ⊂ {0, 1} D be a data set and let T ∈ [0, 1] be fixed. SparseMix objective function for a single cluster is given by 3 : Observe that, given probabilities p 1 , . . . , p D , the selection of T determines the form of m and D m (X).
Remark 3.1. To be able to decode the initial data set, we would also need to remember the probabilities p 1 , . . . , p D determining the form of the representative m and the corresponding probability Q used for constructing the codewords. These are the model parameters, which in practical coding scheme are stored once in the header. Since they do not affect the asymptotic value of data compression, we do not include them in the final cost function 4 .
Moreover, to reconstruct the original data we should distinguish the encoded representations of subsequent vectors. It could be realized by reserving an additional symbol for separating two encoded vectors or by remembering the number of nonzero positions in every vector. Although this is necessary in the coding task, it is less important for clustering and therefore we decided not to include it in the definition. 2 in the limiting case 3 We put: 0 · log 0 = 0 4 Nevertheless. these probabilities should be accounted in model selection criteria as AIC or BIC.
The following theorem shows that T = 1 2 provides the best compression rate of a single cluster.
are two representatives and the mean number of non-zero bits in D m 1 (X) is not lower than 1, i.e. Z(T 1 ) ≥ 1, then: Proof. See A Although the best model for compression is given by T = 1 2 , the alternative choices for T might sometimes yield better clustering results. In particular, the greater T is, the faster updates in clustering algorithm we obtain (see Section 4), which might be a crucial issue in practical usage.

Clustering criterion function
A single encoder allows us to compress simple data. To efficiently encode real data sets, which usually origin from several sources, it is profitable to construct multiple coding algorithms, each designed for one homogeneous part of data. Finding an optimal set of algorithms leads to a natural division of data, which is a basic idea behind our model. Below, we describe the construction of our clustering objective function, which combines partial cost functions of clusters.
Let us assume that we are given a partition of X containing k groups X 1 , . . . , X k (pairwise disjoint subsets of X such that X = X 1 ∪ . . . ∪ X k ), where every subset X i is described by its own coding algorithm. Observe that to encode an instance x ∈ X i by such a multiple coding model one should remember its group identifier and the code of x defined by the i-th encoder, i.e., Such a strategy enables unique decoding, because a retrieved coding algorithm allows subsequently for discovering an instance (see Figure 4). The compression procedure should find a division of X and design k coding algorithms, which minimize the expected length of code given by (2). The coding algorithms for each cluster are designed as described in previous subsection. More precisely, let p i = (p i 1 , . . . , p i D ) be a vector, where p i j is a probability that the j-th coordinate in the i-th cluster is non-zero, for i = 1, . . . , k. Next, given a fixed T , for each cluster X i we construct a representative m i = (m i 1 , . . . , m i D ) and calculate the associated probability distributions q i = (q i 1 , . . . , q i D ) and Q i = encoder (cluster) identification codes defined by i-th encoder Figure 4: Multi-encoder model.
The average code-length for compressing a single vector in the i-th cluster is given by (see (1)): To remember clusters identifiers, we again follow Shannon's theory of coding. Given a probability P i = P (x ∈ X i ) of generating an instance from a cluster X i (the prior probability), the optimal code-length of the i-th identifier is given by Since the introduction of any new cluster increases the cost of clusters identification, it might occur that maintaining a smaller number of groups is more profitable. Therefore, this model will have a tendency to variate the sizes of clusters and, in consequence, some groups might finally disappear (can be reduced).
The SparseMix cost function combines the cost of clusters identification with the cost of encoding their elements. To add higher flexibility to the model, we introduce an additional parameter β, which allows to weight the cost of clusters identification. Specifically, if the number of clusters is known a priori, we should put β = 0 to prevent from reducing any groups. On the other hand, to encourage the model to remove clusters we can increase the value of β. By default β = 1, which gives a typical coding model: (clustering cost function) Let X = {0, 1} D be a data set of Ddimensional binary vectors and let X 1 , . . . , X k be a partition of X into pairwise disjoint subsets. For fixed T ∈ [0, 1] and β ≥ 0 the SparseMix clustering objective function equals: where P i is the probability of a cluster X i , cost(i) is the cost of encoding its identifier (4) and cost T (X i ) is the average code-length of compressing elements of X i (3).
As can be seen, every cluster is described by a single representative and a probability distribution modeling dispersion from a representative. Therefore, our model can be interpreted as a combination of k-medoids with model-based clustering. It is worth to comment that for T = 1, we always get a representative m = 0. In consequence, D 0 (X) = X and a distribution in every cluster is directly fitted to original data.
The cost of clusters identification allows us to reduce unnecessary clusters. To get more insight into this mechanism, we present the following example. For simplicity, we use T = β = 1.
To visualize the situation we can arrange a data set in a matrix, where rows correspond to instances generated by the mixture components, while the columns are related to their attributes: The matrix entries show the probability of generating bit 1 at a given coordinate belonging to one of four matrix regions. The parameter α determines how similar are the instances generated from the underlying distributions. For α = 1 2 , both components are identical, while for α ∈ {0, 1} we get their perfect distinction.
We compare the cost of using a single cluster for all instances with the cost of splitting the data into two optimal groups (clusters are perfectly fitted to the sources generating the data). For the reader's convenience, we put the details of the calculations in B. The analysis of the results is presented below. We consider three cases: 1. Sources are characterized by the same number of bits. The influence of ω and α on the number of clusters, for a fixed d = 1 2 D, is presented in Figure 5(a). Generally, if sources are well-separated, i.e. α / ∈ (0.2, 0.8), then Figure 5: Optimal number of clusters for data generated by the mixture of sources given by (7). Blue regions show the combinations of mixture parameters which lead to one cluster, while white areas correspond to two clusters. 5(a) presents the case when every source is characterized by the same number of bits d = 1 2 D, 5(b) corresponds to the situation when each source produces the same number of instances, while 5(c) is the combination of both previous cases.
SparseMix will always create two clusters regardless of the mixing proportion. This confirms that SparseMix is not sensitive to unbalanced sources generating the data if only they are distinct.
2. Sources contain the same number of instances. The Figure 5(b) shows the relation between d and α when the mixing parameter ω = 1 2 . If one source is identified by a significantly lower number of attributes than the other (d << D), then SparseMix will merge both sources into a single cluster. Since one source is characterized by a small number of features, it might be more costly to encode the cluster identifier than its attributes. In other words, the clusters are merged together, because the cost of cluster identification outweighs the cost of encoding the source elements.

Fast optimization algorithm
In this section, we present an efficient on-line algorithm for optimizing the SparseMix cost function. Before that, let us first show how to estimate the probabilities involved in formula (5).

Estimation of cost function
We assume that a data set X ⊂ {0, 1} D is split into k groups X 1 , . . . , X k , where n = |X| and n i = |X i |. Let us denote by To calculate the formula for cost T (X i ), we first estimate the probability q i j of bit 1 at the j-th coordinate in D m i (X i ),

If we denote by
the number of vectors in D m i (X i ) with the j-th coordinate occupied by bit 1 and by N i j the total number of non-zero entries in D m i (X i ), then we can estimate the probability Q i j as: This allows us to rewrite the cost function for a cluster X i as Finally, since the probability P i of the i-th cluster can be estimated as P i = n i n , then the optimal code-length of a cluster identifier equals cost(i) = − log n i n .
In consequence, the overall cost function is computed as:

Optimization algorithm
To obtain an optimal partition of X, the SparseMix cost function has to be minimized. Since it is not practically feasible to calculate its global minimum, one can use some iterative algorithms to find one of its local minima. In the present paper we adapt a modified version of the Hartigan procedure, which is commonly applied in an on-line version of k-means [20]. Although the complexity of a single iteration of Hartigan algorithm is often higher than in batch procedures such as EM, it converges in significantly lower number of iterations and usually finds better minima (see Section 5).
The minimization procedure consists of two steps: initialization and iteration. In the initialization stage, k ≥ 2 nonempty groups are formed in an arbitrary manner. In the simplest case, it could be a random initialization, but to obtain better results one can also apply a kind of k-means++ seeding. In the iteration step the elements are reassigned between clusters in order to minimize the value of the criterion function. Additionally, due to the cost of clusters identification some groups may lose their elements and finally disappear. In practice, a cluster is reduced if its size falls below a given threshold ε · |X|, for a fixed ε > 0.
A detailed algorithm is presented below (β and T are fixed): for all x ∈ X do 12: update probability models of Y x and Y 17: if |Y x | < ε · |X| then 18: delete cluster Y x and assign its elements to these clusters which minimize the SparseMix cost function 19: end if 20: end if 21: end for 22: until no switch for all subsequent elements of X The outlined algorithm is not deterministic and depend on the initial partition. Therefore, the algorithm should be restarted multiple times to avoid getting stuck in local minima.
An efficient implementation of this algorithm requires fast updates of cluster models and recalculation of the SparseMix cost function after switching elements between clusters (see lines 13 and 16). Below, we discuss the details of an efficient recalculation of this cost.
We start with showing how to update cost T (X i ), when we add x to a cluster X i , i.e. how to compute cost T (X i ∪ {x}) given cost T (X i ). The situation when we remove x from a cluster is analogous. The updating of n i j and n i is immediate (by a symbol with a hatŷ we denote the updated value of a variable y): In particular, n i j only changes its value on these positions j, where x j is non-zero. Recalculation of N i j is more complex, since it is calculated by using one of two formulas involved in (8), depending on the relation between n i j n i and T . We consider four cases: 1. If n i j ≤ (n i + 1)T − 1, then before and after the update we use the first formula of (8):N i j = N i j + x j . Moreover, this values changes only when x j = 1.
2. If n i j > (n i + 1)T , then before and after the update we use the second formula: It is changed only when x j = 0.
3. If x j = 0 and n i j ∈ (n i T, (n i + 1)T ] then we switch from the second to the first formula andN i j = n i j . Otherwise, it remains unchanged. 4. If x j = 1 and n i j ∈ ((n i + 1)T − 1, n i T ] then we switch from the first to the second formula andN i j = n i − n i j . Otherwise, it remains unchanged.
Due to the sparsity of X there are only a few coordinates of x satisfying x j = 1. In consequence, the complexity of updates in the cases 1 and 4 depends only on the number of non-zero bits in X. On the other hand, although x j = 0 happens often, the situation when n i j > n i T is rare (for T ≥ 1 2 ), because X is sparse. Since the cases 2 and 3 cover a small number of coordinates, then they do not affect greatly on the complexity of the algorithm. Clearly, S i changes only if N j i is changed as well.
Finally, to get the new cost of a cluster, we need to recalculate D j=1 N i j (− log N i j ). If we remember its old value h(N i 1 , . . . , N i D ) = D j=1 N i j (− log N i j ), then it is sufficient to update it on coordinates j such that N j i =N j i by: We now analyze the computational cost of switching an element from one cluster to another. As discussed above, given x ∈ X the recalculation of N i j , for j = 1, . . . , D, dominates the cost of updating any other quantity. Namely, we need to make updates on c i (x) coordinates, where: c i (x) = c i,T (x) = |{j : n i j ∈ ((n i + 1)T − 1, n i T ] and x j = 1}| +|{j : n i j ∈ (n i T, (n i + 1)T ] and x j = 0}| +|{j : n i j ≤ (n i + 1)T − 1 and x j = 1}| +|{j : n i j > (n i + 1)T and x j = 0}| ≤ |{j : Therefore, c i (x) is bounded from the above by the number of non-zero bits in x and the number of coordinates where the probability p i j of bit 1 exceeds the threshold T − 1−T n i . For T = 1 2 , this threshold equals n i −1 2n i , while for T = 1 it attains value 1 and, in consequence, c i (x) is exactly the number of coordinates with non-zero bits in x. It is also easy to see that c i,T 1 (x) ≥ c i,T 2 (x) if 1 2 ≤ T 1 < T 2 , i.e. the updates are faster for higher T .

Experiments
In this section we evaluate the performance of our algorithm and analyze its behavior in various clustering tasks. We compare its performance with related state-of-theart methods. To denote our method we write SparseMix(β, T ), where β and T are the parameters of its cost function (3.2).
We considered two types of Bernoulli mixture model. The first one is a classical mixture model, which relies on the maximum likelihood principle (ML) [15]. We used the R package "mixtools" [5] for its implementation. The second method is based on classification maximum likelihood (CML) [29]. While ML models every data point as a sample from a mixture of probability distributions, CML assigns every example to a single component. CML coincides with applying entropy as a clustering criterion [9]. We also used two distance-based algorithms. The first one is k-medoids [21], which focuses on minimizing the average distance between data points and corresponding clusters' medoids (generalization of mean). We used R package "cluster" with Jaccard similarity measure 5 . We also considered Cluto software [24], which is an optimized package for clustering large data sets. We ran the algorithm "direct" with a cosine distance function, which means that the package will calculate the final clustering directly, rather than bisecting the data multiple times.
Since all of the aforementioned methods are non-deterministic and optimized in an iterative manner, each one was run 50 times with different initial partitions and the best result was selected according to the method's inner metric.

Quality of clusters
In this experiment we evaluated our method over various binary data sets, summarized in Table 1, and compared its results with related methods listed at the beginning of this section. Since we considered classification data sets, we compared obtained clustering with reference partition. Their agreement was measured by Adjusted Rand Index (ARI), which is a well-known external validation index [22]. It attains maximal value 1 for identical partitions, while for random clustering 6 ARI = 0. We used five text data sets: 20newsgroups, Farm-ads, SMS Spam Collection, Sentiment Labeled Sentences retrieved from UCI repository [1] and Questions dataset taken from [30]. Each data set was encoded in a binary form with use of the setof-words representation. Set-of-words is one of the simplest vector representations of text. Given a dictionary of words, a document (or sentence) is represented as a binary vector, where coordinates indicate the presence or absence of words from a dictionary.
We considered a real data set containing chemical compounds acting on 5-HT 1A receptor ligands [39]. This is one of the proteins responsible for the regulation of the central nervous system. This data set was manually labeled by the expert in a hierarchical form. We narrowed that those classification tree down to 5 classes: tetralines, alkylamines, piperidines, aimides and other piperazines. Each compound was represented by its Klekota-Roth fingerprint, which encodes 4860 chemical patterns in a binary vector [42].
We also took a molecular biology data set (splice), which describes primate splice-junction gene sequences. Moreover, we used data set containing mushrooms described in terms of physical characteristics, where the goal is to predict whether a mushroom is poisonous or edible. Both data sets were selected from UCI repository. Finally, we evaluated all methods on the MNIST data set [27], which is a collection of handwritten digits made into black-and-white images.
All methods were run with the correct number of groups. Since the expected number of groups was given, SparseMix was run with β = 0 to prevent from clusters reduction. We examined its two parametrizations: (a) T = 1 2 , where a cluster representative is taken as the most probable point; (b) T = 1, where a representative is a zero vector.
The results presented in Table 2 shows significant disproportions between two best performing methods (SparseMix and Cluto) and other examined algorithms. The highest differences can be observed in the case of 20newsgroups, farm-adds and  Figure 6: Confusion matrix and clusters representatives returned by applying SparseMix to the MNIST data set. Rows correspond to reference digits, while columns correspond to clusters produced by SparseMix.
SMS data sets. In the case of the questions and sentiment data sets, neither method showed results significantly better than a random partitioning. Let us observe that these sets are extremely sparse, which could make the appropriate grouping of their examples very difficult. For the mushroom example, on the other hand, all methods seemed to perform equally good. Slightly higher differences can be observed on MNIST and chemical data sets, where ML and CML obtained good results. Finally, SparseMix with T = 1 2 significantly outperformed other methods for splice. Although SparseMix, ML and CML focus on optimizing similar cost functions, they use different algorithms, which could be the main reason for differences in their results. SparseMix applies an on-line Hartigan procedure, which updates clusters parameters at every switch, while ML and CML are based on EM algorithm and perform updates after the entire iteration. On-line updates allow for better model fitting and, in consequence, lead to finding better local minimums. This partially explains the more accurate clustering results of SparseMix compared to related mixture models.
To further illustrate the effects of SparseMix we present its detailed results obtained on the MNIST data set. Figure 6 shows a confusion matrix and clusters representatives (first row) produced by SparseMix with T = 1 2 . It is clear that most of the clusters representatives resemble actual hand-drawn digits. It can be seen that SparseMix had trouble distinguishing between the digits 4 and 9, mixing them up a bit in their respective clusters. The digit 5 also could not be properly separated, resulting in its scatter among other clusters. The digit 1 occupied two Figure 7: Ranking of examined methods averaged over all data sets. separate clusters, once for being written vertically and once for being written diagonally. Nevertheless, this example showed that SparseMix is able to find reasonable clusters representatives that reflect their content in a strictly unsupervised way.
To summarize the results, we ranked the methods on each data set (the best performing method got rank 1, second best got rank 2, etc.). Figure 7 presents a box plot of ranks averaged over all data sets. The vertical lines show the range of the ranks, while the horizontal line in the middle denotes the median. It can be seen that both variants of SparseMix were equally good and outperformed other methods. Although the median rank of cluto was only slightly worse, its variance was significantly higher. This means that this model was not well suited for many data sets.

Time comparison
In real-world applications, the clustering algorithm has to process large portions of data in a limited amount of time. In consequence, high computational complexity may disqualify a method from practical usage. In this experiment we focus on comparing the evaluation time of our algorithm with other methods. We tested the dependence on the number of data points as well as on the number of attributes. For the illustration, we considered the chemical data set from previous subsection.
In the first scenario, we randomly selected a subset of data containing a given percentage of instances, while in the second simulation, we chose a given percentage of attributes. The clustering algorithms were run on such prepared subsets of data. The results presented in Figure 8 show that both versions of SparseMix were as fast as the Cluto package, which is an optimized software for processing large data sets. The other algorithms were significantly slower. It might be caused both by a specific clustering procedure as well as by an inefficient programming language used for their implementations.
The interesting thing is that SparseMix with T = 1 2 was often slightly faster than SparseMix with T = 1, which at first glance contradicts the theoretical analysis of our algorithm. To investigate this observation, we counted the number of iterations needed for convergence of both methods. It is clear from the Figure  9 that SparseMix with T = 1 2 needed less iterations to find a local minimum than with T = 1, which fully explains the relation between their running times. SparseMix with T = 1 2 needed less than 20 iterations to converge. Since the scale of the graph is logarithmic, the differences in its cost decreased exponentially. Such a fast convergence follows from that fact that the SparseMix cost function can be optimized by applying on-line Hartigan algorithm (this is computationally impossible to use an on-line strategy for CML or ML models).

Clustering stability
In this experiment, we examined the stability of considered clustering algorithms upon the changing of the data. More precisely, we tested whether a method was able to preserve clustering results when some data instances or attributes were removed. In practical application high stability of an algorithm can be used to speed up the clustering procedure. If a method does not change its result using a lower number of instances or attributes, we can safely perform clustering on a reduced data set and assign the remaining instances to the nearest clusters. We again used the chemical data set for this experiment. In this simulation we only ran SparseMix with T = 1 2 (our preliminary studies showed that parameter T does not visibly influence overall results).
First, we investigated the influence of the number of instances on the clustering results. For this purpose, we performed the clustering of the whole data set X and randomly selected p percentage of its instances X p (we considered p = 0.1, 0.2, . . . , 0.9). Stability was measured by calculating ARI between the clusters X p 1 , . . . , X p k created from the selected fraction of data X p and from the whole data set (restricted to the same instances), i.e. (X 1 ∩ X p ), . . . , (X k ∩ X p ). To reduce the effect of randomness, this procedure was repeated 5 times and the final results were averaged. The results presented in Figure 10(a) show that for a small number of data points Cluto gave the highest stability, but as the number of instances grows SparseMix performed better.
In the second part of the experiment, we examined how the clustering results changed when a smaller number of attributes were taken into account. The procedure was analogical to the previous one: we compared the clustering results obtained on the whole data set with the ones produced on data set with randomly selected p percentage of attributes (as before we considered p = 0.1, 0.2, . . . , 0.9). One can observe in Figure 10(b) that SparseMix obtained the highest stability on all subsets of data. The performance of Cluto was significantly worse than previously -in particular, ML showed higher stability.

Sensitivity to imbalanced data
In the following section, we examined sensitivity of the clustering algorithms to data imbalance. This extends theoretical analysis presented in Example 3.1.
First, we examined whether the algorithm is able to detect clusters of different sizes. For this purpose, we considered a data set X ⊂ {0, 1} D , for D = 100 and |X| = 1000, generated from a distribution ωP (p, α, d) where p = 0.1, α = 0.05 and d = D/2 were fixed and ω changed from 0 to 1. We refer the reader to Example 3.1 for the definition of distribution P and its interpretation. The mixing parameter ω induces the fraction of examples produced by these two sources. We would expect that a clustering method will be able to discover true distributions, so the resulting sizes of the clusters will be, roughly, ω|X| and (1 − ω)|X|. However, as ω approaches either to 0 or 1, the data becomes very imbalanced, which makes the task of separating them more difficult. We considered  Figure 11: The ratio of cluster sizes for data sets generated from imbalanced sources: we varied the number of instances generated from each source 11(a) and the number of attributes characteristic for each source 11(b).
SparseMix with β = 0 and β = 1 to account for different costs of maintaining clusters (our preliminary studies showed that parameter T does not visibly influence overall results and thus we used T = 1 2 ). Figure 11(a) reports the fraction of the data that belongs to the first cluster. The optimal solution is y = ω. We can see that the k-medoids method did not respond to the changes in ω. Other algorithms seemed to perform well on the mid section, but gradually steered off the optimal line as ω approached to 0 or 1. The highest robustness to imbalanced data was obtained by ML and SparseMix with β = 1 (cost of clusters identification was taken into account). If the cost of maintaining clusters is not considered (β = 0), then SparseMix tends to create more balanced groups. These results are consistent with a discussion outlined before Definition 3.2 and in Example 3.1.
In the second experiment, we investigated the influence of attributes imbalance on the clustering results. For this purpose we sampled a data set from the mixture of distributions given by: where p = 0.1, α = 0.05, |X| = 1000 and D = 100 were constants, while d ranged from 0 to D. When d < D, then the second source is identified by a smaller number of bits than the first one. Therefore, by changing the value of the parameter d we scale the number of features characteristic for components. This time we expect that the clusters will remain equally-sized regardless of the parameter d. Figure 11(b) presents the fraction of data that belongs to the first cluster (perfect solution is given by y = 1 2 ). It can be observed that SparseMix with β = 1 was very sensitive to attributes imbalance. According to conclusion given in Example 3.1, the cost of encoding elements within a cluster is outweighed by the cost of clusters identification, as α → 0 (or 1), which results in the reduction of the lighter group. Since the data is sampled from an underlying distribution and SparseMix flows to a local minimum, some attempts result in creating one group, while the others produce two clusters, which explains why the corresponding line is not equal to 1, for α < 0.2. This effect was not apparent when SparseMix used β = 0, because there was no cost of creating an additional cluster. Its results were comparable to ML and CML, which also do not use any cost of clusters identification.

Conclusion
In this paper, we proposed SparseMix, a new approach for clustering of sparse high dimensional binary data. Our results showed that SparseMix is not only more accurate than related model-based clustering algorithms, but also significantly faster. Its evaluation time is comparable to algorithms implemented in the Cluto package, the software optimized for processing large data sets, but its clusters quality is better. SparseMix provides a description of each cluster by its representative and the dispersion from this representative. Experimental results demonstrated that representatives obtained for the MNIST data set provide high resemblance with original representatives of handwritten digits. The model was theoretically analyzed.

A Proof of Theorem 3.1
We will show that cost T 2 (X) − cost T 1 (X) ≥ 0. We have: Observe that Z(T 1 ) ≤ Z(T 2 ) and thus log Z(T 2 ) Z(T 1 ) ≥ 0. Consequently, The above expression is non-negative if only the function: is non-negative for every T 1 ≤ p ≤ T 2 .

B Clusters reduction -details of Example 3.1
We compare the cost of using a single cluster for all instances with the cost of splitting the data into two optimal groups (first ω|X| examples are assigned to the first group while the remaining instances are assigned to the second cluster). For the convenience of calculations, we define the function: The conditional probability Q 1 i that the i-th position holds a non-zero value in the first cluster equals: while for the second group: The cost of one cluster can be written as follows: It is more profitable to use one cluster instead of two if (11) is lower than (10). Since it is difficult to analyze this relation in general, we consider three special cases: 1. Dimensions are balanced. We fix the dimension parameter d = 1 2 D. Then D(α, d) = D(ω, d) = D(β, d) = d and the formula (10) simplifies to: cost(X 1 , X 2 ) = pd (log d + h(α, 1 − α)) + h(ω, 1 − ω), while (11) equals: cost(X) = pd (h(β, 1 − β) + log d) .
3. Both dimensions and sources are balanced. For fixed d = 1 2 D and ω = 1 2 the cost of two clusters is given by cost(X 1 , X 2 ) = dp(h(α, 1 − α) + log d) + log 2, while for one cluster we have cost(X) = pd log D.