Constrained clustering with a complex cluster structure

In this contribution we present a novel constrained clustering method, Constrained clustering with a complex cluster structure (C4s), which incorporates equivalence constraints, both positive and negative, as background information. C4s is capable of discovering groups of arbitrary structure, e.g. with a multi-modal distribution, because at the initial stage the equivalence classes of elements generated by the positive constraints are split into smaller parts. This provides a detailed description of the elements that are in a positive equivalence relation. To enable automatic detection of the number of groups, cross-entropy clustering is applied in each partitioning process. Experiments show that the proposed method achieves significantly better results than previous constrained clustering approaches. The advantage of our algorithm increases when the target partition has a complex cluster structure.


Introduction
Clustering is one of the most important and efficient tools for processing massive amounts of data (Xu and Wunsch 2005). For this reason, it is commonly used in many fields of computer science including data mining, pattern recognition, machine learning and data compression (Pavel 2002; Jain et al. 1999; Hruschka et al. 2009; Śmieja and Tabor 2015a). Clustering has no access to class labels, and its results depend only on the values of the features describing each object (Collingwood and Lohwater 2004; Xu and Wunsch 2009). In real applications, the groups that we want to extract can be too complex to be discovered by strictly unsupervised algorithms (Bar-Hillel et al. 2003). Therefore, in order to support the analysis or visualization of data, the user often provides additional information to indicate the crucial values, parts of a graph, etc. Consequently, the clustering process is expected to take advantage of such background knowledge to provide better results.
Constrained clustering is a part of semi-supervised learning (Basu et al. 2002). It incorporates equivalence constraints between some pairs of elements to enforce which of them belong to the same group (positive constraints or must-link constraints) and which do not (negative constraints or cannot-link constraints) (Wagstaff et al. 2001). It has been widely used in various real-world applications such as GPS-based map refinement (Wagstaff et al. 2001) or landscape detection from hyperspectral data (Lu and Leen 2004). On the other hand, in semi-supervised classification, learning involves the use of only a small amount of labeled data together with a large amount of unlabeled elements (Bennett and Demiriz 1998). The pairwise constraints used in clustering cannot be directly transformed into class labels, which makes a conceptual difference between semi-supervised clustering and classification.
Numerous clustering algorithms have been modified to incorporate additional information from equivalence constraints. Most of the adapted methods, including k-means (Wagstaff et al. 2001), the Gaussian mixture model (Shental et al. 2004; Melnykov et al. 2015), hierarchical algorithms (Klein et al. 2002) and spectral methods (Li et al. 2009), generate partitions that are fully consistent with the imposed restrictions (they define hard constraints). It is worth mentioning that there are also methods in which the constraints come in the form of suggestions that can occasionally be violated (Bilenko et al. 2004; Lu and Leen 2004; Wang and Davidson 2010); however, in this paper we do not follow such an approach.
A version of the Gaussian mixture model (GMM) proposed by Shental et al. (2004) is one of the most interesting clustering approaches that impose hard restrictions. Equivalence constraints are used to gather points into chunklets, i.e. sets of elements that are required to be included in the same cluster. Chunklets may be obtained by applying the transitive closure to the set of positive constraints, which generates equivalence classes. The algorithm fits a mixture of Gaussians to unlabeled data together with the constructed chunklets by summing over assignments that comply with the constraints. Unfortunately, this method does not handle well the situation presented in Fig. 1. The equivalence constraints (Fig. 1b) enforce that the "ears elements" of the mouse-like set belong to the same group. However, a direct application of constrained GMM assigns to them one Gaussian model, which ultimately groups the "ears elements" together with some of the "head elements" (Fig. 1c).
Fig. 1 Clustering of mouse-like set (a). The equivalence constraints (positive and negative) specified on 10 % of the data elements determine that "ears elements" are included in one group and "head elements" in the second (b). In contrast to the constrained version of GMM (c), our algorithm has discovered an expected partition of this dataset (d)
In this paper we propose a general constrained clustering algorithm, called Constrained clustering with a complex cluster structure (C4s), which incorporates equivalence constraints, both positive and negative, and deals well with the aforementioned problem (Fig. 1d). The idea of C4s relies on the observation that every chunklet can originate from a complex model, e.g. a mixture of probability distributions. To find a detailed description of each chunklet, we cluster its elements individually at the initial stage of the algorithm, which yields small groups of data points. The obtained groups are used as the atomic parts of data for the final clustering process. To ensure that no negative constraint is violated during the clustering process, we formulate two strict conditions, see Theorems 2 and 3.
Various clustering algorithms can be applied to implement the C4s approach. However, since the number of components for each chunklet is not specified a priori, it is preferable to use an algorithm that detects the number of clusters automatically. Therefore, we combine our method with cross-entropy clustering (CEC) (Tabor and Spurek 2014; Spurek et al. 2013; Tabor and Misztal 2013; Śmieja and Tabor 2013, 2015b), which can be seen as a model-based clustering (McLachlan and Krishnan 2008; Morlini 2012; Subedi and McNicholas 2014; Baudry et al. 2015) and determines the final number of groups.
In order to evaluate the performance of C4s, we applied it to semi-supervised image segmentation and compared it with competitive constrained methods on several datasets. In particular, we used examples retrieved from the UCI repository (Lichman 2013) as well as a real dataset of chemical compounds acting on the central nervous system (Warszycki et al. 2013). Our experiments show that C4s gives results comparable to other constrained methods when the chunklets are represented by simple models. In the case of a more complicated structure of restrictions, C4s significantly outperforms the competitive techniques.
The paper is organized as follows. The next section introduces the general C4s procedure (a combinatorial approach to constrained clustering with a complex cluster structure). In the third section, we show how to implement the C4s algorithm with the use of the CEC approach. The experimental results are presented in the fourth section. Finally, a conclusion is given.

General C4S algorithm
In this section, we present a general form of the proposed clustering procedure with equivalence constraints. We do not specify the clustering criterion here (e.g. cost function, dissimilarity measure, etc.), but focus on defining generic steps that can be accomplished using any clustering method. In other words, we consider the clustering problem as a combinatorial one, in which the goal is to construct clusters that comply with the imposed restrictions. A specific implementation will be discussed in the next section.

Problem formulation
Let X = {1, . . ., n} be an n-element dataset of objects augmented by equivalence constraints between some pairs of its elements. A positive constraint enforces that two elements belong to the same group, while a negative constraint states that they are classified separately. The input restrictions are given in the form of a set of pairs (x, y) ∈ X × X, where the following notation is used:
• x ∼ y denotes that a positive constraint is imposed on x and y (they have to be included in the same class),
• x ≁ y denotes that a negative constraint is imposed on x and y (they have to be included in different classes).
Let us observe that the set of positive (or negative) constraints determines a relation on X. For this reason, we sometimes say that two elements are in positive (or negative) relation. In particular, the set of positive constraints defines an equivalence relation (after taking its reflexive, symmetric and transitive closure). In consequence, the input set of restrictions generates equivalence classes (called chunklets by Shental et al. 2004). An equivalence class contains those elements of the dataset that are in positive relation (directly or transitively) and must finally be grouped together. On the other hand, the negative relation is not transitive, which usually makes negative constraints harder to incorporate into clustering algorithms.
We say that $X_1, \ldots, X_k$, for $k \in \mathbb{N}$, is a partition of $X$ if it is a family of pairwise disjoint subsets of $X$ such that $X = \bigcup_{i=1}^{k} X_i$. Our goal is to construct a partition of $X$ that complies with the assumed constraints (the cardinality of the partition is not specified). We assume that there exists at least one partition of $X$ that does not violate any constraint. This is the case when there is no pair of elements that belong to the same equivalence class but are in negative relation.
In this paper we are interested in defining a flexible framework that allows for finding the most appropriate cluster models for a particular dataset and set of constraints. To illustrate this goal, we recall the example of the mouse-like set presented in Fig. 1. The application of standard unconstrained GMM results in detecting three spherically shaped clusters. However, after the specification of the equivalence relation, the clustering method is supposed to discover a clustering configuration that complies with the imposed restrictions and fits the dataset best. In the case of model-based clustering we want to be able to automatically select two Gaussian components for describing the "ears of the mouse" and one Gaussian for its "head", see Fig. 1d.
We start with a description of the clustering process in the case of positive constraints only (Sect. 2.2). The entire procedure can be divided into four main steps:
1. aggregation of positive constraints,
2. inner clustering,
3. global clustering,
4. merging.
The incorporation of negative constraints requires a modification of the global clustering step, in which we have to create a grouping that is consistent with all constraints after the merging operation. We formulate two conditions that allow us to verify whether a given partition constructed in global clustering can be merged into a partition that does not violate any constraint, see Theorems 2 and 3.

Incorporating positive constraints
In this section, we assume that a dataset X is augmented by positive constraints only.
The proposed procedure relies on the four steps outlined in the previous subsection and explained in the following paragraphs.
Aggregation of positive constraints. In the first step, we follow the arguments introduced by Shental et al. (2004) and consider the equivalence classes generated by the set of positive constraints. More precisely, for $x \in X$, we construct the set $[x] = \{y \in X : x \sim y\}$.
One-element equivalence classes are created for the elements that are not involved in any positive constraint. Since for every $x, y \in X$ it holds that $[x] = [y]$ or $[x] \cap [y] = \emptyset$, we obtain $X = \bigcup_{i=1}^{k} C_i$, where $(C_i)_{i=1}^{k}$ is the family of all pairwise disjoint equivalence classes in $X$. For further use, each equivalence class $C_i$ will hereafter be referred to as an initial chunklet.
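The aggregation step amounts to computing connected components of the positive-constraint graph. The following is a minimal sketch, assuming elements are indexed 0..n-1 (the text uses 1..n) and constraints are given as lists of index pairs; the function names `initial_chunklets` and `is_feasible` are illustrative and do not come from the paper.

```python
def initial_chunklets(n, positive):
    """Group elements 0..n-1 into equivalence classes of the positive relation."""
    parent = list(range(n))

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in positive:
        parent[find(x)] = find(y)

    classes = {}
    for x in range(n):
        classes.setdefault(find(x), []).append(x)
    return sorted(classes.values())


def is_feasible(chunklets, negative):
    """True if no negative constraint links two elements of one chunklet."""
    owner = {x: i for i, members in enumerate(chunklets) for x in members}
    return all(owner[x] != owner[y] for x, y in negative)


chunklets = initial_chunklets(6, [(0, 1), (1, 2), (4, 5)])
print(chunklets)                          # [[0, 1, 2], [3], [4, 5]]
print(is_feasible(chunklets, [(0, 4)]))   # True
```

Elements that appear in no positive constraint, such as element 3 above, end up in one-element chunklets. The feasibility test corresponds to the assumption that at least one constraint-respecting partition exists.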
Inner clustering. Obviously, all elements associated with an initial chunklet should finally be included in the same cluster. However, every group can have a very complex structure, e.g. its elements might not be generated from a single Gaussian probability distribution, but from a mixture of Gaussian distributions, as presented in Fig. 1. Roughly speaking, inner clustering attempts to discover atomic groups in X that are described by simple models.
For this reason, we want to construct a separate partition of every initial chunklet. As mentioned, we do not specify the type of clustering method here, but assume that its choice is an independent task, which will be discussed in the next section. As a result of inner clustering, every initial chunklet $C_i$, for $i = 1, \ldots, k$, is split into $C_i = C^{(i)}_1 \cup \ldots \cup C^{(i)}_{k_i}$. We assume that the number of groups $k_i$ created for the $i$-th initial chunklet is not greater than its cardinality. In particular, for a set with only one element we obtain a trivial partition containing just one singleton class.
Retrieved groups $C^{(i)}_j$ are referred to as final chunklets. Moreover, we say that a final chunklet $C^{(i)}_j$ is derived from an initial chunklet $C_i$ if $C^{(i)}_j \subset C_i$. Observe that the set of final chunklets $\{C^{(i)}_j : j = 1, \ldots, k_i,\ i = 1, \ldots, k\}$ is a partition of $X$.
In order to illustrate the inner clustering process, the cross-entropy implementation of C4s (presented in the next section) was applied to the Banana-like set, Fig. 2. This is an example of data arranged around two parabolic manifolds. Sets of this kind are becoming increasingly popular due to the manifold hypothesis, which states that real-world data embedded in high-dimensional spaces are likely to concentrate in the vicinity of nonlinear sub-manifolds of lower dimensionality, see Cayton (2005); Narayanan and Mitter (2010). Clustering enables the detection of such manifolds. Interestingly, diverse shapes, such as the Banana-like one, often appear in medical sciences, e.g. in a muscle injury determination system (Ding et al. 2011). In order to determine muscle injury from ultrasonic images of healthy and unhealthy muscle, a specific shape of fiber has to be detected.

Global clustering
Let $\mathcal{C}$ be the family of all final chunklets discovered in the inner clustering stage. By global clustering we understand the process of constructing a partition $\mathcal{P}$ of $\mathcal{C}$. In consequence, each cluster $P \in \mathcal{P}$ will be a family of some final chunklets. If every final chunklet is described by one simple model, then we group similar chunklets together and, in consequence, obtain clusters that can follow mixture models.
Let us observe that $\mathcal{C}$ is a family of subsets of $X$. Therefore, the clustering algorithm has to be adapted to process a dataset that consists of subsets of $X$. To facilitate such a procedure one can represent every final chunklet as a single element and apply a clustering algorithm in the classical way. The idea is best illustrated with an example: in the case of k-means one can represent a final chunklet by its mean with a weight depending on the cardinality of the chunklet. On the other hand, in the Gaussian mixture model approach every set is naturally related to a probability model characterized by its sample mean and covariance matrix. These adaptations allow us to apply weighted versions of classical clustering algorithms that operate not on $\mathcal{C}$ itself but on its transformed representatives. For more details, we refer the reader to the next section, in which the CEC implementation is presented after such an adaptation.
Merging. Let $\mathcal{P}$ be a partition of $\mathcal{C}$ created by a global clustering procedure, i.e., each $P \in \mathcal{P}$ is a family of some final chunklets. This splitting might be inconsistent with some of the positive constraints, e.g. it is possible to assign $C^{(i)}_{i_1}$ to $P$ and $C^{(i)}_{i_2}$ to $P'$, for $P, P' \in \mathcal{P}$ such that $P \neq P'$, whereas $C^{(i)}_{i_1}, C^{(i)}_{i_2}$ are derived from the same initial chunklet $C_i$. In the merging stage, we join two groups $P, P' \in \mathcal{P}$ if they contain final chunklets derived from the same initial chunklet.
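For every $P \in \mathcal{P}$, we want to encode the information about the clusters that must be joined with $P$ to ensure that none of the positive constraints is violated. Let $\mathrm{cl} : \mathcal{P} \to 2^{\mathcal{P}}$ be a function such that:
- $P' \in \mathrm{cl}(P)$, for $P' \in \mathcal{P}$, if there exist two final chunklets $C^{(i)}_{i_1} \in P$ and $C^{(i)}_{i_2} \in P'$ derived from the same initial chunklet $C_i$.
The value $\mathrm{cl}(P)$ is the set of clusters $P'$ such that $P$ and $P'$ contain final chunklets $C^{(i)}_{i_1}$ and $C^{(i)}_{i_2}$, respectively, derived from the same $C_i$. In other words, $P'$ is connected directly with $P$ by some initial chunklet.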
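Let us observe that if $P' \in \mathrm{cl}(P)$ and $P'' \in \mathrm{cl}(P')$, for $P, P', P'' \in \mathcal{P}$, then $P, P', P''$ have to be finally included in the same cluster, even if $P'' \notin \mathrm{cl}(P)$. To encode such a transitive relation, a sequence of functions $(\mathrm{cl}^{(t)})_{t \in \mathbb{N}}$ is defined recursively by:
$$\mathrm{cl}^{(t)}(P) := \begin{cases} \mathrm{cl}(P) & \text{for } t = 1, \\ \bigcup \{\mathrm{cl}(P') : P' \in \mathrm{cl}^{(t-1)}(P)\} & \text{for } t > 1, \end{cases}$$
for $P \in \mathcal{P}$. We put
$$\mathrm{cl}^{\infty}(P) := \bigcup_{t=1}^{\infty} \mathrm{cl}^{(t)}(P), \quad \text{for } P \in \mathcal{P},$$
to denote all clusters that have to be joined together in the merge stage, see Fig. 3 for an illustration. One can interpret $\mathrm{cl}^{\infty}(P)$ as a closure of $\{P\}$ with respect to the positive relation.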
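The above construction of $\mathrm{cl}^{\infty}$ provides that, for $P, P' \in \mathcal{P}$, either $\mathrm{cl}^{\infty}(P) = \mathrm{cl}^{\infty}(P')$ or $\mathrm{cl}^{\infty}(P) \cap \mathrm{cl}^{\infty}(P') = \emptyset$. In other words, $\mathrm{cl}^{\infty}$ generates equivalence classes on $\mathcal{P}$. If we denote by $\mathcal{Q}$ the family of all different equivalence classes generated by $\mathrm{cl}^{\infty}$, then $\mathcal{Q}$ induces a partition of $\mathcal{C}$. To obtain a partition $\mathcal{X} = \{X_Q\}_{Q \in \mathcal{Q}}$ of $X$ that corresponds to $\mathcal{Q}$, we transform every $Q \in \mathcal{Q}$ by:
$$X_Q := \bigcup_{P \in Q} \bigcup_{C \in P} C.$$
We see that $\mathcal{X}$ is consistent with all positive constraints, as outlined in the following theorem:
Theorem 1 Let $\mathcal{P}$ be a partition of the family of final chunklets $\mathcal{C}$ and let $\mathcal{Q}$ be the family of all equivalence classes generated by $\mathrm{cl}^{\infty}(P)$, for $P \in \mathcal{P}$. Then $\mathcal{X} = \{X_Q\}_{Q \in \mathcal{Q}}$ defined by $X_Q := \bigcup_{P \in Q} \bigcup_{C \in P} C$ is a partition of $X$ which is consistent with all positive constraints.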
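The merging stage can be read as computing connected components over clusters, where two clusters are linked whenever they hold final chunklets of the same initial chunklet. The sketch below assumes a hypothetical encoding in which a final chunklet is a pair `(i, j)` (part `j` of initial chunklet `i`) and a cluster is a set of such pairs; it illustrates the closure idea and is not the authors' implementation.

```python
def merge_clusters(partition):
    """Join clusters that share final chunklets of one initial chunklet.

    `partition` is a list of clusters; each cluster is a set of final
    chunklets written as (i, j). Returns the merged clusters as a list
    of sets of final chunklets.
    """
    parent = list(range(len(partition)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    first_seen = {}  # initial chunklet id -> first cluster holding one of its parts
    for idx, cluster in enumerate(partition):
        for i, _ in cluster:
            if i in first_seen:
                parent[find(idx)] = find(first_seen[i])
            else:
                first_seen[i] = idx

    merged = {}
    for idx, cluster in enumerate(partition):
        merged.setdefault(find(idx), set()).update(cluster)
    return list(merged.values())


# Cluster 0 and cluster 2 share no initial chunklet directly, but both are
# linked to cluster 1, so all three are merged (the transitive case).
partition = [{(0, 0)}, {(0, 1), (1, 0)}, {(1, 1)}, {(2, 0)}]
result = merge_clusters(partition)
print(result)
```

The union of the data points of all final chunklets in one merged group then gives one cluster of the final partition of X.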
Figure 4 demonstrates the results of global clustering and merging for the Banana-like set.

Incorporating negative constraints
In this section we assume that both positive and negative constraints are defined on selected pairs of X. The procedure proposed in the previous subsection does not use the information contained in negative constraints. To apply this algorithm in the case of negative constraints one has to modify the global clustering step. For this purpose, we formulate two conditions that allow us to verify whether a given partition obtained in the global clustering stage can be merged into a partition that is consistent with the negative constraints.
For convenience, we say that two initial chunklets $C_i, C_j \subset X$ are in negative relation, which we denote by $C_i \nsim C_j$, if there exist $x \in C_i$, $y \in C_j$ such that $x \nsim y$. In other words, since all the elements of any initial chunklet have to be finally included in a single cluster (after the merge stage), the negative constraints can be propagated and verified on the set of initial chunklets. Moreover, we say that a partition $\mathcal{P}$ of $\mathcal{C}$ is valid if in the merging stage it generates a partition of $X$ that is consistent with all negative constraints.
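The following result shows how to verify the validity of a partition $\mathcal{P}$ of $\mathcal{C}$ based on the equivalence classes generated by $\mathrm{cl}^{\infty}$:
Theorem 2 A partition $\mathcal{P}$ of $\mathcal{C}$ is not valid if and only if there exist $P \in \mathcal{P}$ and final chunklets $C^{(i)}_{i_1}, C^{(j)}_{j_1} \in \bigcup\{P' : P' \in \mathrm{cl}^{\infty}(P)\}$ such that $C_i \nsim C_j$.
Proof Let us first assume that a partition $\mathcal{P}$ is not valid, i.e. the merge operation generates a partition $\mathcal{X}$ of $X$ which is not consistent with at least one constraint. Therefore, there exist a cluster $Y \in \mathcal{X}$ and $x, y \in Y$ such that $x \nsim y$. One can find two final chunklets $C^{(i)}_{i_1}, C^{(j)}_{j_1}$ such that $x \in C^{(i)}_{i_1}$ and $y \in C^{(j)}_{j_1}$, hence $C_i \nsim C_j$. Since both $x$ and $y$ belong to $Y$, there exists $P \in \mathcal{P}$ such that $C^{(i)}_{i_1}, C^{(j)}_{j_1} \in \bigcup\{P' : P' \in \mathrm{cl}^{\infty}(P)\}$.
On the other hand, let us assume that there exist $P \in \mathcal{P}$ and final chunklets $C^{(i)}_{i_1}, C^{(j)}_{j_1} \in \bigcup\{P' : P' \in \mathrm{cl}^{\infty}(P)\}$ such that $C_i \nsim C_j$. Then $C^{(i)}_{i_1}$ and $C^{(j)}_{j_1}$ will be joined together in the merge stage, which does not lead to a valid partition because $C_i \nsim C_j$.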
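A direct way to test validity is to compute which initial chunklets end up together after merging and then look for a negative pair inside one merged group. The sketch below assumes the same hypothetical `(i, j)` encoding of final chunklets used for illustration (part `j` of initial chunklet `i`); the helper names are not from the paper.

```python
def lift_negative(chunklets, negative):
    """Propagate element-level negative constraints to initial chunklets."""
    owner = {x: i for i, members in enumerate(chunklets) for x in members}
    return {frozenset((owner[x], owner[y])) for x, y in negative}


def merged_ids(partition):
    """Sets of initial chunklet ids that the merge stage would join."""
    groups = []
    for cluster in partition:
        ids = {i for i, _ in cluster}
        keep = []
        for group in groups:
            if group & ids:
                ids |= group       # linked through a shared initial chunklet
            else:
                keep.append(group)
        groups = keep + [ids]
    return groups


def is_valid(partition, neg_pairs):
    """False if some merged group holds two chunklets in negative relation."""
    return not any(pair <= group
                   for group in merged_ids(partition)
                   for pair in neg_pairs)


chunklets = [[0, 1, 2], [3, 4], [5]]
neg_pairs = lift_negative(chunklets, [(0, 5)])   # chunklet 0 vs chunklet 2
ok = [{(0, 0), (1, 0)}, {(0, 1)}, {(2, 0)}]
bad = [{(0, 0), (1, 0)}, {(1, 1), (2, 0)}]
print(is_valid(ok, neg_pairs), is_valid(bad, neg_pairs))   # True False
```

In the `bad` partition no single cluster contains parts of both chunklet 0 and chunklet 2, yet merging joins the two clusters through chunklet 1, which is why the test must look at merged groups rather than individual clusters.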
In many clustering algorithms, such as k-means, we start with a fixed partition and focus on its iterative refinement by switching elements between clusters. Clearly, one could use Theorem 2 to verify whether a given reassignment leads to a valid partition. Nevertheless, this operation might be computationally inefficient for this type of algorithm. Therefore, we formulate a condition that states when we are permitted to change the membership of a final chunklet from one cluster to another while preserving the validity of a partition.
Let $\mathcal{P}$ be a fixed partition of $\mathcal{C}$ and let $C^{(i)}_j \in P'$, for $P' \in \mathcal{P}$, be a final chunklet derived from an initial chunklet $C_i$. If we change the membership of $C^{(i)}_j$ from $P'$ to a cluster $P \in \mathcal{P}$, then $\mathrm{cl}^{\infty}(P)$ can change only if $\mathrm{cl}^{\infty}(P') \cap \mathrm{cl}^{\infty}(P) = \emptyset$. If $\mathrm{cl}^{\infty}(P) = \mathrm{cl}^{\infty}(P')$ then such a reassignment has no effect on $\mathrm{cl}^{\infty}(P)$. Therefore, at each attempt of reassigning $C^{(i)}_j$ from $P'$ to $P$, we have to verify whether there is any pair of clusters in $(\mathrm{cl}^{\infty}(P \cup \{C^{(i)}_j\}) \setminus \mathrm{cl}^{\infty}(P)) \times \mathrm{cl}^{\infty}(P)$ which breaks any negative constraint, i.e. contains final chunklets derived from initial chunklets which are in negative relation.
The following lemma will be useful to establish the form of cl ∞ (P∪{C Lemma 1 Let P be a partition of a family of final chunklets C. We consider P, P ∈ P and a final chunklet j has no effect on cl ∞ (P) and in consequence cl ∞ (P ∪ {C (i)  j })\cl ∞ (P) = ∅.
We put: $\partial(C^{(i)}_j) := \{P \in \mathcal{P} : C^{(i)}_l \in P$, where $C^{(i)}_l$ is a final chunklet derived from $C_i$ and $l \neq j\}$, which can be considered as a boundary of $C^{(i)}_j$. An illustrative explanation of the above definitions is presented in Fig. 3.
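The following result allows us to check whether a given reassignment operation preserves the validity of a partition:
Theorem 3 Let $\mathcal{P}$ be a valid partition of $\mathcal{C}$. We assume that $P, P' \in \mathcal{P}$ are fixed and we consider a final chunklet $C^{(i)}_{i_1} \in P'$. Let $\mathcal{Q}$ be a partition generated from $\mathcal{P}$ by changing the membership of $C^{(i)}_{i_1}$ from $P'$ to $P$, i.e. $\mathcal{Q} = (\mathcal{P} \setminus \{P, P'\}) \cup \{P \cup \{C^{(i)}_{i_1}\},\ P' \setminus \{C^{(i)}_{i_1}\}\}$.
Partition $\mathcal{Q}$ is valid if one of the following conditions is satisfied:
1. $\mathrm{cl}^{\infty}(P) = \mathrm{cl}^{\infty}(P')$,
2. for all pairs of final chunklets $(C^{(j)}_{j_1}, C^{(l)}_{l_1})$ such that $C^{(j)}_{j_1} \in \bigcup\{P'' : P'' \in \mathrm{cl}^{\infty}(P)\}$ and $C^{(l)}_{l_1} \in \bigcup\{P'' : P'' \in \partial(C^{(i)}_{i_1})\}$, the initial chunklet $C_j$ is not in negative relation with $C_l$, where $C^{(j)}_{j_1}$ and $C^{(l)}_{l_1}$ are derived from initial chunklets $C_j$, $C_l$, respectively.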
Proof Clearly, if $\mathrm{cl}^{\infty}(P) = \mathrm{cl}^{\infty}(P')$ then $\mathcal{Q}$ is valid because $\mathcal{P}$ is valid.
Let us suppose, for contradiction, that condition 2 holds and partition $\mathcal{Q}$ is not valid. Since $\mathcal{P}$ is valid, neither $\bigcup\{P'' : P'' \in \mathrm{cl}^{\infty}(P)\}$ nor $\bigcup\{P'' : P'' \in \mathrm{cl}^{\infty}(P')\}$ contains final chunklets derived from initial chunklets that are in negative relation. Therefore, there exist $C$ [...]
If we perform the global clustering stage employing a clustering algorithm that iteratively switches elements of the dataset between clusters, then Theorem 3 determines all acceptable reassignments. We start with any valid partition. At each reassigning step we verify whether it leads to a valid partition and only then consider a possible change of membership. One can use the following pseudocode to perform the reassigning operation. for all $C^{(j)}$ [...]
To reduce the complexity of the above algorithm it is enough to collect in each cluster $P \in \mathcal{P}$ the family $L(P) = \{C^{(i)}_{i_1} \in P : C_i \text{ has any negative constraint}\}$ of all final chunklets which are derived from initial chunklets having any negative constraint. The iterations given in lines 2.3 and 2.3 are then performed only through final chunklets from $L(P)$. In consequence, the reassignment of $C \in P'$ to $P$ takes $\big(\sum_{P'' \in \mathrm{cl}^{\infty}(P)} |L(P'')|\big) \cdot \big(\sum_{P'' \in \partial(C)} |L(P'')|\big)$ operations. Since the cardinalities of $\mathrm{cl}^{\infty}(P)$ and $\partial(C)$ depend on the number of positive constraints, while the cardinality of $L(\cdot)$ is proportional to the number of negative constraints, one may say that the cost of verifying the reassignment operation can be approximated by the total number of negative and positive constraints.
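As a reference point for the reassignment check, one can always fall back on a brute-force test: tentatively apply the move and re-run the validity criterion on the whole partition. The sketch below does exactly that. It is deliberately not the incremental procedure of Theorem 3 (which inspects only the affected clusters), and the `(i, j)` encoding of final chunklets and all function names are assumptions made for illustration.

```python
def is_valid(partition, neg_pairs):
    """Validity test: no merged group may hold chunklets in negative relation."""
    groups = []
    for cluster in partition:
        ids = {i for i, _ in cluster}
        keep = []
        for group in groups:
            if group & ids:
                ids |= group
            else:
                keep.append(group)
        groups = keep + [ids]
    return not any(pair <= group for group in groups for pair in neg_pairs)


def can_reassign(partition, chunklet, src, dst, neg_pairs):
    """True if moving `chunklet` from cluster `src` to `dst` keeps validity."""
    trial = [set(cluster) for cluster in partition]
    trial[src].remove(chunklet)
    trial[dst].add(chunklet)
    trial = [cluster for cluster in trial if cluster]   # drop emptied clusters
    return is_valid(trial, neg_pairs)


# Initial chunklets 1 and 2 are in negative relation.
neg_pairs = {frozenset({1, 2})}
partition = [{(0, 0)}, {(0, 1), (1, 0)}, {(2, 0)}]
print(can_reassign(partition, (0, 1), 1, 2, neg_pairs))   # True
print(can_reassign(partition, (1, 0), 1, 2, neg_pairs))   # False
```

The brute-force version rebuilds all merged groups on every attempt, so it is useful mainly for checking a faster incremental implementation on small examples.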

Implementation with use of model-based clustering and cross-entropy
In this section we present an implementation of the proposed C4s procedure that employs the cross-entropy clustering method (CEC). This is a technique based on information-theoretic concepts, which has effects similar to classical model-based clustering. Moreover, it automatically detects the final number of groups, which is extremely important in our procedure due to the presence of the inner and global clustering phases.
In this section we assume that $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^N$ is a dataset of $N$-dimensional real-valued vectors. We start by presenting the CEC method and its comparison with classical model-based clustering. Then, we show how to implement the C4s procedure with the help of CEC.

Mixture of Gaussian models
The idea of model-based clustering comes from Wolfe (1963) and has become increasingly popular across diverse applications (Bellas et al. 2013; McNicholas and Murphy 2010; Xiong et al. 2002; Samuelsson 2004). Although a variety of finite mixture models has been extensively studied and developed in the literature (Baudry et al. 2015; Subedi and McNicholas 2014; Lee and McLachlan 2013; Morris et al. 2013), the Gaussian case has received special attention (Morlini 2012; Nguyen and McLachlan 2015; Hennig 2010; Scrucca and Raftery 2015).
Basically, model-based clustering focuses on density estimation of a dataset $X$ by a mixture of simple densities. It aims to find
$$p_1, \ldots, p_k \geq 0, \quad \sum_{i=1}^{k} p_i = 1, \tag{1}$$
and $f_1, \ldots, f_k \in \mathcal{F}$, where $k \in \mathbb{N}$ is fixed and $\mathcal{F}$ is a parametric (usually Gaussian) family of densities, such that the convex combination
$$f = p_1 f_1 + \ldots + p_k f_k$$
estimates the unknown probability distribution on the dataset $X$ (McLachlan and Krishnan 2008). This is a fuzzy-type clustering, where the probability of assigning $x \in X$ to the $i$-th cluster is proportional to $p_i f_i(x)$. A locally optimal solution, which minimizes the negative log-likelihood function
$$-\frac{1}{|X|} \sum_{x \in X} \log\big(p_1 f_1(x) + \ldots + p_k f_k(x)\big),$$
where $|X| = n$ is the cardinality of $X$, can be found by applying the EM algorithm.
The goal of CEC is similar: it aims at finding numbers $p_1, \ldots, p_k$ that satisfy (1) and densities $f_1, \ldots, f_k \in \mathcal{F}$ which minimize the following cost function:
$$-\frac{1}{|X|} \sum_{x \in X} \log\big(\max(p_1 f_1(x), \ldots, p_k f_k(x))\big). \tag{3}$$
If $\mathcal{F}$ is a family of Gaussian densities, then $f = \max(p_1 f_1, \ldots, p_k f_k)$ is not a density, but a subdensity, i.e. $\int_{\mathbb{R}^N} f(x)\,dx \leq 1$.
The formula (3) is known as the cross-entropy (Rubinstein and Kroese 2004) of the dataset $X$ with respect to $f$. If we define a partition $X_1, \ldots, X_k$ of $X$ by
$$X_i := \{x \in X : p_i f_i(x) = \max_{j} p_j f_j(x)\}$$
(with ties broken arbitrarily), then (3) can be rewritten as
$$\frac{1}{|X|} \sum_{i=1}^{k} \sum_{x \in X_i} \big(-\log p_i - \log f_i(x)\big).$$
One can understand the CEC formula as the mean cost of encoding a symbol from the dataset $X$ by a model consisting of $k$ encoders: the term $-\log p_i$ is the code-length of the encoder identifier, while $-\log f_i(x)$ determines the code-length of $x$ when using the $i$-th coding algorithm. An immediate consequence of the above formula is that the clusters do not "cooperate" with one another to estimate the density of $X$ (it is a hard type of clustering), and as a result it is enough to define an optimal description for each cluster separately. Let us observe that each cluster has its individual cost given by $-\log p_i$, which regularizes the clustering model. While the introduction of one more group usually improves the likelihood function, it also increases the complexity of the model. This is the reason why CEC tends to reduce unnecessary clusters.
We assume that $\mathcal{F}$ is a family of Gaussian densities $N(m, \Sigma)$. Let $X_1, \ldots, X_k$ be a fixed partition of $X$. By $\hat{m}_i$ and $\hat{\Sigma}_i$ we denote the sample mean and covariance matrix calculated within a group $X_i$ as:
$$\hat{m}_i = \frac{1}{|X_i|} \sum_{x \in X_i} x, \qquad \hat{\Sigma}_i = \frac{1}{|X_i|} \sum_{x \in X_i} (x - \hat{m}_i)(x - \hat{m}_i)^T.$$
The infimum of the CEC cost function (3) for a partition $X_1, \ldots, X_k$ taken over all acceptable $p_i$ and $f_i$, for $i = 1, \ldots, k$, equals:
$$\sum_{i=1}^{k} p_i \left(-\log p_i + \frac{N}{2} \log(2\pi e) + \frac{1}{2} \log \det \hat{\Sigma}_i\right), \tag{4}$$
where $p_i = |X_i| / |X|$. To find a partition that minimizes (4), one can adopt the iterative Hartigan algorithm, which is commonly used in the case of k-means (Jain 2010). The idea of the Hartigan method is to proceed over all elements of $X$, switching the membership of particular elements to those clusters which would maximally decrease the cost function (Telgarsky and Vattani 2010; Hartigan and Wong 1979). It can be proven that this algorithm refines a given partition and finally finds a locally optimal solution.
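To make the cost (4) concrete, the following sketch evaluates it for a hard partition, assuming the standard closed form for Gaussian CEC, namely the sum over clusters of p_i(-log p_i + (N/2) log(2πe) + (1/2) log det Σ̂_i) with p_i = |X_i|/|X| and the biased sample covariance. It relies on NumPy; the function name `gaussian_cec_cost` is illustrative, and clusters with a singular covariance are not handled.

```python
import numpy as np


def gaussian_cec_cost(X, labels):
    """Gaussian CEC cost of the hard partition of X encoded by `labels`."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, dim = X.shape
    cost = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        p = len(pts) / n                            # cluster weight p_i
        centered = pts - pts.mean(axis=0)
        cov = centered.T @ centered / len(pts)      # biased sample covariance
        _, logdet = np.linalg.slogdet(cov)
        cost += p * (-np.log(p)
                     + 0.5 * dim * np.log(2 * np.pi * np.e)
                     + 0.5 * logdet)
    return cost


X = [[-1.0], [1.0], [9.0], [11.0]]
one_group = gaussian_cec_cost(X, [0, 0, 0, 0])
two_groups = gaussian_cec_cost(X, [0, 0, 1, 1])
print(one_group, two_groups)
```

On this toy data the two-cluster partition has the lower cost: the extra term log 2 paid for the second cluster is smaller than the saving from the much smaller within-cluster variance.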
The entire procedure can be summarized in the following steps:
1. Let $X_1, \ldots, X_k$ be an initial partition of $X$. In the simplest case it can be a random grouping.
2. Iterate over all $x \in X$ and execute the following steps until no cluster membership has been changed:
(a) For $x \in X_i$, find the cluster $X_j$ for which the decrease of the cost function (4) is maximal. To evaluate the change of the cost function after the reassignment from $X_i$ to $X_j$, we have to temporarily recalculate the probabilities, means and covariances of these clusters.
(b) If the optimal cluster $X_j \neq X_i$, then switch $x$ from $X_i$ to $X_j$ and update the parameters of these clusters permanently. Otherwise, no reassignment is performed.
(c) Remove a cluster $X_i$ if $p_i < \varepsilon$ (for most applications $\varepsilon \leq 2\,\%$ provides satisfactory results) and assign its elements to different clusters according to point (a).
Though it may seem that the recalculation of the cluster model in the above procedure involves high computational complexity, it does not. The following formulas show that the time of these updates does not depend on the cardinality of the data, but only on the dimension of the dataset. Every cluster only has to remember its current sample mean and covariance.
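The steps above can be sketched as a naive Hartigan-style loop. This is a minimal illustration under several assumptions that are not taken from the paper: the cost is recomputed from scratch for every candidate move (no incremental updates), a small ridge `reg` is added to each covariance so that tiny clusters do not produce a singular matrix, and the defaults `eps=0.05`, `reg=1e-3` and the function names are arbitrary choices.

```python
import numpy as np


def cec_cost(X, labels, reg):
    """Gaussian CEC cost of a hard partition; `reg` is a stabilizing ridge."""
    n, dim = X.shape
    cost = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        p = len(pts) / n
        centered = pts - pts.mean(axis=0)
        cov = centered.T @ centered / len(pts) + reg * np.eye(dim)
        cost += p * (-np.log(p) + 0.5 * dim * np.log(2 * np.pi * np.e)
                     + 0.5 * np.linalg.slogdet(cov)[1])
    return cost


def best_cluster(X, labels, idx, candidates, reg):
    """Candidate cluster with the lowest cost when point `idx` is placed in it."""
    costs = []
    for c in candidates:
        trial = labels.copy()
        trial[idx] = c
        costs.append(cec_cost(X, trial, reg))
    return candidates[int(np.argmin(costs))], min(costs)


def hartigan_cec(X, k, eps=0.05, reg=1e-3, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))        # step 1: random grouping
    changed = True
    while changed:
        changed = False
        for idx in range(len(X)):                   # steps 2(a) and 2(b)
            others = [c for c in np.unique(labels) if c != labels[idx]]
            if not others:
                continue
            target, cost = best_cluster(X, labels, idx, others, reg)
            if cost < cec_cost(X, labels, reg) - 1e-12:
                labels[idx] = target
                changed = True
        for c in np.unique(labels):                 # step 2(c): drop tiny clusters
            members = np.flatnonzero(labels == c)
            others = [o for o in np.unique(labels) if o != c]
            if others and len(members) / len(X) < eps:
                for idx in members:
                    labels[idx], _ = best_cluster(X, labels, idx, others, reg)
                changed = True
    return labels


rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])
labels = hartigan_cec(data, k=3)
print(np.unique(labels, return_counts=True))
```

A practical implementation would replace the full recomputation in `cec_cost` with constant-time updates of the affected clusters' means and covariances.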
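Observation 1 (Tabor and Spurek 2014, Theorem 4.3) Let $U, V$ be two subsets of $X \subset \mathbb{R}^N$ with sample means $\hat{m}_U, \hat{m}_V$, covariance matrices $\hat{\Sigma}_U, \hat{\Sigma}_V$ and associated prior probabilities $p(U), p(V) \geq 0$ such that $p(U) + p(V) \leq 1$ [the role of $p(U), p(V)$ is the same as that of the numbers $p_i$ in (3)].
• If we assume that $U \cap V = \emptyset$ then the sample mean and the covariance of $U \cup V$ equal:
$$\hat{m}_{U \cup V} = p_U \hat{m}_U + p_V \hat{m}_V, \qquad \hat{\Sigma}_{U \cup V} = p_U \hat{\Sigma}_U + p_V \hat{\Sigma}_V + p_U p_V (\hat{m}_U - \hat{m}_V)(\hat{m}_U - \hat{m}_V)^T,$$
where $p_U = \frac{p(U)}{p(U) + p(V)}$ and $p_V = \frac{p(V)}{p(U) + p(V)}$.
• If we assume that $V \subset U$ (with $p(V) < p(U)$) then the sample mean and covariance of $U \setminus V$ equal:
$$\hat{m}_{U \setminus V} = q_U \hat{m}_U - q_V \hat{m}_V, \qquad \hat{\Sigma}_{U \setminus V} = q_U \hat{\Sigma}_U - q_V \hat{\Sigma}_V - q_U q_V (\hat{m}_U - \hat{m}_V)(\hat{m}_U - \hat{m}_V)^T,$$
where $q_U = \frac{p(U)}{p(U) - p(V)}$ and $q_V = \frac{p(V)}{p(U) - p(V)}$.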
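The update rules of this kind can be checked numerically. The sketch below assumes the usual moment identities for a disjoint union and for a set difference, with weights normalized by p(U) + p(V) and p(U) − p(V) respectively and with the biased sample covariance; the function names are illustrative. It compares the formulas against a direct computation on random points.

```python
import numpy as np


def moments(points):
    """Sample mean and biased covariance of a point set."""
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    centered = points - mean
    return mean, centered.T @ centered / len(points)


def union_moments(p_u, m_u, s_u, p_v, m_v, s_v):
    """Mean and covariance of a disjoint union from the parts' statistics."""
    a, b = p_u / (p_u + p_v), p_v / (p_u + p_v)
    d = (m_u - m_v)[:, None]
    return a * m_u + b * m_v, a * s_u + b * s_v + a * b * (d @ d.T)


def difference_moments(p_u, m_u, s_u, p_v, m_v, s_v):
    """Mean and covariance of U minus V, for V contained in U."""
    a, b = p_u / (p_u - p_v), p_v / (p_u - p_v)
    d = (m_u - m_v)[:, None]
    return a * m_u - b * m_v, a * s_u - b * s_v - a * b * (d @ d.T)


rng = np.random.default_rng(0)
U, V = rng.normal(size=(5, 3)), rng.normal(size=(7, 3))
W = np.vstack([U, V])

m_union, s_union = union_moments(5 / 12, *moments(U), 7 / 12, *moments(V))
m_diff, s_diff = difference_moments(1.0, *moments(W), 7 / 12, *moments(V))
print(np.allclose(s_union, moments(W)[1]), np.allclose(s_diff, moments(U)[1]))
```

Both updates touch only N-dimensional vectors and N×N matrices, which is why moving a point between clusters does not require revisiting the clusters' members.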

Cross-entropy C4S
Let $X \subset \mathbb{R}^N$ be a dataset augmented by a set of positive and negative equivalence constraints. We discuss the application of CEC in the C4s algorithm, in particular in realizing the inner and the global clustering stages. We consider a Gaussian version of CEC, i.e. every cluster is modeled as a Gaussian probability distribution. For convenience, we use the notation and terminology introduced in Sect. 2.

Inner clustering
In the inner clustering we extract atomic parts of every initial chunklet. In the case of CEC, we try to discover final chunklets represented by Gaussian distributions. To run CEC, the maximum number of groups $gr^{(i)}_{\max}$, for $i = 1, \ldots, k$, must be specified for each initial chunklet. Since the constraints usually cover a small number of examples, $gr^{(i)}_{\max}$ should also be small. The application of CEC to every initial chunklet $C_i$, for $i = 1, \ldots, k$, produces a partition into final chunklets $\mathcal{C} = \{C^{(i)}_j\}$. Every final chunklet $C^{(i)}_j$ is described by a Gaussian model $N(\hat{m}^{(i)}_j, \hat{\Sigma}^{(i)}_j)$, with a sample mean and a covariance matrix calculated within this chunklet, as well as the associated weight $p^{(i)}_j$.
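Global clustering. An input to global clustering is the family of final chunklets $\mathcal{C}$ and the maximum number of groups $gr_{\max}$. To adapt a clustering method to process such a dataset, we assume that each final chunklet is represented by the probability model calculated during the inner clustering stage. More precisely, every final chunklet $C^{(i)}_j$ is represented by the Gaussian density $f^{(i)}_j = N(\hat{m}^{(i)}_j, \hat{\Sigma}^{(i)}_j)$. Moreover, every model has an attached weight $p^{(i)}_j = |C^{(i)}_j| / |X|$ proportional to the cardinality of the chunklet. For a one-element chunklet $C^{(i)}_j = \{x\}$ we have $\hat{m}^{(i)}_j = x$ and $\hat{\Sigma}^{(i)}_j = 0$. This does not define a Gaussian model, but a Dirac measure concentrated at $x$. Nevertheless, we keep the symbol $f^{(i)}_j$ to denote such a probability model. In consequence, we focus on clustering a set of probability models $M(\mathcal{C}) = \{(p^{(i)}_j, f^{(i)}_j) : C^{(i)}_j \in \mathcal{C}\}$.
To evaluate the CEC cost function (4) of a partition $\mathcal{P}$ of $M(\mathcal{C})$ (which is now interpreted as a set of probability models), one has to know the covariance and probability coefficient of each cluster. This can be calculated using Observation 1. More precisely, the sample covariance matrix of a union of two final chunklets $C_{i_1}, C_{i_2}$ is directly given by Observation 1. The sample covariance matrix of the union of $l$ final chunklets $C_{i_1}, \ldots, C_{i_l}$ is calculated with the use of a recursive formula:
$$\hat{\Sigma}_{i_1, \ldots, i_l} = q_1 \hat{\Sigma}_{i_1} + q_2 \hat{\Sigma}_{i_2, \ldots, i_l} + q_1 q_2 (\hat{m}_{i_1} - \hat{m}_{i_2, \ldots, i_l})(\hat{m}_{i_1} - \hat{m}_{i_2, \ldots, i_l})^T, \qquad q_1 = \frac{p_{i_1}}{p_{i_1} + \ldots + p_{i_l}}, \quad q_2 = 1 - q_1,$$
where we assume that $\hat{m}_{i_2, \ldots, i_l}$ and $\hat{\Sigma}_{i_2, \ldots, i_l}$ are known.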
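The representation of final chunklets by weighted models can be sketched as follows. Each chunklet is summarized by a triple (weight, mean, covariance), a one-element chunklet gets a zero covariance (the Dirac case), and a cluster of chunklets is summarized by folding the pairwise union rule over its members. The names `chunklet_model`, `join` and `cluster_cost`, and the use of the standard Gaussian CEC term for the cluster cost, are assumptions made for illustration.

```python
import numpy as np
from functools import reduce


def chunklet_model(points, n_total):
    """(weight, mean, covariance) of a final chunklet; a singleton has zero covariance."""
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    centered = points - mean
    return len(points) / n_total, mean, centered.T @ centered / len(points)


def join(model_a, model_b):
    """Model of the union of two disjoint chunklets (or unions of chunklets)."""
    (pa, ma, sa), (pb, mb, sb) = model_a, model_b
    a, b = pa / (pa + pb), pb / (pa + pb)
    d = (ma - mb)[:, None]
    return pa + pb, a * ma + b * mb, a * sa + b * sb + a * b * (d @ d.T)


def cluster_cost(models):
    """Contribution of one cluster of chunklet models to the Gaussian CEC cost."""
    p, _, cov = reduce(join, models)
    dim = cov.shape[0]
    return p * (-np.log(p) + 0.5 * dim * np.log(2 * np.pi * np.e)
                + 0.5 * np.linalg.slogdet(cov)[1])


rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(4, 2)), rng.normal(size=(1, 2)), rng.normal(size=(3, 2))
models = [chunklet_model(part, 10) for part in (A, B, C)]   # dataset of 10 points
weight, mean, cov = reduce(join, models)
print(weight, cluster_cost(models))
```

Because only the triples are combined, the cost of a candidate cluster of chunklets can be evaluated without going back to the individual data points.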
If any negative constraint is introduced, then the global clustering must additionally preserve the validity of the partition. Since CEC relies on iteratively switching elements between clusters, it is sufficient to verify the conditions given in Theorem 3 and the succeeding pseudocode. It is enough to incorporate this pseudocode into step 3.1 of the CEC procedure described in the previous subsection.

Experimental results
In this section the proposed C4s method is examined on several datasets. First, it is applied to semi-supervised image segmentation; then it is compared with other constrained clustering methods on selected examples retrieved from the UCI repository and one real dataset of chemical compounds. A demonstration version of C4s is publicly available at: http://ww2.ii.uj.edu.pl/~wiercioc/C4s/. Please contact the second author for further information.

Image segmentation
Let us consider the image of a dog (Arbelaez et al. 2011) presented in Fig. 5a. A natural goal of image segmentation is to discover the dog's shape. As can be seen in Fig. 5b, the adaptation of classical CEC to images (Śmieja and Tabor 2013) produces five parts, two of which form the dog's shape.
Since it is difficult to perform an unsupervised segmentation which detects only two parts (the dog's shape and the background), we ask for examples of elements which should be grouped together. Figure 6a presents the imposed constraints graphically: pixels marked by hand in one color are restricted to be in the same group.
C4s reads these restrictions and in the first stage individually clusters the elements sharing the same constraint. As a result, two groups were obtained from the first initial chunklet (elements marked in black) and three groups from the second initial chunklet (elements marked in white). Then, the algorithm takes these five final chunklets and the rest of the dataset and performs the global clustering. Figure 6b shows the effect: the dog was discovered very well. All the clustering procedures were started with ten initial groups.
This semi-supervised scenario, in which a user indicates examples of objects to be extracted, often appears in real situations. In medical sciences, video capsule endoscopy is an examination in which thousands of pictures are taken from inside the gastrointestinal tract (Vyas et al. 2014). A doctor is not able to manually check the entire video of every patient in search of lesions. In consequence, it might be preferable to mark only a few interesting examples in the pictures and then let an application discover the rest of them. On the other hand, biologists must analyze a great number of microscopic images of cells, which might be impossible in practice (Wu et al. 2008). They often use computer tools, like ilastik (Sommer et al. 2011), which can perform automated semi-supervised image segmentation. Finally, pairwise equivalence constraints facilitate the detection of a walking person or the tracking of missiles carried on a moving vehicle in military applications.

Table 1 For each example from the UCI repository we consider three variants of reference membership: the original membership and two modified ones. Fourth column contains the reference classes. The classes marked with * are constructed from the original ones

UCI repository
To compare the performance of C4s with the constrained versions of GMM and hierarchical clustering (HC) (Shental et al. 2004; Klein et al. 2002), we tested it on several datasets selected from the UCI repository (Lichman 2013).
The results were evaluated using the adjusted Rand index (ARI) (Hubert and Arabie 1985), which is a well-known measure of agreement between two partitions. ARI attains its maximum value 1 in the case of ideal agreement, while for independent partitions its expected value is 0.
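For reference, the index can be computed from the contingency table of the two partitions. The sketch below follows the usual Hubert and Arabie formulation in plain Python; the function name `adjusted_rand_index` and the convention of returning 1.0 in degenerate cases are choices made for this illustration.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Adjusted Rand index between two labelings of the same elements."""
    if len(a) != len(b):
        raise ValueError("both labelings must cover the same elements")
    n = len(a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of elements grouped together in both partitions, in a only, in b only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions consist of one cluster
    return (sum_cells - expected) / (max_index - expected)
```

The value does not depend on how the clusters are named, so a clustering result can be compared with a reference partition directly, without matching labels first.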
In order to obtain side information, a teacher was employed. The teacher is given a random selection of M elements from a dataset and is then asked to partition this set of retrieved points into equivalence classes, which are used as equivalence constraints. We carried out experiments in two settings: when approximately 15 % of data points are constrained, and when approximately 30 % of data points are constrained. We tested all methods in two modes:
• using only positive equivalence constraints;
• using both positive and negative equivalence constraints.
Since GMM and C4s are nondeterministic, we ran each of them 10 times and chose the result with the lowest value of the cost function.
As mentioned in the previous section, our algorithm is intended to perform a clustering which discovers groups that are possibly generated from a mixture of models. In the present experiment, we consider three variants of the reference membership of each dataset: the first one is the original membership (defined by UCI), while the other two are modified by merging selected groups (except Ionosphere and B-C-W, where the partitions contain only two groups). In the modified variants, some of the original groups are joined in order to obtain a reference partition where clusters are described by complex probability distributions. Table 1 provides detailed information on the datasets and their modifications used in the simulations.
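The teacher protocol can be simulated as in the following minimal sketch. It assumes that the teacher is an oracle that groups the sampled points according to the reference labels, and that a negative constraint is placed between every pair of distinct equivalence classes; the function name `teacher_constraints` and its parameters are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations
from typing import Hashable, List, Sequence, Tuple


def teacher_constraints(labels: Sequence[Hashable],
                        fraction: float,
                        seed: int = 0) -> Tuple[List[List[int]], List[Tuple[int, int]]]:
    """Sample a fraction of the points and split them into equivalence classes.

    Returns the classes (lists of point indices, i.e. positive constraints)
    and the negative constraints as pairs of class indices.
    """
    rng = random.Random(seed)
    m = round(fraction * len(labels))
    chosen = rng.sample(range(len(labels)), m)
    classes = defaultdict(list)
    for idx in chosen:
        classes[labels[idx]].append(idx)  # the oracle groups points by reference label
    chunklets = [sorted(members) for members in classes.values()]
    # Points from different classes must not share a cluster.
    negative = list(combinations(range(len(chunklets)), 2))
    return chunklets, negative
```

In the positive-only mode the second returned value is simply ignored; calling the function with `fraction=0.15` or `fraction=0.30` corresponds to the two settings described above.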
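The construction of a modified reference membership amounts to relabeling. The sketch below shows one way to do it; the function `merge_reference_classes` and the class names in the usage note are hypothetical and do not correspond to the actual entries of Table 1.

```python
from typing import Dict, Hashable, List, Sequence


def merge_reference_classes(labels: Sequence[Hashable],
                            groups: Sequence[Sequence[Hashable]]) -> List[int]:
    """Relabel points so that all classes listed in one group share a label.

    Classes that appear in no group keep a separate label of their own.
    """
    mapping: Dict[Hashable, int] = {}
    for new_label, group in enumerate(groups):
        for old in group:
            if old in mapping:
                raise ValueError(f"class {old!r} listed in more than one group")
            mapping[old] = new_label
    next_label = len(groups)
    merged = []
    for lab in labels:
        if lab not in mapping:  # unmerged class: assign a fresh label
            mapping[lab] = next_label
            next_label += 1
        merged.append(mapping[lab])
    return merged
```

For example, `merge_reference_classes(["a", "b", "c", "a", "d"], [["a", "b"]])` joins the classes "a" and "b" into one reference cluster and leaves "c" and "d" separate.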
The following values were assumed as CEC parameters: gr = 3 * k, where k is the correct number of clusters, gr_i = 4, ε = ε_i = 1 %. It should be noted that the GMM and HC algorithms do not detect the number of clusters automatically. For this reason, the number of clusters for these methods in a given mode was set to the number of clusters returned by C4s. Several observations follow from the results reported in Fig. 7:
• According to Fig. 7a-c, f, i, l, o, the performance of all algorithms on the original reference partitions is almost identical.
• The advantage of the C4s method is evident in the case of modified reference memberships (see Fig. 7d, e, g, h, j, k, m, n, p, q). After joining the clusters, the internal structure of each group becomes too complex to be described by just one model. In consequence, C4s provides significantly higher ARI than constrained GMM and HC.
• After incorporating 30 % of random constraints, C4s gives the best value of agreement. Furthermore, in most cases adding negative constraints improves on the results obtained with only positive constraints.
• Apart from that, the proposed algorithm detects the right number of regions quite precisely.

Chemical compounds
This experiment relies on grouping a selected set of chemical compounds with respect to their structural features. The set of compounds acting on the central nervous system (CNS), namely 5-HT1A receptor ligands, was chosen for this example (Olivier et al. 1999; Śmieja and Warszycki 2016). The results were compared to the partition created manually by experts (Warszycki et al. 2013). Chemical compounds are usually represented by binary strings called fingerprints. The bit "1" means the presence of a particular feature of a compound, while "0" denotes its absence (see Fig. 8). There are many fingerprint representations, since various features can be taken into account (Willett 2005). Our experiment uses the Klekota-Roth fingerprint, which provides a reasonably good description of a compound (it contains 4860 bits) (Klekota and Roth 2008).

Table 2 Fourth column shows which groups of the lowest level of the hierarchy (Fig. 9) are merged to obtain reference partitions

Fig. 10 Adjusted Rand index of C4s, GMM and HC over the dataset of chemical compounds acting on the central nervous system. The results are shown for four cases: using 15 % of the data points in positive constraints (15 % p), using 15 % of the data points in both positive and negative constraints (15 % p & n), using 30 % of the data points in positive constraints (30 % p), using 30 % of the data points in both positive and negative constraints (30 % p & n)
The reference partition has a hierarchical structure (Fig. 9). One can decide how many groups should be taken into account. In the experiment, four different levels of the hierarchy were chosen and therefore four different reference groupings were obtained. In consequence, partitions into 28, 24, 18 and 12 groups were considered. Table 2 shows which groups from the lowest level of the reference hierarchy were merged in order to create a reference partition.
As in the UCI example, the cases of 15 and 30 % of constrained points (both positive and negative) were examined, and similar values of parameters for C4s were used. Moreover, the number of groups returned by C4s was used as the input to constrained GMM and HC.
The results shown in Fig. 10 clearly indicate that the advantage of the proposed method increases when a reference partition contains more complex groups. When a reference clustering into 28 groups is assumed, all three examined methods provide similar values of ARI (Fig. 10a). The more groups were combined into larger clusters, the larger the differences observed between C4s and the two other methods (Fig. 10b-d). Moreover, the introduced method gives significantly better results for a greater number of elements with constraints. This follows from the fact that the inner clustering processes are performed on sets of elements with the same positive constraints; i.e. the more elements are taken for the inner clustering, the more accurate the results obtained.

Conclusion and future work
In this paper we proposed a novel semi-supervised clustering technique, C4s, which incorporates equivalence constraints. The idea of the introduced method is indebted to the work of Shental et al. (2004), who applied a Gaussian mixture model to clustering with strict pairwise constraints. The conceptual difference between these two algorithms lies in the number of components used to describe a cluster. C4s allows a cluster to be understood as a complex structure whose elements are generated by a mixture of simple models, which is a novel concept in constrained clustering.
This reasoning is motivated by real-life examples where data is often classified in a hierarchical structure. Groups defined at the lowest level of the hierarchy represent simple models, while their mixtures are used to describe clusters at higher levels. As an example, one can consider an expert classification of chemical compounds (see Fig. 9).
The numerical results were consistent with the assumed theoretical model and confirmed that the proposed method is more suitable for data clustering when pairwise constraints suggest a complex structure of groups. It outperformed constrained GMM (Shental et al. 2004) as well as hierarchical clustering with equivalence constraints (Klein et al. 2002).
As mentioned in the paper, the introduced general algorithm can be implemented in combination with various clustering methods. This study assumed the existence of clusters described by Gaussian mixtures and applied the CEC method. In the future, we plan to consider different techniques which are suited to particular forms of data. Moreover, it would also be a challenge to adapt this procedure to the case of soft constraints, which can occasionally be violated during grouping (Bilenko et al. 2004; Lu and Leen 2004; Wang and Davidson 2010).

Fig. 2 Presentation of the chunklets' construction for the Banana-like set. The expected grouping of the Banana-like set (a). After imposing the positive and negative constraints on 30 % of its elements, two initial chunklets are created (b). The cross-entropy implementation of the introduced algorithm generates six final chunklets (c)

Fig. 5 Image of the dog (a) and unsupervised CEC clustering (b)

Fig. 7 Adjusted Rand index of C4s, constrained GMM and HC over seven datasets from the UCI repository: Ionosphere, Breast Cancer Wisconsin, E. coli, Segment, Glass, Wine, Yeast for three types of reference partitions for each set (except the first two sets). The results are shown for four cases: using 15 % of the data points in positive constraints (15 % p), using 15 % of the data points in both positive and negative constraints (15 % p & n), using 30 % of the data points in positive constraints (30 % p), using 30 % of the data points in both positive and negative constraints (30 % p & n)

Fig. 8 Exemplary topological fingerprint of chemical compounds. Value 1 means presence, whereas value 0 means absence of predefined structural patterns

Table 1
Datasets used in the experiments

Table 2
Four cases of reference grouping