Scalable Active Constrained Clustering for Temporal Data

  • Son T. Mai
  • Sihem Amer-Yahia
  • Ahlame Douzal Chouakria
  • Ky T. Nguyen
  • Anh-Duong Nguyen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10827)

Abstract

In this paper, we introduce a novel interactive framework to handle both instance-level and temporal smoothness constraints for clustering large temporal data. It consists of a constrained clustering algorithm, called CVQE+, which optimizes the clustering quality, the constraint violation cost, and the historical cost between consecutive data snapshots. At the center of our framework is a simple yet effective active learning technique, named Border, for iteratively selecting the most informative pairs of objects to query users about and updating the clustering with the new constraints. These constraints are then propagated inside each data snapshot and between snapshots via two schemes, called constraint inheritance and constraint propagation, to further enhance the results. Experiments show clustering results that are better than or comparable to those of state-of-the-art techniques, as well as high scalability for large datasets.

Keywords

Semi-supervised clustering · Active learning · Interactive clustering · Incremental clustering · Temporal clustering

1 Introduction

In semi-supervised clustering, domain knowledge is typically encoded in the form of instance-level must-link and cannot-link constraints [9] to aid the clustering process, thus enhancing the quality of results. Such constraints specify that two objects must or must not be placed in the same cluster, respectively. Constraints have been successfully applied to improve clustering quality in real-world applications, e.g., identifying people from surveillance cameras [9] and aiding robot navigation [8]. However, current research on constrained clustering still faces several major issues, described below.

Most existing approaches assume that we have a set of constraints beforehand, and an algorithm will use this set to produce clusters [2, 8]. Davidson [6] shows that the clustering quality varies significantly across different constraint sets of the same size. Moreover, annotating constraints requires human intervention, an expensive and time-consuming task that should be minimized as much as possible for the same expected clustering quality. Therefore, how to choose a good and compact set of constraints, rather than randomly selecting them from the data, has been the focus of many research efforts, e.g., [1, 15, 19].

Many approaches employ different active learning schemes to select the most meaningful pairs of objects and then query experts for constraint annotation [1, 15]. By allowing the algorithm to choose constraints itself, we can avoid insignificant ones and expect more compact, higher-quality constraint sets than with the randomized scheme. These constraints are then used as input for constrained clustering algorithms. However, if users are not satisfied with the results, they are asked to provide another constraint set and start the clustering again, which is obviously time-consuming and expensive.

Other algorithms follow a feedback scheme which does not require a full set of constraints at the beginning [5]. They iteratively produce clusters with the available constraints, show the results to users, and get feedback in the form of new constraints. By iteratively refining clusters according to user feedback, the acquired results fit users’ expectations better [5]. Constraints are also easier to select with an underlying cluster structure as a guideline, thus reducing the overall number of constraints and the human annotation effort for the same quality level. However, exploring the whole data space to find meaningful constraints is still a non-trivial task for users.

To reduce human effort, several methods incorporate active learning into the feedback process, e.g., [13, 14, 15, 19]. At each iteration, the algorithm automatically chooses pairs of objects and queries users for feedback in terms of must-link and cannot-link constraints, instead of leaving the whole clustering result for users to examine. Though these active feedback techniques have proven very useful in real-world tasks such as document clustering [13], they suffer from very high runtimes since they have to repeatedly perform clustering as well as explore all \(O(n^2)\) pairs of objects to generate queries each time.

In this paper and its preliminary version [18], we develop an efficient framework to cope with the above problems, following the iterative active learning approach of [13, 19]. However, instead of examining all pairs of objects, our technique, called Border, selects a small set of objects around cluster borders and queries users about the most uncertain pairs of objects. We also introduce a constraint inheritance approach based on the notion of \(\mu \)-nearest neighbors for inferring additional constraints, thus further boosting performance. Finally, we revisit our approach in the context of evolutionary clustering [4]. Evolutionary clustering aims to produce high quality clusters while ensuring that the clustering does not change dramatically between consecutive timestamps. This scheme is very useful in many application scenarios. For example, doctors who track groups of patients based on their treatment progress each year may expect existing groups not to change much over time if there are only minor changes in the data. However, the clustering process should still reflect the changes if there are significant differences in the new data. Therefore, we formulate a temporal smoothness constraint in our framework and add a time-fading factor to our constraint propagation.

Contributions. Our contributions are summarized as follows:
  • We introduce a new algorithm CVQE+ that extends CVQE [8] with weighted must-link and cannot-link constraints and a new object assignment scheme.

  • We propose a new algorithm, Border, that relies on active clustering and constraint inheritance to choose a small number of objects to solicit user feedback for. Besides the active selection scheme for pairs of objects, Border employs a constraint inheritance method for inferring more constraints, thus further enhancing performance.

  • We present an evolutionary clustering framework which incorporates instance-level and temporal smoothness constraints for temporal data. To the best of our knowledge, our algorithm is the first framework that combines active learning, instance-level and temporal smoothness constraints.

  • Experiments on six real datasets demonstrate the performance of our algorithms compared to state-of-the-art ones.

Outline. The rest of the paper is organized as follows. We formulate the problem in Sect. 2. Our framework is described in Sect. 3. Experiments are presented in Sect. 4. Section 5 discusses related work. Section 6 concludes the paper.

2 Problem Formulation

Let \(D = \{(d,t)\}\) be a set of |D| vectors \(d \in \mathbb {R}^p\) observed at time t. Let \(S=\{(S_s,D_s,ts_{s},te_{s})\}\) be a set of preselected |S| data snapshots. Each \(S_s\) starts at time \(ts_{s}\), ends at time \(te_{s}\) and contains a set of objects \(D_s = \{(d,t) \in D \; | \; ts_{s} \le t < te_s\}\). Two snapshots \(S_s\) and \(S_{s+1}\) may overlap but must satisfy the time order, i.e., \(ts_{s} \le ts_{s+1}\) and \(te_{s} \le te_{s+1}\). For each snapshot \(S_s\), let \(ML_s = \{(x,y,w_{xy}) | (x,y) \in D_s^2 \}\) and \(CL_s = \{(x,y,w_{xy}) | (x,y) \in D_s^2 \}\) be the set of must-link and cannot-link constraints of \(S_s\) with a degree of belief of \(w_{xy} \in [0,1]\). Initially, \(ML_s\) and \(CL_s\) can be empty.

In this paper, we focus on the problem of grouping objects in all snapshots into clusters. Our goals are to (1) reduce the number of constraints, and thus the constraint annotation cost, (2) make the algorithm scale well to large datasets, and (3) smooth the gap between the clustering results of two consecutive snapshots, i.e., ensure temporal smoothness.
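To make the notation concrete, the following is a minimal sketch of this data model as plain Java classes. It is only an illustration of the definitions above; all class and field names are our own and not taken from the paper's implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** One snapshot S_s = (D_s, ts_s, te_s) together with its constraint sets. */
class Snapshot {
    final double tsStart, tsEnd;  // [ts_s, te_s): time range of the snapshot
    final List<double[]> objects = new ArrayList<>();      // D_s, vectors in R^p
    final List<Constraint> mustLink = new ArrayList<>();   // ML_s, initially empty
    final List<Constraint> cannotLink = new ArrayList<>(); // CL_s, initially empty

    Snapshot(double tsStart, double tsEnd) {
        this.tsStart = tsStart;
        this.tsEnd = tsEnd;
    }
}

/** A weighted pairwise constraint (x, y, w_xy). */
class Constraint {
    final int x, y; // indices of the two objects within the snapshot
    double weight;  // degree of belief w_xy in [0, 1]

    Constraint(int x, int y, double weight) {
        this.x = x;
        this.y = y;
        this.weight = weight;
    }
}
```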

3 Our Proposed Framework

Figure 1 illustrates our framework, which relies on two algorithms, Border and CVQE+. Our framework starts with a small (or empty) set of constraints in each snapshot. Then, it iteratively produces clustering results and receives refined constraints from users in the next iterations. This process is akin to feedback-driven algorithms for enhancing clustering quality and reducing human annotation effort [5]. However, instead of passively waiting for user feedback as in [5], our algorithm, Border, actively examines the current cluster structure, selects \(\beta \) pairs of objects whose clustering labels are the least certain, and asks users for their feedback in terms of instance-level constraints. Examining all possible pairs of objects to select queries is time consuming due to the quadratic number of candidates. To ensure scalability, Border limits its selection to a small set of most promising objects. When there are new constraints, instead of reclustering from scratch as in [13, 19], our algorithm, CVQE+, incrementally updates the cluster structure to save computation time. We also aim to ensure a smooth transition between consecutive clusterings [4]. We additionally introduce two novel concepts: (1) a constraint inheritance scheme for automatically inferring more constraints inside each snapshot and (2) a constraint propagation scheme for propagating constraints between different snapshots. These schemes significantly reduce the number of constraints that users must enter into the system to reach a desired level of clustering quality, by automatically adding more constraints based on the annotated ones. To the best of our knowledge, Border is the first framework that combines active learning, instance-level and temporal smoothness constraints.
Fig. 1.

Our active temporal clustering framework

3.1 Constrained Clustering Algorithm

For each snapshot \(S_s\), we use constrained kMeans for grouping objects. Generally, any existing technique such as MPCK-Means [2], CVQE [8] or LCVQE [17] can be used. Here we introduce CVQE+, an extension of CVQE [8] that copes with weighted constraints, to perform this task.

The New Algorithm CVQE+. Let \(C = \{C_i\}\) be a set of clusters. The cost of \(C_i\) is defined as the sum of its vector quantization cost \(Cost_{VQE_i}\) and the constraint violation costs \(Cost_{ML_i}\) and \(Cost_{CL_i}\) (where \(ML_i \subseteq ML\) and \(CL_i \subseteq CL\) are the sets of must-link and cannot-link constraints related to \(C_i\)) as follows:
$$\begin{aligned} Cost_{C_i}&= Cost_{VQE_i} + Cost_{ML_i} + Cost_{CL_i}\nonumber \\ Cost_{VQE_i}&= \sum _{x \in C_i} (c_i - x)^2 \nonumber \\ Cost_{ML_i}&= \sum _{(a,b) \in ML_i \wedge vl(a,b)} w_{ab} (c_i - c_{\pi (a,b,i)})^2 \nonumber \\ Cost_{CL_i}&= \sum _{(a,b) \in CL_i \wedge vl(a,b)} w_{ab} (c_i - c_{\varphi (i)})^2 \end{aligned}$$
(1)
where \(vl(a,b)\) is true iff \((a,b)\) violates its must-link or cannot-link constraint, \(c_i\) is the center of cluster \(C_i\), \(\pi (a,b,i)\) returns the center of the cluster of a or b other than \(C_i\), and \(\varphi (i)\) returns the nearest cluster center to \(C_i\). Note that, unlike in [8], our \(Cost_{ML_i}\) is symmetric.
Taking the derivative of \(Cost_{C_i}\) and setting it to zero, the new center of \(C_i\) is updated as:
$$\begin{aligned} \frac{dCost_{C_i}}{dc_i}&= \frac{dCost_{VQE_i}}{dc_i} + \frac{dCost_{ML_i}}{dc_i} + \frac{dCost_{CL_i}}{dc_i}\nonumber \\ c_i&=\frac{{\displaystyle \sum _{x \in C_i} x + \sum _{(a,b) \in ML_i \wedge vl(a,b)} w_{ab}\, c_{\pi (a,b,i)} + \sum _{(a,b) \in CL_i \wedge vl(a,b)} w_{ab}\, c_{\varphi (i)}}}{ {\displaystyle |C_i| + \sum _{(a,b) \in ML_i \wedge vl(a,b)} w_{ab} + \sum _{(a,b) \in CL_i \wedge vl(a,b)} w_{ab}}} \end{aligned}$$
(2)
For each constraint (a, b), CVQE+ assigns objects to clusters by examining all \(k^2\) cluster combinations for a and b, like CVQE. The major difference is that when we calculate the violation cost, we consider all constraints starting or ending at a and b instead of only the constraint (a, b) as in CVQE [8] or LCVQE [17], which are very sensitive to cost changes when constraints share objects (changing these objects affects all their constraints), as illustrated in Fig. 2. The assignment cost for (a, b) thus also includes the violation costs for (a, x), (a, y) and (b, z). This scheme is expected to improve the clustering quality of CVQE+ compared to CVQE and LCVQE.
Fig. 2.

Assigning a pair of constrained objects (a, b) to clusters in CVQE+. All constraints starting or ending at a and b are considered

Complexity Analysis. Let n be the number of objects, m the number of constraints, and k the number of clusters. CVQE+ has time complexity \(O(rkn + rk^2m^2)\), which is higher than the \(O(rkn + rk^2m)\) of CVQE because all related constraints must be examined while assigning a constraint, where r is the number of iterations of the algorithm. Since k and m are constants, CVQE+ thus has time complexity linear in the number of objects n. It also requires O(n) space for storing objects and constraints.
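The sketch below shows how the center update of Eq. (2) could be implemented, assuming the caller has already determined which constraints are violated; the class, method and parameter names (CvqePlusUpdate, updateCenter, mlTargets, nearestOther, ...) are our own illustration, not the paper's code.

```java
import java.util.List;

class CvqePlusUpdate {
    /**
     * Recomputes the center of one cluster C_i per Eq. (2). mlTargets holds
     * c_{pi(a,b,i)} for each violated must-link, mlWeights / clWeights hold
     * the corresponding w_ab, and nearestOther is c_{phi(i)}.
     */
    static double[] updateCenter(List<double[]> members, List<double[]> mlTargets,
                                 List<Double> mlWeights, double[] nearestOther,
                                 List<Double> clWeights, int dim) {
        double[] num = new double[dim];
        double den = members.size();                 // |C_i|
        for (double[] x : members)
            for (int d = 0; d < dim; d++) num[d] += x[d];
        for (int j = 0; j < mlTargets.size(); j++) { // violated must-links pull c_i
            double w = mlWeights.get(j);             // toward the other endpoint's center
            for (int d = 0; d < dim; d++) num[d] += w * mlTargets.get(j)[d];
            den += w;
        }
        for (double w : clWeights) {                 // violated cannot-links pull c_i
            for (int d = 0; d < dim; d++) num[d] += w * nearestOther[d];
            den += w;                                // toward the nearest other center
        }
        for (int d = 0; d < dim; d++) num[d] /= den;
        return num;
    }
}
```

In the full algorithm, this update would run inside the usual kMeans-style loop, alternating with the constraint-aware assignment step, until the centers stabilize.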

3.2 Active Constraint Selection

We introduce an active learning method called Border for selecting pairs of objects and querying users for their constraint types. The general idea is to examine objects lying around the borders of clusters, since they are the most uncertain ones, and to choose a block of \(\beta \) pairs of objects to query users about until the query budget \(\delta \) is reached. Here, \(\beta \) and \(\delta \) are predefined constants.

Active Learning with Border. To avoid examining all pairs of objects, Border chooses a subset of \(m=min(O(\sqrt{n}),M)\) objects located at the boundary of the clusters as the main targets, since they are the most uncertain ones, where M is a predefined constant (default 100). This bound keeps the number of candidate pairs constant, thus reducing the number of pairs to examine in the subsequent steps. For each object a in cluster \(C_i\), the border score of a is defined as:
$$\begin{aligned} bor(a) = \frac{(a -c_i)^2}{(a-c_{\varphi (i)})^2(1+ml(a))(1+cl(a))} \end{aligned}$$
(3)
where ml(a) and cl(a) are the sums of the weights of the must-link and cannot-link constraints of a. Here, we favor objects that have fewer constraints, for increasing constraint diversity. This also fits well with our constraint inheritance scheme. Moreover, by considering the distance to the second nearest cluster center \(c_{\varphi (i)}\), we focus more on objects that are close to the boundary between two clusters rather than ones that are far away from all other clusters, which may not bring much benefit in clarifying the groups. For each cluster \(C_i\), we select the \(m|C_i|/n\) top objects based on the border score distribution in \(C_i\). This can be done by building a histogram with \(O(\sqrt{|C_i|})\) bins (a well-known rule of thumb for the optimal number of histogram bins [3]). Then, objects are taken sequentially from the outermost bins. This scheme ensures that all clusters are considered based on their current sizes: bigger clusters contribute more objects than smaller ones, since their changes will more likely affect the final clustering result. Moreover, by using histogram bins, we give equal chances to objects within a bin, since these objects might have the same importance for clarifying the clusters.
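A possible implementation of the border score in Eq. (3) and the per-cluster selection is sketched below. For brevity, the histogram over border scores is replaced by a plain descending sort, and the per-object constraint weight sums ml(a) and cl(a) are assumed precomputed; all identifiers are our own.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class BorderScore {
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
        return s;
    }

    /** Border score bor(a) of Eq. (3); own = c_i, second = c_phi(i). */
    static double bor(double[] a, double[] own, double[] second,
                      double mlWeightSum, double clWeightSum) {
        return sqDist(a, own)
                / (sqDist(a, second) * (1 + mlWeightSum) * (1 + clWeightSum));
    }

    /** Indices of the quota (= m|C_i|/n) most border-like members of one cluster. */
    static List<Integer> topBorderObjects(List<double[]> cluster, double[] own,
                                          double[] second, double[] mlSum,
                                          double[] clSum, int quota) {
        return IntStream.range(0, cluster.size()).boxed()
                .sorted(Comparator.comparingDouble(
                        (Integer i) -> -bor(cluster.get(i), own, second, mlSum[i], clSum[i])))
                .limit(quota)
                .collect(Collectors.toList());
    }
}
```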
For each selected object a, we estimate the uncertainty of a w.r.t. the current clustering result as:
$$\begin{aligned} sco(a)= ent(\mu nn(a)) + \frac{vl(ml(a)) + vl(cl(a))}{ml(a) + cl(a) + 1} \end{aligned}$$
(4)
where \(ent(\mu nn(a))\) is the entropy of the class labels of the \(\mu \) nearest neighbors of a, and vl(ml(a)) and vl(cl(a)) are the sums of the violated must-link and cannot-link constraints of a. A high sco(a) means that a lies in a highly uncertain area with mixed class labels and a high number of constraint violations, and thus should be focused on.

We divide the \(m^2 = O(n)\) pairs of selected objects into two sets: the set of within-cluster pairs X and the set of between-cluster pairs Y, i.e., for all \((x,y) \in X : label(x) = label(y)\) and for all \((x,y) \in Y : label(x) \ne label(y)\). A pair \((x,y) \in X\) is sorted by \(val(x,y)=\frac{(x-y)^2 (1+sco(x))(1+sco(y))}{(1+ml(x)+cl(x))(1+ml(y)+cl(y))}\). For \((x,y) \in Y\), \(val(x,y)=\frac{(x-y)^2 (1+ml(x)+cl(x))(1+ml(y)+cl(y))}{(1+sco(x))(1+sco(y))}\). The larger val is, the more likely x and y belong to different clusters, and vice versa. Moreover, in Y we tend to select pairs with more related constraints to strengthen the current clusters, while in X we try to separate clusters by considering pairs with fewer related constraints. We choose the \(\beta /2\) non-overlapping pairs of X with the largest val and the \(\beta /2\) non-overlapping pairs of Y with the smallest val, in order to maximize the changes in the clustering results (inside and between clusters); a sketch of these scores follows below. Concretely, once a pair (a, b) is chosen, all pairs starting or ending at a or b are no longer considered, which enhances constraint diversity and can lead to better performance. If all pairs are excluded, we select the remainder randomly.
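To make the selection criterion concrete, the sketch below spells out sco(a) from Eq. (4) and the two val(x, y) scores as plain functions. The entropy, violation counts and weight sums are assumed to be precomputed per object, and all names are our own.

```java
class PairRanking {
    /** Uncertainty sco(a) of Eq. (4). */
    static double sco(double entropyOfMuNN, double violatedMl, double violatedCl,
                      double ml, double cl) {
        return entropyOfMuNN + (violatedMl + violatedCl) / (ml + cl + 1);
    }

    /** val(x, y) for a pair in the same cluster (set X): large values suggest a split. */
    static double valWithin(double sqDist, double scoX, double scoY,
                            double degX, double degY) { // degX = ml(x) + cl(x)
        return sqDist * (1 + scoX) * (1 + scoY) / ((1 + degX) * (1 + degY));
    }

    /** val(x, y) for a pair across clusters (set Y): small values suggest a merge. */
    static double valBetween(double sqDist, double scoX, double scoY,
                             double degX, double degY) {
        return sqDist * (1 + degX) * (1 + degY) / ((1 + scoX) * (1 + scoY));
    }
}
```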

We show the \(\beta \) pairs to users to ask for the constraint types, add their feedback to the constraint set, and update the clusters, until the total number of queries exceeds the predefined budget \(\delta \), as illustrated in Fig. 1.
Fig. 3.

(A) Constraint inheritance from (pq) to (ab). (B) The effect of the object b on its neighbors

Constraint Inheritance in Border. To further reduce the number of queries to users, the general idea is to infer new constraints automatically based on the annotated ones. Our inheritance scheme is based on the concept of \(\mu \) nearest neighbors below.

Let h be the distance between an object p and its \(\mu \)-th nearest neighbor. The influence of p on its neighbor x is formulated by a triangular kernel function \(\phi _h(p,x)\) centered at p, as in Fig. 3. Given a constraint \((p,q,w_{pq})\), for all \(a \in \) \(\mu nn(p)\) and \(b \in \) \(\mu nn(q)\), we add \((a,b,w_{ab})\) to the constraint set, where \(w_{ab}\) is defined as:
$$\begin{aligned} w_{ab} = w_{pq} \phi _h(p,a) \phi _h(q,b) \end{aligned}$$
(5)
The general intuition is that the label of an object a tends to be consistent with those of its closest neighbors (an assumption commonly used in classification, e.g., nearest-neighbor classification [12]). This scheme is expected to increase the clustering quality, especially when combined with the active learning approach of the algorithm Border described above.
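The inherited weight of Eq. (5) can be computed as in the sketch below. We assume the standard triangular kernel \(\phi _h(p,x) = \max (0, 1 - dist(p,x)/h)\), with h the distance from p to its \(\mu \)-th nearest neighbor as described above; the class and method names are our own.

```java
class ConstraintInheritance {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
        return Math.sqrt(s);
    }

    /** Triangular kernel phi_h(center, x): 1 at the center, 0 beyond distance h. */
    static double triangularKernel(double[] center, double[] x, double h) {
        return Math.max(0, 1 - dist(center, x) / h);
    }

    /** Weight of the constraint (a, b) inherited from (p, q, w_pq), Eq. (5). */
    static double inheritedWeight(double wPQ, double[] p, double[] a, double hP,
                                  double[] q, double[] b, double hQ) {
        return wPQ * triangularKernel(p, a, hP) * triangularKernel(q, b, hQ);
    }
}
```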
During the inheritance scheme, if a pair of objects (a, b) inherits from two constraints (c, d) and (e, f) with inherited weights \(w_{1}\) and \(w_2\), respectively, its weight and type are determined as follows:
$$\begin{aligned} w_{ab}={\left\{ \begin{array}{ll} \max (w_{1}, w_{2}) &{} \text {if } type(c,d) = type(e,f) \\ |w_{1} - w_{2}| &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)
where type(c, d) is the constraint type of (c, d) (either must-link or cannot-link), and type(a, b) is determined by type(c, d) if \(w_{1} > w_{2}\) and by type(e, f) otherwise. The general idea is that if (a, b) is influenced by two constraints of different kinds, it follows the one with the higher influence. Note that if (a, b) belongs to the main constraint set, we exclude it from the constraint inheritance scheme, since it is annotated by users and thus considered confident.
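The combination rule of Eq. (6) is a direct case split; a minimal sketch (with a constraint-type enum of our own naming) follows.

```java
class InheritedCombine {
    enum Type { MUST_LINK, CANNOT_LINK }

    /** Combined weight: max if both parents agree, absolute difference otherwise. */
    static double combinedWeight(double w1, Type t1, double w2, Type t2) {
        return (t1 == t2) ? Math.max(w1, w2) : Math.abs(w1 - w2);
    }

    /** The resulting type follows the parent with the stronger influence. */
    static Type combinedType(double w1, Type t1, double w2, Type t2) {
        return (w1 > w2) ? t1 : t2;
    }
}
```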

Updating Clusters. At each iteration, instead of performing the clustering again to incorporate the new set of constraints, we propose to update the result incrementally to save runtime. To do so, we only need to take the old cluster centers and update them following Eq. 2 with the updated constraint set. The intuition is that new constraints are more likely to change clusters locally; thus, starting from the current state makes the algorithm converge faster. In Sect. 4, we show that this updating scheme acquires the same quality but converges much faster than re-clustering from scratch.

Complexity Analysis. Like CVQE+, Border has O(n) time and space complexity at each iteration, and thus \(O(\delta n / \beta )\) time complexity overall, where \(\delta \) is the budget limitation and \(\beta \) is the number of pairs queried at each iteration, as described above.

3.3 Temporal Smoothness Constraints

The general idea of temporal smoothness [4] is that clusters should not only have high quality in each snapshot but also not change much between consecutive time frames. It is useful in many applications where the transition between different snapshots should be smooth for consistency.

Temporal Smoothness. To ensure the smoothness, we re-define the cost of cluster \(C_i\) of snapshot \(S_s\) in Eq. 1 by enforcing a historical cost from its previous snapshot as follows:
$$\begin{aligned} TCost_{VQE_i} = (1 - \alpha ) Cost_{VQE_i} + \alpha Hist(C_i, S_{s-1}) \end{aligned}$$
(7)
where \(Hist(C_i, S_{s-1})\) is the historical cost of cluster \(C_{i}\) between two snapshots \(S_s\) and \(S_{s-1}\), and \(\alpha \) is a regulation factor that balances the current clustering quality and the historical cost. This cost keeps the new clusters from deviating too much from the clusters of the previous snapshot while performing clustering. We define the historical cost as follows:
$$\begin{aligned} Hist(C_i, S_{s-1}) = (c_i - \psi (C_i, S_{s-1}))^2 \end{aligned}$$
(8)
where \(\psi (C_i, S_{s-1})\) returns the closest cluster center to \(C_i\) in snapshot \(S_{s-1}\). Obviously, if two clustering results are too different, as indicated by a high historical cost, the penalty will be higher, thus forcing the algorithm to lower the overall cost by creating clusters closer to those of the previous snapshot.
Taking the derivative of Eq. (7) as for Eq. (1), we can update the cluster centers as follows:
$$\begin{aligned} c_i = \frac{(1-\alpha )A+\alpha \psi (C_i, S_{s-1})}{(1-\alpha )B+\alpha } \end{aligned}$$
(9)
where A and B are respectively the numerator and the denominator given in Eq. 2 for updating clusters.
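A minimal sketch of the smoothed center update in Eq. (9), assuming A and B are the numerator and denominator of Eq. (2) accumulated by the caller and psi is the closest center from the previous snapshot:

```java
class TemporalCenterUpdate {
    /** Eq. (9): blends the current-snapshot update with the previous center. */
    static double[] smoothedCenter(double[] A, double B, double[] psi, double alpha) {
        double[] c = new double[A.length];
        double den = (1 - alpha) * B + alpha;
        for (int d = 0; d < A.length; d++)
            c[d] = ((1 - alpha) * A[d] + alpha * psi[d]) / den;
        return c;
    }
}
```

Setting \(\alpha =0\) recovers the unsmoothed update of Eq. (2), while \(\alpha =1\) freezes the centers at their closest predecessors.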
Constraint Propagation. Whenever we have a new constraint \((x,y,w_{xy})\) in snapshot \(S_s\), we propagate it to every snapshot \(S_{s'}\) with \(s' > s\) if \(x, y \in D_{s'}\). The intuition is that if x and y are linked (either by a must-link or a cannot-link) in \(S_s\), they are likely to be linked in \(S_{s'}\) as well. Thus we add the constraint \((x,y,w'_{xy})\) to \(S_{s'}\) where:
$$\begin{aligned} w'_{xy} = w_{xy} \frac{te_s - ts_{s'}}{te_{s'}-ts_s} \end{aligned}$$
(10)
where \((te_s - ts_{s'})/(te_{s'}-ts_s)\) is a time-fading factor. Like the inheritance scheme, this scheme helps to increase the clustering quality by putting more constraints into the clustering algorithm. Since propagated constraints are not user-annotated, we treat them as non-confident constraints in our model and do not build offspring for them, unlike constraints in the main constraint set, as described in the inheritance scheme above.
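The propagated weight of Eq. (10) is a one-liner; the sketch below names the snapshot boundaries explicitly (our naming):

```java
class ConstraintPropagation {
    /** Eq. (10): weight of a constraint propagated from S_s to a later S_s'. */
    static double propagatedWeight(double wXY, double teS, double tsSPrime,
                                   double teSPrime, double tsS) {
        return wXY * (teS - tsSPrime) / (teSPrime - tsS);
    }
}
```

A constraint propagated over a longer time gap thus receives a smaller weight, fading its influence on later snapshots.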

4 Experiments

Experiments are conducted on a workstation with a 4.0 GHz CPU and 32 GB RAM using Java. We use six datasets, Iris, Ecoli, Seeds, Libras, Optdigits, and Wdbc, acquired from the UCI archives. The numbers of clusters k are taken from the ground truths. Constraint queries are also simulated from the ground truths by adding a must-link if two objects have the same label and a cannot-link if they have different labels. We use Normalized Mutual Information (NMI) [16] for assessing the clustering quality. The NMI score lies in [0, 1], where 1 means a perfect clustering result compared to the ground truth. All results are averaged over 10 runs.

4.1 Constrained Clustering

Performance of CVQE+. Figure 4 compares CVQE+ with existing techniques, including kMeans, MPCK-Means [2], CVQE [8] and LCVQE [17], over different sets of randomly selected constraints. CVQE+ consistently outperforms or is comparable to CVQE and the others on most datasets (except the Libras dataset, where it is outperformed by LCVQE), especially when the number of constraints is large. This can be explained by the way CVQE+ assigns objects to clusters. By considering all related constraints while assigning cluster labels to objects, it can better optimize the overall cost function, thus leading to better clustering quality. Compared to its predecessor CVQE, and to LCVQE, it deals well with constraint overlap (constraints that share the same objects), which increases with the number of constraints. Note that when the constraint set is empty, CVQE+ produces a clustering similar to that of kMeans; thus, the clustering quality does not start from 0.
Fig. 4.

Performance of CVQE+ compared to others

Fig. 5.

Effect of noisy constraints on CVQE+ (left), CVQE (middle) and LCVQE (right) for the Iris and Wdbc datasets

Noise Robustness. To study the effect of noisy constraints on CVQE+, we randomly choose some constraints and flip them from must-link to cannot-link and vice versa. Figure 5 shows the clustering quality of the different algorithms w.r.t. the percentage of noisy constraints, from 2% to 8%, for real datasets. As we can see, for all algorithms, the clustering quality decreases as the number of noisy constraints increases. However, CVQE+ tends to be more affected by noise than its related techniques CVQE and LCVQE. Though its point assignment scheme helps to increase the clustering quality, as discussed above, it makes CVQE+ more sensitive to noise, since one noisy constraint affects the assignment cost of all its related constraints. Nevertheless, on our experimental data, CVQE+ still acquires better (or equivalent) clustering results than CVQE and LCVQE under the same noise conditions in most cases, as seen in Fig. 5. However, we use at most 500 constraints in our experiments; if the number of noisy constraints becomes larger, CVQE+ may not remain the winner. Developing an effective algorithm to cope with noisy constraints is thus an interesting target to pursue.
Fig. 6.

Effect of constraint types on CVQE+ (left), CVQE (middle) and LCVQE (right) for the Ecoli and Libras datasets

Effect of Constraint Types. Figure 6 shows the performance of CVQE+ and its related techniques CVQE and LCVQE when the share of must-link constraints increases from 20% to 80% of the constraint sets. The clustering quality of CVQE+ and CVQE increases with the number of must-link constraints, while that of LCVQE decreases. This can be explained by the ways they calculate the constraint violation costs for the must-link and especially the cannot-link constraints. LCVQE treats violated cannot-link constraints more properly than CVQE and CVQE+, and thus deals better with a higher number of those constraints.

4.2 Active Constraint Selection

We study the performance of Border in comparison with other state-of-the-art active learning techniques. Unless otherwise stated, the budget limitation \(\delta \) is set to 200, the query size \(\beta =10\) and the neighborhood size \(\mu =4\).
Fig. 7.

Comparison among different active learning techniques

Fig. 8.

Runtimes of different techniques

Active Constraint Selection. Figure 7 shows comparisons between Border, NPU [19], Huang [19] (a modified version of [13] for working with non-document data), Min-max [15], Explore-Consolidate [1], and a randomized method (Huang and Consolidate are removed from Fig. 7 for readability). Border acquires better results than the others on Libras, Wdbc and Optdigits, and comparable results on Iris and Ecoli. On the Seeds dataset, it is outperformed by NPU. The difference arises because Border tends to strengthen existing clusters by fortifying both the cluster borders and the inter-connectivity of groups of objects, rather than connecting a single object to existing components as NPU and Huang do. Moreover, since it iteratively studies the clustering results for selecting constraints, it performs better than non-iterative methods like Consolidate and Min-max.

Runtime Comparison. To study the runtime of Border on large-scale datasets, we create five synthetic datasets of sizes 2000 to 10000, each consisting of 5 Gaussian clusters, and measure the time for acquiring 100 constraints. The results are shown in Fig. 8. Border is orders of magnitude faster than the other methods in selecting pairs to query. For 1000 objects, Border takes 0.1 s while NPU and Min-max need 439.4 and 3.0 s, i.e., they are 4394 and 30 times slower than Border. For 10000 objects, Border, NPU and Min-max consume 0.18, 5216.3 and 18.2 s, respectively. This is because Border does not evaluate all pairs of objects at each iteration; thus, it does much less work than the others and is faster. Besides, NPU and Min-max are implemented in Matlab, which is slower than Border's Java. Nevertheless, the higher the number of objects and constraints, the larger the runtime differences: for 10000 objects, Border is around 28979.4 and 101.1 times faster than NPU and Min-max, respectively. Hence, its runtime performance makes Border an effective technique to cope with very large datasets.

Cluster Update. Figure 9 shows the NMI and the number of iterations of our algorithm for the Ecoli dataset. The NMI scores are comparable, while it takes fewer iterations for our algorithm to converge in its update mode.

Effect of the Block Size \(\beta \). Figure 10 shows the performance of Border when the query block size \(\beta \) varies from 10 to 30. As we can see, the smaller the value of \(\beta \), the better the performance of Border, since the cluster structure is assessed more frequently, leading to better constraints being selected at each iteration.

Effect of the Constraint Inheritance Scheme. Figure 11 shows the effect of the parameter \(\mu \) on our algorithm Border via the inheritance scheme. Typically, its performance increases with \(\mu \) until it reaches a peak and then decreases, as shown for the Iris dataset. This can be explained by the neighborhood influence scheme of Border: when \(\mu \) is large enough, the number of wrong constraints increases, thus lowering the performance of Border. Unfortunately, the peak value of \(\mu \) is highly dataset-dependent and hard to predict. Taking the Optdigits dataset as an example, the performance of Border still increases at \(\mu =5\), whereas Border already starts to perform worse on the Seeds dataset at \(\mu =3\). In our experiments, we observe that values of \(\mu \) around 2 to 4 are good overall for most datasets. Thus, we choose \(\mu =4\) as the default value.
Fig. 9.

Update vs. fully reclustering for the Ecoli dataset

Fig. 10.

The effect of the query block size \(\beta \) on the performance of Border

Fig. 11.

The effect of the neighborhood size \(\mu \) on the performance of Border

4.3 Temporal Clustering

To study the temporal clustering results, we divide the datasets into different snapshots and measure the clustering quality using the ground truths provided for the full datasets.
Fig. 12.

Temporal clustering on the dataset Optdigits

Temporal Clustering. Figure 12 shows the active temporal clustering results for three snapshots of the Optdigits dataset (we set \(\alpha =0.5\)). As we see, our active learning scheme can help boost the clustering quality inside each snapshot compared to the original kMeans or a randomized constraint selection method. With the constraint propagation scheme (Border-Propagation), the clustering results are further boosted compared to Border. For example, in Snapshots 2 and 3, Border-Propagation performs much better than Border without the constraint propagation scheme. Since we only consider forward propagation, the clustering result in Snapshot 3 is more affected than those in Snapshots 2 and 1; for example, in Snapshot 3 the difference between Border and Border-Propagation is much higher than in Snapshot 2. If desired, the algorithm can easily be extended with a backward propagation scheme.

5 Related Work

Constrained Clustering. There are many constrained clustering algorithms, such as MPCK-Means [2], CVQE [8] and LCVQE [17]. These techniques optimize an objective function consisting of the clustering quality and the constraint violation cost, like our algorithm CVQE+. CVQE+ is an extension of CVQE [8] in which we extend the cost model to deal with weighted constraints, make the must-link violation cost symmetric, and change the way each constraint is assigned to clusters by considering all of its related constraints. This makes cluster assignment more stable, thus enhancing the clustering quality. Interested readers are referred to [7] for a comprehensive survey of constrained clustering methods.

Active Learning. Most existing techniques employ active learning to acquire a desired constraint set before or during clustering. In [1], the authors introduce the Explore-Consolidate algorithm, which selects constraints by exploiting the connected components of must-link constraints. Min-max [15] extends the Consolidate phase of [1] by querying the most uncertain objects rather than randomly selected ones. These techniques produce constraint sets before clustering; thus, they cannot exploit the cluster labels to further enhance performance. Huang et al. [13] introduce a framework that iteratively generates constraints and updates clustering results until a query budget is reached. However, it is limited to a probabilistic document clustering algorithm. NPU [19] also uses connected components of must-link constraints as a guideline for finding the most uncertain objects. Constraints are then collected by querying these objects against existing connected components, as in the Consolidate phase of [1]. Though more effective than pre-selection techniques, these methods typically have quadratic runtime, which, in contrast to Border, makes them infeasible for large datasets. Moreover, Border relies on border objects around clusters to build constraints rather than on must-link graphs [1, 19]. The inheritance approach is closely related to the constraint propagation in the multi-view clustering algorithms of [10, 11] for transferring constraints among different views. The major difference is that we use \(\mu \)-nearest neighbors rather than \(\epsilon \)-neighborhoods, which are limited to Gaussian clusters and can lead to an excessive number of constraints.

Temporal Clustering. Temporal smoothness was introduced in the evolutionary clustering framework [4] for making clustering results stable over time. We significantly extend this framework by incorporating instance-level constraints, active query selection and constraint propagation to further improve clustering quality while minimizing constraint annotation effort.

6 Conclusion

We introduce a novel scalable framework which incorporates an iterative active learning scheme, instance-level constraints and temporal smoothness constraints for coping with large temporal data. Experiments show that our constrained clustering algorithm, CVQE+, performs better than existing techniques such as CVQE [8], LCVQE [17] and MPCK-Means [2]. By exploring border objects and propagating constraints via nearest neighbors, our active learning algorithm, Border, produces good clustering results with much smaller constraint sets than other methods such as NPU [19] and Min-max [15]. Moreover, it is orders of magnitude faster, making it possible to cope with large datasets. Finally, we revisit our approach in the context of evolutionary clustering, adding a temporal smoothness constraint and a time-fading factor to our constraint propagation among different data snapshots. Our future work aims at providing more expressive support for user feedback. We are currently using our framework to track the group evolution of our patient data with sleep disorder symptoms.


Acknowledgment

This work is supported by the CDP Life Project.

References

  1. Basu, S., Banerjee, A., Mooney, R.J.: Active semi-supervision for pairwise constrained clustering. In: SDM, pp. 333–344 (2004)
  2. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-supervised clustering. In: ICML (2004)
  3. Birgé, L., Rozenholc, Y.: How many bins should be put in a regular histogram. ESAIM: Probab. Stat. 10, 24–45 (2006)
  4. Chakrabarti, D., Kumar, R., Tomkins, A.: Evolutionary clustering. In: SIGKDD, pp. 554–560 (2006)
  5. Cohn, D., Caruana, R., Mccallum, A.: Semi-supervised clustering with user feedback. Technical report (2003)
  6. Davidson, I.: Two approaches to understanding when constraints help clustering. In: KDD, pp. 1312–1320 (2012)
  7. Davidson, I., Basu, S.: A survey of clustering with instance level constraints. TKDD (2007)
  8. Davidson, I., Ravi, S.S.: Clustering with constraints: feasibility issues and the k-means algorithm. In: SDM, pp. 138–149 (2005)
  9. Davidson, I., Ravi, S.S., Ester, M.: Efficient incremental constrained clustering. In: KDD, pp. 240–249 (2007)
  10. Eaton, E., desJardins, M., Jacob, S.: Multi-view clustering with constraint propagation for learning with an incomplete mapping between views. In: CIKM, pp. 389–398 (2010)
  11. Eaton, E., desJardins, M., Jacob, S.: Multi-view constrained clustering with an incomplete mapping between views. Knowl. Inf. Syst. 38(1), 231–257 (2014)
  12. Han, J.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco (2005)
  13. Huang, R., Lam, W.: Semi-supervised document clustering via active learning with pairwise constraints. In: ICDM, pp. 517–522 (2007)
  14. Huang, Y., Mitchell, T.M.: Text clustering with extended user feedback. In: SIGIR, pp. 413–420 (2006)
  15. Mallapragada, P.K., Jin, R., Jain, A.K.: Active query selection for semi-supervised clustering. In: ICPR, pp. 1–4 (2008)
  16. Nguyen, X.V., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: ICML, pp. 1073–1080 (2009)
  17. Pelleg, D., Baras, D.: K-means with large and noisy constraint sets. In: Kok, J.N., Koronacki, J., Mantaras, R.L., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 674–682. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74958-5_67
  18. Chouakria, A.D., Mai, S.T., Amer-Yahia, S.: Scalable active temporal constrained clustering. In: EDBT (2018)
  19. Xiong, S., Azimi, J., Fern, X.Z.: Active learning of constraints for semi-supervised clustering. IEEE Trans. Knowl. Data Eng. 26(1), 43–54 (2014)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Son T. Mai
    • 1
  • Sihem Amer-Yahia
    • 1
  • Ahlame Douzal Chouakria
    • 1
  • Ky T. Nguyen
    • 1
  • Anh-Duong Nguyen
    • 2
  1. 1.CNRSUniv. Grenoble AlpesGrenobleFrance
  2. 2.University of Rennes 1RennesFrance
