Scalable Active Constrained Clustering for Temporal Data
Abstract
In this paper, we introduce a novel interactive framework to handle both instance-level and temporal smoothness constraints for clustering large temporal data. It consists of a constrained clustering algorithm, called CVQE+, which optimizes the clustering quality, the constraint violation cost and the historical cost between consecutive data snapshots. At the center of our framework is a simple yet effective active learning technique, named Border, for iteratively selecting the most informative pairs of objects to query users about, and updating the clustering with the new constraints. Those constraints are then propagated inside each data snapshot and between snapshots via two schemes, called constraint inheritance and constraint propagation, to further enhance the results. Experiments show better or comparable clustering results than state-of-the-art techniques as well as high scalability for large datasets.
Keywords
Semi-supervised clustering · Active learning · Interactive clustering · Incremental clustering · Temporal clustering
1 Introduction
In semi-supervised clustering, domain knowledge is typically encoded in the form of instance-level must-link and cannot-link constraints [9] for aiding the clustering process, thus enhancing the quality of results. Such constraints specify that two objects must be placed or must not be placed in the same cluster, respectively. Constraints have been successfully applied to improve clustering quality in real-world applications, e.g., identifying people from surveillance cameras [9] and aiding robot navigation [8]. However, current research on constrained clustering still faces several major issues described below.
Most existing approaches assume that we have a set of constraints beforehand, and an algorithm will use this set to produce clusters [2, 8]. Davidson [6] shows that the clustering quality varies significantly between different equi-size sets of constraints. Moreover, annotating constraints requires human intervention, an expensive and time-consuming task that should be minimized as much as possible for the same expected clustering quality. Therefore, how to choose a good and compact set of constraints, rather than randomly selecting them from the data, has been the focus of many research efforts, e.g., [1, 15, 19].
Many approaches employ different active learning schemes to select the most meaningful pairs of objects and then query experts for constraint annotation [1, 15]. By allowing the algorithms to choose constraints themselves, we can avoid insignificant ones, and expect to obtain high-quality and compact constraint sets compared to the randomized scheme. These constraints are then used as input for constrained clustering algorithms to operate on. However, if users are not satisfied with the results, they are asked to provide another constraint set and start the clustering again, which is obviously time-consuming and expensive.
Other algorithms follow a feedback scheme which does not require a full set of constraints in the beginning [5]. They iteratively produce clusters with their available constraints, show results to users, and get feedback in the form of new constraints. By iteratively refining clusters according to user feedback, the acquired results fit users' expectations better [5]. Constraints are also easier to select with an underlying cluster structure as a guideline, thus reducing the overall number of constraints and the human annotation effort for the same quality level. However, exploring the whole data space to find meaningful constraints is also a non-trivial task for users.
To reduce human effort, several methods incorporate active learning into the feedback process, e.g., [13, 14, 15, 19]. At each iteration, the algorithm automatically chooses pairs of objects and queries users for their feedback in terms of must-link and cannot-link constraints, instead of leaving the whole clustering result for users to examine. Though these active feedback techniques are proven to be very useful in real-world tasks such as document clustering [13], they suffer from very high runtime since they have to repeatedly perform clustering as well as explore all \(O(n^2)\) pairs of objects to generate queries to users each time.
In this paper and its preliminary version [18], we develop an efficient framework to cope with the above problems following the iterative active learning approach of [13, 19]. However, instead of examining all pairs of objects, our technique, called Border, selects a small set of objects around cluster borders and queries users about the most uncertain pairs of objects. We also introduce a constraint inheritance approach based on the notion of \(\mu \)-nearest neighbors for inferring additional constraints, thus further boosting performance. Finally, we revisit our approach in the context of evolutionary clustering [4]. Evolutionary clustering aims to produce high-quality clusters while ensuring that the clustering does not change dramatically between consecutive timestamps. This scheme is very useful in many application scenarios. For example, doctors want to track groups of patients based on their treatment progress each year. They may expect that existing groups do not change much over time if there are only minor changes in the data. However, the clustering process should be able to reflect the changes if there are significant differences in the new data. Therefore, we propose to incorporate a temporal smoothness constraint into our framework and to add a time-fading factor to our constraint propagation.
Our contributions are summarized as follows:
We introduce a new algorithm, CVQE+, which extends CVQE [8] with weighted must-link and cannot-link constraints and a new object assignment scheme.

We propose a new algorithm, Border, which relies on active clustering and constraint inheritance to choose a small number of objects to solicit user feedback for. Besides the active selection scheme for pairs of objects, Border employs a constraint inheritance method for inferring more constraints, thus further enhancing performance.

We present an evolutionary clustering framework which incorporates instance-level and temporal smoothness constraints for temporal data. To the best of our knowledge, ours is the first framework that combines active learning, instance-level constraints and temporal smoothness constraints.

Experiments are conducted on six real datasets to demonstrate the performance of our algorithms over state-of-the-art ones.
2 Problem Formulation
Let \(D = \{(d,t)\}\) be a set of \(|D|\) vectors \(d \in \mathbb {R}^p\) observed at time t. Let \(S=\{(S_s,D_s,ts_{s},te_{s})\}\) be a set of \(|S|\) preselected data snapshots. Each \(S_s\) starts at time \(ts_{s}\), ends at time \(te_{s}\) and contains a set of objects \(D_s = \{(d,t) \in D \mid ts_{s} \le t < te_s\}\). Two snapshots \(S_s\) and \(S_{s+1}\) may overlap but must satisfy the time order, i.e., \(ts_{s} \le ts_{s+1}\) and \(te_{s} \le te_{s+1}\). For each snapshot \(S_s\), let \(ML_s = \{(x,y,w_{xy}) \mid (x,y) \in D_s^2 \}\) and \(CL_s = \{(x,y,w_{xy}) \mid (x,y) \in D_s^2 \}\) be the sets of must-link and cannot-link constraints of \(S_s\), each with a degree of belief \(w_{xy} \in [0,1]\). Initially, \(ML_s\) and \(CL_s\) can be empty.
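The data model above can be sketched in code as follows. This is a minimal Python sketch for illustration only; the names (`Constraint`, `Snapshot`, `build_snapshot`) are ours, not from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Constraint:
    x: int            # index of the first object in the snapshot
    y: int            # index of the second object
    w: float = 1.0    # degree of belief w_xy in [0, 1]

@dataclass
class Snapshot:
    ts: float                                    # start time ts_s
    te: float                                    # end time te_s
    objects: list = field(default_factory=list)  # vectors d observed in [ts, te)
    ml: list = field(default_factory=list)       # must-link set ML_s
    cl: list = field(default_factory=list)       # cannot-link set CL_s

def build_snapshot(data, ts, te):
    """Collect all (d, t) pairs with ts <= t < te into one snapshot."""
    return Snapshot(ts=ts, te=te,
                    objects=[d for (d, t) in data if ts <= t < te])
```

Two snapshots built this way may share objects as long as their start and end times respect the time order defined above.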
In this paper, we focus on the problem of grouping objects in all snapshots into clusters. Our goals are (1) reduce the number of constraints thus reducing the constraint annotation costs (2) make the algorithm scale well with large datasets and (3) smooth the gap between clustering results of two consecutive snapshots, i.e., ensure temporal smoothness.
3 Our Proposed Framework
3.1 Constrained Clustering Algorithm
For each snapshot \(S_s\), we use constrained k-Means for grouping objects. Generally, any existing technique such as MPCK-Means [2], CVQE [8] or LCVQE [17] can be used. Here we introduce CVQE+, an extension of CVQE [8] that copes with weighted constraints, to do the task.
Complexity Analysis. Let n be the number of objects, m the number of constraints and k the number of clusters. CVQE+ has time complexity \(O(rkn + rk^2m^2)\), which is higher than the \(O(rkn + rk^2m)\) of CVQE because all related constraints must be examined while assigning a constraint, where r is the number of iterations of the algorithm. Since k and m are constants, CVQE+ thus has time complexity linear in the number of objects n. It also requires O(n) space for storing objects and constraints.
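To make the objective concrete, the following is a minimal Python sketch of a CVQE-style assignment step: each object goes to the cluster minimizing its squared distance plus weighted violation penalties. The penalty terms here are simplified stand-ins, not the paper's exact Eq. 1, and the function names are ours.

```python
import numpy as np

def assign(points, centers, ml, cl):
    """Greedily assign each point to the cluster minimizing squared distance
    plus weighted constraint-violation cost. ml and cl are lists of
    (x, y, w) index triples; earlier-indexed partners are assumed assigned."""
    labels = np.empty(len(points), dtype=int)
    for i, p in enumerate(points):
        best, best_cost = 0, float("inf")
        for c in range(len(centers)):
            cost = np.sum((p - centers[c]) ** 2)
            for (x, y, w) in ml:   # violated must-link: pay w * center gap
                j = y if x == i else x if y == i else None
                if j is not None and j < i and labels[j] != c:
                    cost += w * np.sum((centers[c] - centers[labels[j]]) ** 2)
            for (x, y, w) in cl:   # violated cannot-link: simplified penalty
                j = y if x == i else x if y == i else None
                if j is not None and j < i and labels[j] == c:
                    cost += w * np.sum((p - centers[c]) ** 2)
            if cost < best_cost:
                best, best_cost = c, cost
        labels[i] = best
    return labels
```

With empty constraint sets this degenerates to plain k-Means assignment; weighted constraints simply tilt the per-cluster costs.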
3.2 Active Constraint Selection
We introduce an active learning method, called Border, for selecting pairs of objects and querying users for their constraint types. The general idea is to examine objects lying around the borders of clusters, since they are the most uncertain ones, and to choose a block of \(\beta \) pairs of objects to query users with until the query budget \(\delta \) is reached. Here, \(\beta \) and \(\delta \) are predefined constants.
We divide the \(m^2 = O(n)\) pairs of selected objects into two sets: the set of inside-cluster pairs X and the set of between-cluster pairs Y, i.e., for all \((x,y) \in X : label(x) = label(y)\) and for all \((x,y) \in Y : label(x) \ne label(y)\). A pair \((x,y) \in X\) is sorted by \(val(x,y)=\frac{\Vert x-y\Vert ^2 (1+sco(x))(1+sco(y))}{(1+ml(x)+cl(x))(1+ml(y)+cl(y))}\). For \((x,y) \in Y\), \(val(x,y)=\frac{\Vert x-y\Vert ^2 (1+ml(x)+cl(x))(1+ml(y)+cl(y))}{(1+sco(x))(1+sco(y))}\). The larger val is, the more likely x and y belong to different clusters, and vice versa. Moreover, in Y, we tend to select pairs with more related constraints to strengthen the current clusters, while we try to separate clusters in X by considering pairs with fewer related constraints. We choose the top \(\beta /2\) non-overlapping pairs of X with the largest val and the top \(\beta /2\) non-overlapping pairs of Y with the smallest val in order to maximize the changes in the clustering results (inside and between clusters). Concretely, once a pair (a, b) has been chosen, all pairs starting or ending with a or b are no longer considered; this enhances constraint diversity, which can lead to better performance. If all pairs are excluded, we select the remainder randomly.
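The non-overlapping selection described above can be sketched as a greedy pick over pre-scored pairs. This is our reading of the procedure, not the authors' code; each candidate is a `(val, i, j)` triple whose score the caller has already computed with the formulas above.

```python
def pick_nonoverlapping(pairs, k, largest):
    """Pick up to k pairs by val, skipping any pair that shares an object
    with an already-chosen pair (constraint diversity)."""
    used, out = set(), []
    for v, i, j in sorted(pairs, key=lambda t: t[0], reverse=largest):
        if i in used or j in used:
            continue
        out.append((i, j))
        used.update((i, j))
        if len(out) == k:
            break
    return out

def select_queries(X, Y, beta):
    """beta/2 largest-val inside-cluster pairs plus beta/2 smallest-val
    between-cluster pairs."""
    return (pick_nonoverlapping(X, beta // 2, largest=True)
            + pick_nonoverlapping(Y, beta // 2, largest=False))
```

The random fallback for the case where every remaining pair is excluded is omitted for brevity.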
Constraint Inheritance in Border. To further reduce the number of queries to users, the general idea is to infer new constraints automatically from annotated ones. Our inheritance scheme is based on the concept of \(\mu \)-nearest neighbors below.
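A minimal sketch of \(\mu \)-nearest-neighbor inheritance follows, under the assumption that a constraint annotated on (x, y) with weight w is passed on to the \(\mu \) nearest neighbors of x and of y with a decayed weight; the decay factor is an illustrative assumption, not the paper's exact rule.

```python
import numpy as np

def mu_nearest(points, i, mu):
    """Indices of the mu nearest neighbors of object i (squared distance)."""
    d = np.sum((points - points[i]) ** 2, axis=1)
    return [j for j in np.argsort(d) if j != i][:mu]

def inherit(points, constraints, mu, decay=0.5):
    """For each (x, y, w), emit inherited constraints (x', y, w*decay) for
    neighbors x' of x, and (x, y', w*decay) for neighbors y' of y."""
    extra = []
    for (x, y, w) in constraints:
        for nx in mu_nearest(points, x, mu):
            if nx != y:
                extra.append((nx, y, w * decay))
        for ny in mu_nearest(points, y, mu):
            if ny != x:
                extra.append((x, ny, w * decay))
    return extra
```

The decayed weight plugs directly into the degree of belief \(w_{xy}\) of the constraint sets defined in Sect. 2, so inherited constraints count for less than annotated ones.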
Updating Clusters. At each iteration, instead of performing clustering from scratch to update the clustering result with the new set of constraints, we propose to update it incrementally to save runtime. To do so, we only need to take the old cluster centers and update them following Eq. 1 with the updated constraint set. The intuition is that new constraints are more likely to change clusters locally. Thus, starting from the current state might make the algorithm converge faster. In Sect. 4, we show that this updating scheme achieves the same quality but converges much faster than re-clustering from scratch.
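The warm-start idea can be sketched as a standard assign/update loop seeded with the previous centers. This is an illustrative sketch: `assign_fn` stands in for whichever (constrained) assignment step is in use, and the names are ours.

```python
import numpy as np

def update_centers(points, labels, k):
    """Recompute each center as the mean of its assigned points."""
    return np.array([points[labels == c].mean(axis=0) for c in range(k)])

def warm_start_kmeans(points, old_centers, assign_fn, max_iter=50):
    """Re-run the assign/update loop starting from the previous centers
    instead of a fresh initialization; new constraints tend to change
    clusters locally, so this usually converges in fewer iterations."""
    centers = old_centers.copy()
    for _ in range(max_iter):
        labels = assign_fn(points, centers)
        new_centers = update_centers(points, labels, len(centers))
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Re-clustering from scratch would instead draw fresh initial centers, discarding the locality of the change.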
Complexity Analysis. Similarly to CVQE+, Border has O(n) time and space complexity at each iteration and thus \(O(\delta n / \beta )\) time complexity overall, where \(\delta \) is the query budget and \(\beta \) is the number of pairs selected at each iteration as described above.
3.3 Temporal Smoothness Constraints
The general idea of temporal smoothness [4] is that clusters should not only have high quality in each snapshot but also should not change much between consecutive time frames. It is useful in many applications where the transition between different snapshots should be smoothed for consistency.
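Following the evolutionary clustering idea of [4], this trade-off is commonly written as a weighted sum of snapshot quality and a historical cost; the sketch below uses squared center drift as one common choice of historical cost, which is an illustrative assumption rather than the paper's exact formulation.

```python
import numpy as np

def historical_cost(centers, prev_centers):
    """Squared drift between matched centers of consecutive snapshots
    (assumes centers are already matched by index)."""
    return float(np.sum((centers - prev_centers) ** 2))

def total_cost(snapshot_cost, hist_cost, alpha):
    """alpha = 1 ignores history entirely; alpha = 0 freezes the previous
    clustering. Intermediate values smooth the transition."""
    return alpha * snapshot_cost + (1 - alpha) * hist_cost
```

In the experiments of Sect. 4, \(\alpha =0.5\) gives equal weight to both terms.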
4 Experiments
Experiments are conducted on a workstation with a 4.0 GHz CPU and 32 GB RAM using Java. We use six datasets, Iris, Ecoli, Seeds, Libras, Optdigits, and Wdbc, acquired from the UCI archives^{1}. The numbers of clusters k are taken from the ground truths. Constraint queries are also simulated from the ground truths by adding a must-link if two objects have the same label or a cannot-link if they have different labels. We use Normalized Mutual Information (NMI) [16] to assess the clustering quality. The NMI score lies in [0, 1], where 1 means a perfect clustering result compared to the ground truth. All results are averaged over 10 runs.
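For reference, NMI can be computed from scratch in a few lines. The sketch below normalizes mutual information by the geometric mean of the label entropies, one of the standard variants discussed in [16]; library implementations such as scikit-learn's `normalized_mutual_info_score` offer the same family of measures.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(u, v):
    """Mutual information of two labelings, normalized by the geometric
    mean of their entropies; invariant to label permutations."""
    n = len(u)
    joint = Counter(zip(u, v))
    pu, pv = Counter(u), Counter(v)
    mi = sum((c / n) * math.log(n * c / (pu[a] * pv[b]))
             for (a, b), c in joint.items())
    hu, hv = entropy(u), entropy(v)
    return mi / math.sqrt(hu * hv) if hu > 0 and hv > 0 else 1.0
```

Because NMI is permutation-invariant, a clustering that matches the ground truth up to a relabeling still scores 1.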
4.1 Constrained Clustering
Effect of Constraint Types. Figure 6 shows the performance of CVQE+ and the related techniques CVQE and LCVQE when the proportion of must-link constraints increases from 20% to 80% of the constraint sets. The clustering quality of CVQE+ and CVQE increases with the number of must-link constraints, while that of LCVQE decreases. This can be explained by the ways they calculate the constraint violation costs for the must-link and especially the cannot-link constraints. LCVQE treats violated cannot-link constraints more properly than CVQE and CVQE+, and thus deals better with a higher proportion of those constraints.
4.2 Active Constraint Selection
Active Constraint Selection. Figure 7 shows comparisons between Border, NPU [19], Huang [13] (modified to work with non-document data), Min-max [15], Explore-Consolidate [1], and a randomized method (Huang and Consolidate are omitted from Fig. 7 for readability). Border acquires better results than the others on Libras, Wdbc and Optdigits, and comparable results on Iris and Ecoli. On the Seeds dataset, it is outperformed by NPU. The difference arises because Border tends to strengthen existing clusters by fortifying both the cluster borders and the inter-connectivity of groups of objects, rather than connecting a single object to existing components as NPU and Huang do. Moreover, since it iteratively studies the clustering results when selecting constraints, it performs better than non-iterative methods like Consolidate and Min-max.
Runtime Comparison. To study the runtime of Border on large-scale datasets, we create five synthetic datasets of sizes 2000 to 10000, each consisting of 5 Gaussian clusters, and measure the time for acquiring 100 constraints. The results are shown in Fig. 8. Border is orders of magnitude faster than the other methods in selecting pairs to query. For 1000 objects, Border takes 0.1 s, while NPU and Min-max need 439.4 and 3.0 s, i.e., they are 4394 and 30 times slower than Border. For 10000 objects, Border, NPU and Min-max consume 0.18, 5216.3 and 18.2 s, respectively. This is because Border does not evaluate all pairs of objects at each iteration; it thus does much less work than the others and is faster. Besides, NPU and Min-max are implemented in Matlab, which is slower than Border's Java implementation. Nevertheless, the larger the number of objects and constraints, the larger the runtime differences become. For 10000 objects, Border is around 28979.4 and 101.1 times faster than NPU and Min-max, respectively. Hence, its runtime performance makes Border an effective technique for coping with very large datasets.
Cluster Update. Figure 9 shows the NMI and the number of iterations of our algorithm for the Ecoli dataset. The NMI scores are comparable, while it takes fewer iterations for our algorithm to converge in its update mode.
Effect of the Block Size \(\beta \). Figure 10 shows the performance of Border when the query block size \(\beta \) varies from 10 to 30. As we can see, the smaller the value of \(\beta \) is, the better the performance of Border since the cluster structure is assessed more frequently, thus leading to better constraints to be selected at each iteration.
4.3 Temporal Clustering
Temporal Clustering. Figure 12 shows the active temporal clustering results for three snapshots of the Optdigits dataset (we set \(\alpha =0.5\)). As we can see, our active learning scheme helps boost the clustering quality inside each snapshot compared to the original k-Means or a randomized constraint selection method. With the constraint propagation scheme (Border-Propagation), the clustering results are further boosted compared to Border. For example, in Snapshots 2 and 3, Border-Propagation performs much better than Border without the constraint propagation scheme. Since we only consider forward propagation, the clustering result in Snapshot 3 is more affected than those in Snapshots 2 and 1; for instance, in Snapshot 3 the difference between Border and Border-Propagation is much higher than in Snapshot 2. If needed, the algorithm can easily be extended with a backward propagation scheme.
5 Related Work
Constrained Clustering. Many constrained clustering algorithms have been proposed, such as MPCK-Means [2], CVQE [8] and LCVQE [17]. These techniques optimize an objective function consisting of the clustering quality and the constraint violation cost, like our algorithm CVQE+. CVQE+ is an extension of CVQE [8] in which we extend the cost model to deal with weighted constraints, make the must-link violation cost symmetric, and change the way each constraint is assigned to clusters by considering all of its related constraints. This makes cluster assignment more stable, thus enhancing the clustering quality. Interested readers are referred to [7] for a comprehensive survey of constrained clustering methods.
Active Learning. Most existing techniques employ active learning to acquire a desired constraint set before or during clustering. In [1], the authors introduce the Explore-Consolidate algorithm, which selects constraints by exploiting the connected components of must-link constraints. Min-max [15] extends the Consolidate phase of [1] by querying the most uncertain objects rather than randomly selecting them. These techniques produce constraint sets before clustering; thus, they cannot exploit the cluster labels to further enhance performance. Huang et al. [13] introduce a framework that iteratively generates constraints and updates clustering results until a query budget is reached. However, it is limited to a probabilistic document clustering algorithm. NPU [19] also uses connected components of must-link constraints as a guideline for finding the most uncertain objects. Constraints are then collected by querying these objects against existing connected components, as in the Consolidate phase of [1]. Though more effective than pre-selection techniques, these methods typically have quadratic runtime, which makes them infeasible for large datasets, in contrast to Border. Moreover, Border relies on border objects around clusters to build constraints rather than on must-link graphs [1, 19]. Our inheritance approach is closely related to the constraint propagation in the multi-view clustering algorithms [10, 11] for transferring constraints among different views. The major difference is that we use \(\mu \)-nearest neighbors rather than \(\epsilon \)-neighborhoods, which are limited to Gaussian clusters and can lead to an excessive number of constraints.
Temporal Clustering. Temporal smoothness was introduced in the evolutionary clustering framework [4] to keep clustering results stable over time. We significantly extend this framework by incorporating instance-level constraints, active query selection and constraint propagation, further improving clustering quality while minimizing constraint annotation effort.
6 Conclusion
We introduce a novel scalable framework which incorporates an iterative active learning scheme, instance-level constraints and temporal smoothness constraints for coping with large temporal data. Experiments show that our constrained clustering algorithm, CVQE+, performs better than existing techniques such as CVQE [8], LCVQE [17] and MPCK-Means [2]. By exploring border objects and propagating constraints via nearest neighbors, our active learning algorithm, Border, produces good clustering results with much smaller constraint sets than other methods such as NPU [19] and Min-max [15]. Moreover, it is orders of magnitude faster, making it possible to cope with large datasets. Finally, we revisit our approach in the context of evolutionary clustering, adding a temporal smoothness constraint and a time-fading factor to our constraint propagation among different data snapshots. Our future work aims at providing more expressive support for user feedback. We are currently using our framework to track the group evolution of our patient data with sleeping disorder symptoms.
Acknowledgment
This work is supported by the CDP Life Project.
References
1. Basu, S., Banerjee, A., Mooney, R.J.: Active semi-supervision for pairwise constrained clustering. In: SDM, pp. 333–344 (2004)
2. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-supervised clustering. In: ICML (2004)
3. Birgé, L., Rozenholc, Y.: How many bins should be put in a regular histogram. ESAIM: Probab. Stat. 10, 24–45 (2006)
4. Chakrabarti, D., Kumar, R., Tomkins, A.: Evolutionary clustering. In: SIGKDD, pp. 554–560 (2006)
5. Cohn, D., Caruana, R., Mccallum, A.: Semi-supervised clustering with user feedback. Technical report (2003)
6. Davidson, I.: Two approaches to understanding when constraints help clustering. In: KDD, pp. 1312–1320 (2012)
7. Davidson, I., Basu, S.: A survey of clustering with instance level constraints. TKDD (2007)
8. Davidson, I., Ravi, S.S.: Clustering with constraints: feasibility issues and the k-means algorithm. In: SDM, pp. 138–149 (2005)
9. Davidson, I., Ravi, S.S., Ester, M.: Efficient incremental constrained clustering. In: KDD, pp. 240–249 (2007)
10. Eaton, E., desJardins, M., Jacob, S.: Multi-view clustering with constraint propagation for learning with an incomplete mapping between views. In: CIKM, pp. 389–398 (2010)
11. Eaton, E., desJardins, M., Jacob, S.: Multi-view constrained clustering with an incomplete mapping between views. Knowl. Inf. Syst. 38(1), 231–257 (2014)
12. Han, J.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco (2005)
13. Huang, R., Lam, W.: Semi-supervised document clustering via active learning with pairwise constraints. In: ICDM, pp. 517–522 (2007)
14. Huang, Y., Mitchell, T.M.: Text clustering with extended user feedback. In: SIGIR, pp. 413–420 (2006)
15. Mallapragada, P.K., Jin, R., Jain, A.K.: Active query selection for semi-supervised clustering. In: ICPR, pp. 1–4 (2008)
16. Nguyen, X.V., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: ICML, pp. 1073–1080 (2009)
17. Pelleg, D., Baras, D.: K-means with large and noisy constraint sets. In: Kok, J.N., Koronacki, J., Mantaras, R.L., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 674–682. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74958-5_67
18. Chouakria, A.D., Mai, S.T., Amer-Yahia, S.: Scalable active temporal constrained clustering. In: EDBT (2018)
19. Xiong, S., Azimi, J., Fern, X.Z.: Active learning of constraints for semi-supervised clustering. IEEE Trans. Knowl. Data Eng. 26(1), 43–54 (2014)