Evolutionary Active Constrained Clustering for Obstructive Sleep Apnea Analysis
Abstract
We introduce a novel interactive framework to handle both instance-level and temporal smoothness constraints for clustering large longitudinal data and for tracking the evolution of clusters over time. It consists of a constrained clustering algorithm, called CVQE+, which optimizes the clustering quality, constraint violation and the historical cost between consecutive data snapshots. At the center of our framework is a simple yet effective active learning technique, named Border, for iteratively selecting the most informative pairs of objects to query users about, and updating the clustering with new constraints. Those constraints are then propagated inside each data snapshot and between snapshots via two schemes, called constraint inheritance and constraint propagation, to further enhance the results. Moreover, a historical constraint is enforced between consecutive snapshots to ensure the consistency of results among them. Experiments show clustering results better than or comparable to those of state-of-the-art techniques as well as high scalability for large datasets. Finally, we apply our algorithm to clustering phenotypes in patients with Obstructive Sleep Apnea as well as to tracking how these clusters evolve over time.
Keywords
Semi-supervised clustering, Active learning, Interactive clustering, Incremental clustering, Temporal clustering, Obstructive Sleep Apnea
1 Introduction
In semi-supervised clustering, domain knowledge is typically encoded in the form of instance-level must-link and cannot-link constraints [11] for aiding the clustering process, thus enhancing the quality of results. Such constraints specify that two objects must be placed or must not be placed in the same cluster, respectively. Constraints have been successfully applied to improve clustering quality in real-world applications, e.g., identifying people from surveillance cameras [11] and aiding robot navigation [10]. However, current research on constrained clustering still faces several major issues described below.
Most existing approaches assume that we have a set of constraints beforehand, and an algorithm will use this set to produce clusters [3, 10]. Davidson et al. [8] show that the clustering quality varies significantly across different equi-size sets of constraints. Moreover, annotating constraints requires human intervention, an expensive and time-consuming task that should be minimized as much as possible given the same expected clustering quality. Therefore, how to choose a good and compact set of constraints rather than randomly selecting them from the data has been the focus of many research efforts, e.g., [2, 32, 43].
Many approaches employ different active learning schemes to select the most meaningful pairs of objects and then query experts for constraint annotation [2, 32]. By allowing the algorithms to choose constraints themselves, we can avoid insignificant ones, and expect to have high quality and compact constraint sets compared to the randomized scheme. These constraints are then used as input for constrained clustering algorithms to operate. However, if users are not satisfied with the results, they must provide another constraint set and then start the clustering again, which is obviously expensive.
Other algorithms follow a feedback schema which does not require a full set of constraints in the beginning [7]. They iteratively produce clusters with their available constraints, show results to users, and get feedback in the form of new constraints. By iteratively refining clusters according to user feedback, the acquired results fit users’ expectations better [7]. Constraints are also easier to select with an underlying cluster structure as a guideline, thus reducing the overall number of constraints and human annotation effort for the same quality level. However, exploring the whole data space for finding meaningful constraints is also a nontrivial task for users.
To reduce human effort, several methods incorporate active learning into the feedback process, e.g., [17, 18, 32, 43]. At each iteration, the algorithm automatically chooses pairs of objects and queries users for their feedback in terms of must-link and cannot-link constraints instead of leaving the whole clustering results for users to examine. Though these techniques have proven very useful in real-world tasks such as document clustering [17], they suffer from very high runtime since they have to repeatedly perform clustering and explore all \(O(n^2)\) pairs of objects to generate queries to users each time.
In this work, we develop an efficient framework to cope with the above problems following the iterative active learning approach as in [17, 43]. However, instead of examining all pairs of objects, our technique, called Border, selects a small set of objects around cluster borders and queries users about the most uncertain pairs of objects. We also introduce a constraint inheritance approach based on the notion of \(\mu\)-nearest neighbors for inferring additional constraints, thus further boosting performance. Finally, we revisit our approach in the context of evolutionary clustering [6]. Evolutionary clustering aims to produce high-quality clusters while ensuring that the clustering does not change dramatically between consecutive timestamps. This scheme is very useful in many application scenarios. For example, doctors want to better describe trajectories of chronic diseases over time by identifying progressive aggregation of complications (comorbidities) [19]. It is crucial to identify and ideally to anticipate the occurrence of multimorbidity (i.e., the association of more than 2 chronic diseases). They may expect that existing groups do not change much over time if there are only minor changes in the data. However, the clustering process should be able to reflect the changes if there are significant differences in the new data. Therefore, we propose to formulate a temporal (longitudinal) smoothness constraint into our framework and add a time-fading factor to our constraint propagation.
1.1 Contributions

We introduce a new algorithm CVQE+ that extends CVQE [10] with weighted must-link and cannot-link constraints and a new object assignment scheme.

We propose a new algorithm, Border, that relies on active clustering and constraint inheritance to choose a small number of objects to solicit user feedback for. Besides the active selection scheme for pairs of objects, Border employs a constraint inheritance method for inferring more constraints, thus further enhancing the performance.

We present an evolutionary clustering framework which incorporates instance-level and temporal smoothness constraints for temporal data. To the best of our knowledge, our algorithm is the first framework that combines active learning, instance-level and temporal smoothness constraints.

Experiments are conducted on six real datasets to demonstrate the performance of our algorithms compared with state-of-the-art ones.
1.2 Extensions

We extend the data model by assigning each data object a patient identification (or process ID) as its owner. These IDs are used for finding overlapping patient clusters at different time frames.

We revise the historical cost of the temporal smoothness scheme by considering all overlapped clusters from previous snapshots.

We introduce a temporal evolution graph to capture the evolution of patient clusters at different visit times.
1.3 Outline
The rest of the paper is organized as follows. We formulate the problem in Sect. 2. Our framework is described in Sect. 3. Experiments are presented in Sect. 4. Section 5 discusses related work. In Sect. 6, we present an application of our algorithm to track the evolution of groups of patients with sleep disorder symptoms. Section 7 discusses limitations and future work, and Sect. 8 concludes the paper.
2 Problem Formulation
Let \(D = \{(d,t,pid(d))\}\) be a set of \(|D|\) vectors (objects) \(d \in \mathbb {R}^p\), each observed at time t and associated with the ID pid(d) of the process that generates d. Note that two objects u and v in D may be generated by the same process, i.e., \(pid(u) = pid(v)\).
Let \(S=\{(S_s,D_s,ts_{s},te_{s})\}\) be a set of \(|S|\) preselected data snapshots. Each \(S_s\) starts at time \(ts_{s}\), ends at time \(te_{s}\) and contains a set of objects \(D_s = \{(d,t,pid(d)) \in D \mid ts_{s} \le t < te_s \ \wedge \ \forall u \ne v \in D_s: pid(u) \ne pid(v) \}\). In other words, all objects in \(D_s\) must be generated by different processes during the time frame \([ts_s,te_s)\). Two snapshots \(S_s\) and \(S_{s+1}\) may overlap but must satisfy the time order, i.e., \(ts_{s} \le ts_{s+1}\) and \(te_{s} \le te_{s+1}\).
For each snapshot \(S_s\), let \(ML_s = \{(x,y,w_{xy}) \mid (x,y) \in D_s^2 \}\) and \(CL_s = \{(x,y,w_{xy}) \mid (x,y) \in D_s^2 \}\) be the sets of must-link and cannot-link constraints of \(S_s\), each with a degree of belief \(w_{xy} \in [0,1]\). \(ML_s\) and \(CL_s\) can be empty.
2.1 Algorithms
In this paper, we focus on the problem of grouping objects in all snapshots into clusters in an active interactive feedback scheme as described in Sect. 1. To summarize, our goals are to (1) reduce the number of constraints, thus reducing the constraint annotation cost, (2) make the algorithm scale well to large datasets, and (3) smooth the gap between the clustering results of two consecutive snapshots, i.e., ensure temporal smoothness. The technical details of our methods are described in Sect. 3.
2.2 Applications
We apply our algorithms to investigate the clinical groups of patients with Obstructive Sleep Apnea (OSA) based on their medical visit records over time. The detailed study is presented in Sect. 6.
3 Our Proposed Framework
Figure 1 illustrates our framework, which relies on two algorithms, Border and CVQE+. Our framework starts with a small (or empty) set of constraints in each snapshot. Then, it iteratively produces clustering results and receives refined constraints from users in the next iterations. This process is akin to feedback-driven algorithms for enhancing clustering quality and reducing human annotation effort [7]. However, instead of passively waiting for user feedback as in [7], our algorithm, Border, actively examines the current cluster structure, selects the \(\beta\) pairs of objects whose clustering labels are the least certain, and asks users for their feedback in terms of instance-level constraints. Examining all possible pairs of objects to select queries is time-consuming due to the quadratic number of candidates. To ensure scalability, Border limits its selection to a small set of the most promising objects. When there are new constraints, instead of reclustering from scratch as in [17, 43], our algorithm, CVQE+, incrementally updates the cluster structures to save computation time. We also aim to ensure a smooth transition between consecutive clusterings [6]. We additionally introduce two novel concepts: (1) the constraint inheritance scheme for automatically inferring more constraints inside each snapshot and (2) the constraint propagation scheme for propagating constraints between different snapshots. By automatically adding more constraints based on the annotated ones, these schemes significantly reduce the number of constraints that users must enter into the system to reach a desired level of clustering quality. To the best of our knowledge, Border is the first framework that combines active learning, instance-level and temporal smoothness constraints.
3.1 Constrained Clustering Algorithm
For each snapshot \(S_s\), we use constrained k-means for grouping objects. Generally, any existing technique such as MPCK-means [3], CVQE [10] or LCVQE [36] can be used. Here we introduce CVQE+, an extension of CVQE [10] that copes with weighted constraints, to do the task.
3.1.1 The New Algorithm CVQE+
3.1.2 Complexity Analysis
Let n be the number of objects, m the number of constraints, k the number of clusters, and r the number of iterations of the algorithm. CVQE+ has time complexity \(O(rkn + rk^2m^2)\), which is higher than the \(O(rkn + rk^2m)\) of CVQE because all related constraints must be examined while assigning a constraint. Since k and m are constants, CVQE+ thus has time complexity linear in the number of objects n. It also requires O(n) space for storing objects and constraints.
3.2 Active Constraint Selection
We introduce an active learning method called Border for selecting pairs of objects and querying users for constraint types. The general idea is to examine objects lying around the borders of clusters, since they are the most uncertain ones, and to repeatedly choose a block of \(\beta\) pairs of objects to query users about until the query budget \(\delta\) is reached. Here, \(\beta\) and \(\delta\) are predefined constants.
3.2.1 Active Learning with Border
We divide the \(m^2 = O(n)\) pairs of selected objects into two sets: the set X of inside-cluster pairs and the set Y of between-cluster pairs, i.e., \(label(x) = label(y)\) for all \((x,y) \in X\) and \(label(x) \ne label(y)\) for all \((x,y) \in Y\). Pairs \((x,y) \in X\) are sorted by \(val(x,y)=\frac{(x-y)^2 (1+sco(x))(1+sco(y))}{(1+ml(x)+cl(x))(1+ml(y)+cl(y))}\), and pairs \((x,y) \in Y\) by \(val(x,y)=\frac{(x-y)^2 (1+ml(x)+cl(x))(1+ml(y)+cl(y))}{(1+sco(x))(1+sco(y))}\). The larger val is, the more likely x and y are to belong to different clusters, and vice versa. Moreover, in Y, we tend to select pairs with more related constraints to strengthen the current clusters, while we try to separate clusters in X by considering pairs with fewer related constraints. We choose the top \(\beta /2\) non-overlapping pairs of X with the largest val and the top \(\beta /2\) non-overlapping pairs of Y with the smallest val in order to maximize the changes in clustering results (inside and between clusters). Concretely, once a pair (a, b) has been chosen, no other pair containing a or b is considered; this increases the diversity of the constraints, which can improve performance. If all pairs are excluded, we select the remainder randomly.
We show the \(\beta\) pairs to users to ask for the constraint type, add their feedback to the constraint set, and update the clusters; this is repeated until the total number of queries exceeds a predefined budget \(\delta\), as illustrated in Fig. 1.
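The selection step described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the difference term in val is taken to be the squared Euclidean distance, `sco`, `ml` and `cl` are assumed to be precomputed per-object values (a border score and the numbers of must-link and cannot-link constraints involving the object), `candidates` stands for the already selected border objects, and the random fallback for the case where all pairs are excluded is omitted. All function names are hypothetical.

```python
from itertools import combinations

def sq_dist(x, y):
    # Assumption: the difference term in val is the squared Euclidean distance.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def val(i, j, points, labels, sco, ml, cl):
    """Ranking value of the pair (i, j), following the two formulas in the text."""
    d = sq_dist(points[i], points[j])
    score_term = (1 + sco[i]) * (1 + sco[j])
    constraint_term = (1 + ml[i] + cl[i]) * (1 + ml[j] + cl[j])
    if labels[i] == labels[j]:                 # inside-cluster pair (set X)
        return d * score_term / constraint_term
    return d * constraint_term / score_term    # between-cluster pair (set Y)

def pick_disjoint(ranked, quota, used):
    """Take pairs in ranked order, skipping any pair that reuses a chosen object."""
    chosen = []
    for i, j in ranked:
        if len(chosen) == quota:
            break
        if i not in used and j not in used:
            chosen.append((i, j))
            used.update((i, j))
    return chosen

def select_queries(candidates, points, labels, sco, ml, cl, beta):
    """Return up to beta pairs: largest-val pairs of X, then smallest-val pairs of Y."""
    pairs = list(combinations(candidates, 2))
    key = lambda p: val(p[0], p[1], points, labels, sco, ml, cl)
    X = [p for p in pairs if labels[p[0]] == labels[p[1]]]
    Y = [p for p in pairs if labels[p[0]] != labels[p[1]]]
    used = set()  # shared, so that no object appears in two queried pairs
    top_x = pick_disjoint(sorted(X, key=key, reverse=True), beta // 2, used)
    top_y = pick_disjoint(sorted(Y, key=key), beta // 2, used)
    return top_x + top_y
```

Sharing the `used` set between the two halves is a design choice of this sketch; it enforces the non-overlap rule across X and Y as well as within each set.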
3.2.2 Constraint Inheritance in Border
To further reduce the number of queries to users, the general idea is to infer new constraints automatically from the annotated ones. Our inheritance scheme is based on the concept of \(\mu\)-nearest neighbors below.
3.2.3 Updating Clusters
At each iteration, instead of performing clustering again to produce the clustering result with the new set of constraints, we propose to update it incrementally to save runtime. To do so, we only need to take the old cluster centers and update them following Eq. 1 with the updated constraint set. The intuition behind this is that new constraints are more likely to change clusters locally. Hence, starting from the current state may make the algorithm converge faster, thus saving runtime. In Sect. 4, we show that this updating scheme achieves the same quality but converges much faster than reclustering from scratch.
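The warm-start idea can be illustrated with a minimal sketch. Since Eq. 1 is not reproduced in this text, the sketch uses the plain k-means center update as a stand-in; in CVQE+ the update would additionally account for the constraints. The function names are hypothetical.

```python
def assign(points, centers):
    """Index of the nearest center for every point (squared Euclidean distance)."""
    def nearest(p):
        return min(range(len(centers)),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
    return [nearest(p) for p in points]

def update_centers(points, labels, centers):
    """Plain k-means center update (stand-in for the constrained update of Eq. 1)."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(old)  # an empty cluster keeps its previous center
    return new_centers

def refine(points, centers, max_iter=100):
    """Iterate from the given centers until the assignment stops changing.

    Returns (centers, labels, number of assignment passes performed)."""
    labels = None
    for it in range(1, max_iter + 1):
        new_labels = assign(points, centers)
        if new_labels == labels:
            return centers, labels, it
        labels = new_labels
        centers = update_centers(points, labels, centers)
    return centers, labels, max_iter
```

Calling `refine` with the centers returned by a previous run corresponds to the incremental update, while calling it with fresh initial centers corresponds to reclustering from scratch.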
3.2.4 Complexity Analysis
Similarly to CVQE+, Border has O(n) time and space complexity at each iteration and thus has \(O(\delta n / \beta )\) time complexity overall, where \(\delta\) is the query budget and \(\beta\) is the number of pairs selected at each iteration, as described above.
3.3 Temporal Smoothness Constraints
The general idea of temporal smoothness [6] is that clusters not only have high quality in each snapshot but also do not change much between sequential time frames. It is useful in many applications where the transition between different snapshots is smoothed for consistency.
3.3.1 Temporal Smoothness
3.3.2 Constraint Propagation
3.4 Temporal Evolutionary Graph
We propose to build a graph \(G = (V,E)\), called the temporal evolutionary graph, to keep track of the relationships among clusters. In G, each node v is a cluster and each edge (u, v) represents the similarity between two clusters u and v that belong to two consecutive snapshots.
Figure 4 illustrates a temporal evolution graph G with three snapshots. We have \(sim(C_{11},C_{21}) = 4 / 5 = 0.8\). There is no edge between \(C_{12}\) and \(C_{21}\) since they share nothing, and \(C_{22}\) is the closest cluster to \(C_{12}\) (indicated by the red edge). By following the edges of G, we can keep track of the way clusters evolve over time. For example, from snapshot \(S_1\) to \(S_2\), object 4 has changed its membership to a different cluster. This scheme is useful for studying how cohorts (groups) of patients change over time and for identifying factors that cause these changes in our application scenario described in Sect. 6.
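A minimal sketch of how such a graph could be built is given below. The exact definition of sim is not reproduced in this text, so the sketch assumes the Jaccard similarity of the sets of process IDs in two clusters. The cluster contents in the usage note are hypothetical and are not the data of Fig. 4.

```python
def jaccard(a, b):
    """Assumed similarity: shared process IDs over all process IDs of both clusters."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_evolution_graph(snapshots):
    """snapshots: list of {cluster_name: set of process IDs}, in time order.

    Returns {(u, v): similarity} for clusters u and v of consecutive
    snapshots that share at least one process ID."""
    edges = {}
    for prev, nxt in zip(snapshots, snapshots[1:]):
        for u, members_u in prev.items():
            for v, members_v in nxt.items():
                s = jaccard(members_u, members_v)
                if s > 0:
                    edges[(u, v)] = s
    return edges

def closest_successor(edges, u):
    """Cluster of the next snapshot that is most similar to u, or None."""
    successors = {v: s for (a, v), s in edges.items() if a == u}
    return max(successors, key=successors.get) if successors else None
```

For instance, with hypothetical clusters `{"C11": {1, 2, 3, 4, 5}, "C12": {6, 7, 8}}` followed by `{"C21": {1, 2, 3, 5}, "C22": {4, 6, 7, 8}}`, the edge between C11 and C21 has weight 4/5, C12 and C21 are not connected, and C22 is the closest successor of C12.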
4 Experiments
Experiments are conducted on a workstation with a 4.0 GHz CPU and 32 GB RAM using Java. We use 6 datasets, Iris, Ecoli, Seeds, Libras, Optdigits, and Wdbc, acquired from the UCI archives.^{1} The numbers of clusters k are taken from the ground truths. Constraint queries are also simulated from the ground truths by adding a must-link if two objects have the same label or a cannot-link if they have different labels. We use Normalized Mutual Information (NMI) [33] for assessing the clustering quality. The NMI score lies in [0,1], where 1 means a perfect clustering result compared to the ground truth and 0 means that the clustering shares no information with it. All results are averaged over 10 runs.
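The simulated oracle and the quality measure can be sketched as follows. This is an illustrative sketch, not the code used in the experiments (which are run in Java). NMI has several normalization variants; the sketch assumes normalization by the geometric mean of the two entropies, which may differ from the variant used in [33]. The function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt

def oracle(truth, i, j):
    """Simulated user answer for the pair (i, j), derived from ground-truth labels."""
    return "must-link" if truth[i] == truth[j] else "cannot-link"

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def nmi(pred, truth):
    """Mutual information divided by sqrt(H(pred) * H(truth)) (assumed variant)."""
    n = len(truth)
    joint = Counter(zip(pred, truth))
    count_p, count_t = Counter(pred), Counter(truth)
    mi = sum(c / n * log(c * n / (count_p[p] * count_t[t]))
             for (p, t), c in joint.items())
    h_p, h_t = entropy(pred), entropy(truth)
    if h_p == 0 or h_t == 0:
        # Degenerate case: at least one labeling puts everything in one cluster.
        return 1.0 if h_p == h_t else 0.0
    return mi / sqrt(h_p * h_t)
```

Because NMI depends only on how the objects are grouped, a clustering that matches the ground truth up to a renaming of the labels still scores 1.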
4.1 Constrained Clustering
We study the performance of our constrained clustering method CVQE+ in comparison with k-means and state-of-the-art constrained clustering techniques such as MPCK-means [3], CVQE [10] and LCVQE [36].
4.1.1 Performance of CVQE+
4.1.2 Noise Robustness
4.1.3 Effect of Constraint Types
Figure 7 shows the performance of CVQE+ and its related techniques CVQE and LCVQE when the proportion of must-link constraints increases from 20% to 80% of the constraint sets. The clustering quality of CVQE+ and CVQE increases with the number of must-link constraints, while that of LCVQE decreases. This can be explained by the ways they calculate the constraint violation costs for the must-link and especially the cannot-link constraints. LCVQE treats violated cannot-link constraints more appropriately than CVQE and CVQE+. Thus, it copes well with a higher number of those constraints.
4.2 Active Constraint Selection
4.2.1 Active Constraint Selection
4.2.2 Runtime Comparison
4.2.3 Cluster Update
4.2.4 Effect of the Block Size \(\beta\)
4.2.5 Effect of the Constraint Inheritance Scheme
Figure 12 shows the effect of the parameter \(\mu\) on our algorithm Border via the inheritance scheme. Typically, performance increases with \(\mu\) until it reaches a peak and then decreases, as shown for the dataset Iris. This can be explained by the neighborhood influence scheme of Border: when \(\mu\) is too large, the number of wrongly inferred constraints increases, thus lowering the performance of Border. However, the peak value of \(\mu\) is dataset dependent and thus very hard to predict. Taking the dataset Optdigits as an example, the performance of Border still increases at \(\mu =5\), whereas on the dataset Seeds Border already starts to perform worse at \(\mu =3\). In our experiments, we observe that values of \(\mu\) around 2–4 work well overall for most datasets. Thus, we choose \(\mu =4\) as the default value.
4.3 Temporal Clustering
4.3.1 Clustering Quality
We divide the datasets into different snapshots and measure the clustering quality using the ground truths of the full datasets.
4.3.2 Temporal Smoothness
Figure 14 shows the overall smoothness of the clustering results of k-means, CVQE+, and Border with respect to different values of \(\alpha\) from 0 to 1 on the dataset Optdigits with three snapshots. The linear best-fit lines (dotted) indicate that when the value of \(\alpha\) increases, the overall smoothness among snapshots increases, i.e., the clustering results of two consecutive snapshots become more consistent. However, the trend is much clearer for k-means than for CVQE+ and especially Border. The reason is that in CVQE+ and Border we must trade off the historical costs against the constraint violation costs, instead of considering only the historical costs as in k-means. Thus, the historical aspect is more likely to be violated in CVQE+ and Border than in k-means. For Border, constraints are selected around the borders of clusters and are therefore more likely to be violated than the randomly selected ones of CVQE+. Thus, the historical part of Border is more affected. As a result, the clustering results of Border are less smooth than those of k-means and CVQE+.
5 Related Work
5.1 Constrained Clustering
Many constrained clustering algorithms have been proposed, such as MPCK-means [3], CVQE [10] and LCVQE [36]. Like our algorithm CVQE+, these techniques optimize an objective function consisting of the clustering quality and the constraint violation cost. CVQE+ is an extension of CVQE [10], where we extend the cost model to deal with weighted constraints, make the must-link violation cost symmetric and change the way each constraint is assigned to clusters by considering all of its related constraints. This makes cluster assignment more stable, thus enhancing the clustering quality. Interested readers are referred to [9] for a comprehensive survey on constrained clustering methods.
5.2 Active Learning
Active learning [37] is widely used in many different fields such as data clustering and pattern recognition [26, 28, 29, 30, 31, 37, 38, 40, 42, 45].
For constrained clustering, most existing techniques employ active learning for acquiring a desired constraint set before or during clustering. In [2], the authors introduce the Explore-Consolidate algorithm to select constraints by exploiting the connected components of must-link ones. Min–max [32] extends the consolidation phase of [2] by querying the most uncertain objects rather than randomly selecting them. These techniques produce constraint sets before clustering. Thus, they cannot exploit the cluster labels for further enhancing performance. Huang et al. [17] introduce a framework that iteratively generates constraints and updates clustering results until a query budget is reached. However, it is limited to a probabilistic document clustering algorithm. NPU [43] also uses connected components of must-link constraints as a guideline for finding the most uncertain objects. Constraints are then collected by querying these objects against existing connected components, like the Consolidate phase of [2]. Though more effective than preselection techniques, these methods typically have a quadratic runtime, which, unlike Border, makes them unable to cope with large datasets. Moreover, Border relies on border objects around clusters to build constraints rather than on must-link graphs [2, 43]. The inheritance approach is closely related to the constraint propagation in the multi-view clustering algorithm [13, 14] for transferring constraints among different views. The major difference is that we use the \(\mu\)-nearest neighbors rather than the \(\epsilon\)-neighborhoods, which are limited to Gaussian clusters and can lead to an excessive number of constraints.
5.3 Temporal Clustering
Temporal smoothness has been introduced in the evolutionary clustering framework [6] for making clustering results stable with respect to time. We significantly extend this framework by incorporating instance-level constraints, active query selection and constraint propagation for further improving clustering quality while minimizing constraint annotation effort.
6 Application
Obstructive Sleep Apnea (OSA) is a major sleep disorder caused by the repetitive collapse of the upper airway during sleep. OSA is associated with many health problems such as cardiovascular and metabolic diseases [22], including diabetes [35], coronary heart disease [16] and cancer [5], and ultimately with an increased risk of mortality [5]. It is also known to be a heterogeneous disease, with different symptoms and comorbidities for patients exhibiting the same level of OSA severity [21]. Thus, recent studies aim to better allocate patients into well-defined subgroups (i.e., phenotypes) based on clinical information such as symptoms, comorbidities, and demographics using clustering methods [20, 39, 41]. This can help to improve clinical management and to define personalized treatments at the time of diagnosis. For example, in [44], a Latent Class Analysis (LCA) is used to identify groups of patients in the Icelandic Sleep Apnea Cohort, which includes 822 patients with moderate-to-severe OSA. A similar strategy is employed for 922 patients recruited from the Sleep Apnea Global Interdisciplinary Consortium (SAGIC) [21]. Hierarchical clustering is employed in [1] for analyzing diagnosis data of 18,263 OSA patients. In [23], a relationship graph is built upon 198 patients collected from the Sleep Centre at the University of Foggia from 2012 to 2014, and patients are grouped using community detection algorithms. None of these approaches, however, incorporates domain knowledge into the clustering process to improve the clustering quality. Moreover, they are only able to process static data, e.g., questionnaires [44] or data from a single diagnosis visit [1]. Patients' responses to treatments and associated complications, in contrast, are not static and change during clinical courses. Tracking these changes will provide more insights into disease progression and prognosis [1], yet no existing technique is specifically designed to capture the evolution of patient cohorts over time.
In this section, we apply the algorithm Border to group patients into clinically meaningful clusters as well as to track how these clusters evolve over time. To the best of our knowledge, this is the first attempt to incorporate domain knowledge and track cohort evolution for analyzing OSA syndromes.
6.1 OSFP Data
Our OSFP dataset is acquired from the French national registry of sleep apnea (OSFP).^{2} It consists of longitudinal clinical information on many patients suffering from OSA, collected from private practices, general hospitals, and university hospitals in France. At each medical visit, data such as demographic characteristics, comorbidities, and OSA symptoms are recorded, as well as some environmental risk factors such as smoking, alcohol consumption, and sedentary lifestyle.
6.2 Snapshots
In our OSFP data, each patient's data consist of information recorded at different hospital visits, as demonstrated in Fig. 15. For example, patients \(p_1\), \(p_2\), and \(p_3\) have 5 visits each. However, the visit times of patients vary significantly. For example, all \(p_1\) visits are from 2009 to 2012, while all \(p_2\) visits are from 2013 to 2016. Thus, it would not be reasonable to create snapshots using the exact visit times, since patients with the same OSA symptoms and severities may be referred at different times and at different follow-up points. Hence, our medical expert suggested using relative visit times. Concretely, we treat the first visit of a patient as time 0 (time of diagnosis) and express each subsequent visit as the time difference in months to the first one. In this way, we can capture the evolution of the disease. Following the relative visit times, we create different snapshots after a specific time frame of visits. For example, in Fig. 15, we use 4 different snapshots after 0, 12, 24, and 36 months. Note that if a patient has several visits in a specific snapshot, we use the last visit to represent the patient's status in that snapshot.
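The snapshot construction can be sketched as follows. This is a sketch under stated assumptions: visits are given as hypothetical `(pid, visit_date, record)` tuples, relative time is measured in whole calendar months (days are ignored), and each snapshot is a half-open window \([ts_s, te_s)\) of relative months as in Sect. 2. The concrete window boundaries used for the OSFP data are not specified here, so they are passed in as a parameter.

```python
from datetime import date

def months_between(first, later):
    """Whole calendar months from `first` to `later` (days are ignored)."""
    return (later.year - first.year) * 12 + (later.month - first.month)

def build_snapshots(visits, windows):
    """visits: iterable of (pid, visit_date, record) tuples.
    windows: list of (ts, te) pairs in relative months, half-open [ts, te).

    Returns one {pid: record} dict per window, keeping for every patient
    the record of the last visit that falls inside the window."""
    visits = list(visits)
    first_visit = {}
    for pid, day, _ in visits:
        if pid not in first_visit or day < first_visit[pid]:
            first_visit[pid] = day          # time 0 of this patient
    snapshots = [{} for _ in windows]
    latest = [{} for _ in windows]          # date of the visit currently kept
    for pid, day, record in visits:
        rel = months_between(first_visit[pid], day)
        for s, (ts, te) in enumerate(windows):
            if ts <= rel < te and (pid not in latest[s] or day > latest[s][pid]):
                latest[s][pid] = day
                snapshots[s][pid] = record
    return snapshots
```

Keeping a single record per patient and window ensures that all objects of a snapshot come from different processes, as required by the definition of \(D_s\).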
6.3 Clusters at Snapshot 1 (Time 0)

Group A Youngest, very few comorbidities, and highest OSA severity (Cluster 6 in Fig. 16 with 4535 patients): this cluster is the middle group in terms of BMI (average BMI = 32.03) and the youngest group (average age = 49.99). Patients in this group consume less alcohol than those from other groups but smoke the most. They are also among the groups with the lowest numbers of comorbidities. However, they suffer from the highest numbers of OSA symptoms. For example, 25.7% of the people in this group have high blood pressure, while 83.5% and 96.9% of them experience morning headache and fatigue, respectively. They also have the highest functional scales, with median Epworth, Pichot and Depression scales of 12.09, 16.1, and 4.92, respectively.

Group B Oldest, poorest life style, highest comorbidities, and medium OSA severity (Cluster 2 in Fig. 16 with 1498 patients): this cluster consists of the oldest people (median age = 68.08), who consume more alcohol than other groups and have the poorest life style. They have an average BMI of 32.5 and the highest number of comorbidities, e.g., 83.85% have high blood pressure. However, they show average OSA symptoms compared to other clusters.

Group C Average age, best life style, few comorbidities, and lowest OSA severity (Cluster 5 in Fig. 16 with 3063 patients): this group has an average age of 59.48 and is among the lowest BMI groups (average BMI = 31.50). They have a particularly good life style and consume less alcohol than others. They have few comorbidities and also the lowest numbers of OSA symptoms. Their functional scales are the lowest, with median Epworth, Pichot, and Depression scales of 7.61, 7.16 and 2.35, respectively.

Group D Average age, poor life style, high comorbidities, and high OSA severity (Cluster 4 in Fig. 16 with 3809 patients): people in this group have a median age of 61.54, poor life styles, medium comorbidities, and suffer from high OSA severity. They also form the most obese group, with an average BMI of 33.10. Moreover, their medians of AHI (44.7 events/h) and ODI (36.03 events/h) are the highest among groups.

Group E Second oldest group, few comorbidities, and medium OSA severity (Cluster 3 in Fig. 16 with 3925 patients): the median age and BMI of this group are 65.47 and 32.10, respectively. The numbers of comorbidities are high, and patients have medium OSA severity.

Group F Second youngest group, very few comorbidities, and medium OSA severity (Cluster 1 in Fig. 16 with 5738 patients): the average AHI and ODI of this group are 40.98 and 30.68 events/h, respectively, and are the lowest of all groups. The average age is 53.82. This group is the least obese one, with an average BMI of 31.35. It has few comorbidities and medium OSA symptoms.
6.4 Clusters at Other Snapshots
6.5 Evolution Graph
6.6 Tracking the Group Changes
The graph G can be used to track the group changes of a set of patients. For example, we can see that there is a set P of 1415 patients that moves from group A (Cluster 6 in Fig. 16 with 4535 patients) in Snapshot 1 to group C (Cluster 5 in Fig. 17 with 9459 patients) in Snapshot 2. The question is how this happens.
6.7 How Does a Specific Group of Patients Evolve Over Time?
6.8 Tracking Patients with a Specific Path
From the graph G, we can easily track the information of a set of patients whose diseases evolve along a specific path. Figure 21 shows the set P of 48 patients whose OSA severity changes from very low to high following the path \(C_{15}\), \(C_{25}\), \(C_{36}\), \(C_{45}\), and \(C_{54}\) in Fig. 18. As we can see, from \(S_1\) to \(S_5\), the average number of symptoms per patient increases considerably from 2.29 to 3.16, while the average number of comorbidities remains stable. Some specific symptoms such as morning fatigue, nocturia, and snoring increase over time. The changes in AHI and ODI are not clear from \(S_2\) to \(S_5\), but both remain over 15, indicating a medium to severe OSA level. The rates of oxygen therapy, RLS drug treatments, and ventilation are also higher than in other groups, while CPAP and lifestyle treatments are used less. Thus, increasing CPAP and lifestyle treatments for this group may be helpful.
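Selecting the patients that follow a given path amounts to intersecting the member sets of the clusters on that path. A minimal sketch, assuming a hypothetical `membership` mapping from cluster names to sets of patient IDs:

```python
def patients_on_path(membership, path):
    """membership: {cluster_name: set of patient IDs}.
    path: cluster names, one per consecutive snapshot.

    Returns the patients that belong to every cluster on the path."""
    if not path:
        return set()
    result = set(membership[path[0]])
    for name in path[1:]:
        result &= membership[name]
    return result
```

With a path of length two, the same function gives the set of patients that move from one group to another between two consecutive snapshots.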
7 Discussion
In Sect. 6, we showed how our algorithm Border can be used for finding clinically meaningful groups of patients and tracking their evolution. Currently, our study is done on the largest and most general collection of patients in the field, with a wide range of attributes. However, there are still some limitations and potential directions for future work. First, while we use the whole dataset for studying the heterogeneity of OSA patients, examining more specific patient cohorts may help to reveal more interesting information, e.g., patients with oxygen therapy or patients with hypertension and nocturia [12]. Second, other environmental factors such as air pollution are known to be related to OSA, especially for children [24]. Studying how the disease evolves over time with respect to the air pollution level at specific locations will be a very interesting direction to pursue.
8 Conclusion
We introduce a novel scalable framework which incorporates an iterative active learning scheme together with instance-level and temporal smoothness constraints for coping with large temporal data. Experiments show that our constrained clustering algorithm, CVQE+, performs better than existing techniques such as CVQE [10], LCVQE [36] and MPCkmeans [2]. By exploring border objects and propagating constraints via nearest neighbors, our active learning algorithm, Border, achieves good clustering results with much smaller constraint sets compared to other methods such as NPU [43] and min–max [32]. Moreover, it is orders of magnitude faster, making it possible to cope with large datasets. Finally, we revisit our approach in the context of evolutionary clustering, adding a temporal smoothness constraint and a time-fading factor to our constraint propagation among different data snapshots. Our future work aims at providing more expressive support for user feedback as well as improving the performance of CVQE+ on noisy constraints. We are currently using our framework to track the group evolution of our patient data with sleeping disorder symptoms.
Acknowledgements
This work is supported by the French National Research Agency in the framework of the “Investissements d’avenir” program (ANR-15-IDEX-02).
References
1. Bailly S, Destors M, Grillet Y, Richard P, Stach B, Vivodtzev I, Timsit JF, Lévy P, Tamisier R, Pépin JL, Scientific Council, Investigators of the French National Sleep Apnea Registry (OSFP) (2016) Obstructive sleep apnea: a cluster analysis at time of diagnosis. PLOS ONE 11(6):1–12
2. Basu S, Banerjee A, Mooney RJ (2004) Active semi-supervision for pairwise constrained clustering. In: SDM, pp 333–344
3. Bilenko M, Basu S, Mooney RJ (2004) Integrating constraints and metric learning in semi-supervised clustering. In: ICML
4. Birgé L, Rozenholc Y (2006) How many bins should be put in a regular histogram. ESAIM Probab Stat 10:24–45. https://doi.org/10.1051/ps:2006001
5. Campos-Rodriguez F, Martinez-Garcia MA, Martinez M, Duran-Cantolla J, Pea MDL, Masdeu MJ, Gonzalez M, Campo FD, Gallego I, Marin JM, Barbe F, Montserrat JM, Farre RA (2013) Association between obstructive sleep apnea and cancer incidence in a large multicenter Spanish cohort. Am J Respir Crit Care Med 187(1):99–105
6. Chakrabarti D, Kumar R, Tomkins A (2006) Evolutionary clustering. In: SIGKDD, pp 554–560
7. Cohn D, Caruana R, McCallum A (2003) Semi-supervised clustering with user feedback. Technical report
8. Davidson I (2012) Two approaches to understanding when constraints help clustering. In: KDD, pp 1312–1320
9. Davidson I, Basu S (2007) A survey of clustering with instance level constraints. TKDD
10. Davidson I, Ravi SS (2005) Clustering with constraints: feasibility issues and the \(k\)-means algorithm. In: SDM, pp 138–149
11. Davidson I, Ravi SS, Ester M (2007) Efficient incremental constrained clustering. In: KDD, pp 240–249
12. Destors M, Tamisier R, Sapene M, Grillet Y, Baguet JP, Richard P, Girey-Rannaud J, Dias-Domingos S, Martin F, Stach B, Housset B, Levy P, Pepin JL (2014) Nocturia is an independent predictive factor of prevalent hypertension in obstructive sleep apnea patients. Eur Respir J 44(Suppl 58):P1744
13. Eaton E, desJardins M, Jacob S (2010) Multi-view clustering with constraint propagation for learning with an incomplete mapping between views. In: CIKM, pp 389–398
14. Eaton E, desJardins M, Jacob S (2014) Multi-view constrained clustering with an incomplete mapping between views. Knowl Inf Syst 38(1):231–257
15. Han J (2005) Data mining: concepts and techniques. Morgan Kaufmann Publishers Inc., San Francisco
16. Hla KM, Young T, Hagen EW, Stein JH, Finn LA, Nieto FJ, Peppard PE (2015) Coronary heart disease incidence in sleep disordered breathing: the Wisconsin sleep cohort study. Sleep 38(5):677–684
17. Huang R, Lam W (2007) Semi-supervised document clustering via active learning with pairwise constraints. In: ICDM, pp 517–522
18. Huang Y, Mitchell TM (2006) Text clustering with extended user feedback. In: SIGIR, pp 413–420
19. Jensen A, Moseley P, Oprea T, Ellese S, Eriksson R, Schmock H, Jensen P, Jensen L, Brunak S (2014) Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nat Commun 5:4022
20. Joosten SA, Hamza K, Sands S, Turton A, Berger P, Hamilton GS (2011) Phenotypes of patients with mild to moderate obstructive sleep apnoea as confirmed by cluster analysis. Respirology 17(1):99–107
21. Keenan BT, Kim J, Singh B, Bittencourt L, Chen NH, Cistulli PA, Magalang UJ, McArdle N, Mindel JW, Benediktsdottir B, Arnardottir ES, Prochnow LK, Penzel T, Sanner B, Schwab RJ, Shin C, Sutherland K, Tufik S, Maislin G, Gislason T, Pack AI (2018) Recognizable clinical subtypes of obstructive sleep apnea across international sleep centers: a cluster analysis. Sleep 41(3):zsx214
22. Kendzerska T, Gershon AS, Hawker G, Leung RS, Tomlinson G (2014) Obstructive sleep apnea and risk of cardiovascular events and all-cause mortality: a decade-long historical cohort study. PLOS Med 11(2):1–15
23. Lacedonia D, Carpagnano GE, Sabato R, Storto MMl, Palmiotti GA, Capozzi V, Barbaro MPF, Gallo C (2016) Characterization of obstructive sleep apnea-hypopnea syndrome (OSA) population by means of cluster analysis. J Sleep Res 25(6):724–730
24. Lawrence WR, Yang M, Zhang C, Liu RQ, Lin S, Wang SQ, Liu Y, Ma H, Chen DH, Zeng XW, Yang BY, Hu LW, Yim SHL, Dong GH (2018) Association between long-term exposure to air pollution and sleep disorder in Chinese children: the Seven Northeastern Cities study. Sleep 41:zsy122
25. Lévy P, Kohler M, McNicholas WT, Barbé F, McEvoy RD, Somers VK et al. (2015) Obstructive sleep apnoea syndrome. Nat Rev Dis Primers 1:15015
26. Mai ST, Amer-Yahia S, Chouakria AD (2018) Scalable active temporal constrained clustering. In: EDBT, pp 449–452
27. Mai ST, Amer-Yahia S, Chouakria AD, Nguyen KT, Nguyen A (2018) Scalable active constrained clustering for temporal data. In: DASFAA, pp 566–582
28. Mai ST, Assent I, Jacobsen J, Dieu MS (2018) Anytime parallel density-based clustering. Data Min Knowl Discov 32(4):1121–1176
29. Mai ST, Assent I, Storgaard M (2016) AnyDBC: an efficient anytime density-based clustering algorithm for very large complex datasets. In: SIGKDD, pp 1025–1034
30. Mai ST, Dieu MS, Assent I, Jacobsen J, Kristensen J, Birk M (2017) Scalable and interactive graph clustering algorithm on multicore CPUs. In: IEEE international conference on data engineering (ICDE), pp 349–360
31. Mai ST, He X, Hubig N, Plant C, Böhm C (2013) Active density-based clustering. In: ICDM, pp 508–517
32. Mallapragada PK, Jin R, Jain AK (2008) Active query selection for semi-supervised clustering. In: ICPR, pp 1–4
33. Nguyen XV, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: ICML, pp 1073–1080
34. Nieto FJ, Peppard PE, Young T, Finn L, Hla KM, Farré R (2012) Sleep-disordered breathing and cancer mortality. Am J Respir Crit Care Med 186(2):190–194
35. Pamidi S, Tasali E (2012) Obstructive sleep apnea and type 2 diabetes: is there a link? Front Neurol 3:126
36. Pelleg D, Baras D (2007) K-means with large and noisy constraint sets. In: ECML, pp 674–682
37. Settles B (2010) Active learning literature survey. Technical report 1648, University of Wisconsin–Madison
38. Son MT, Amer-Yahia S, Assent I, Birk M, Dieu MS, Jacobsen J, Kristensen J (2018) Scalable interactive dynamic graph clustering on multicore CPUs. IEEE Trans Knowl Data Eng (TKDE) (to appear)
39. Tsuchiya M, Lowe AA, Pae EK, Fleetham JA (1992) Obstructive sleep apnea subtypes by cluster analysis. Am J Orthod Dentofac Orthop 101(6):533–542
40. Tuia D, Muñoz-Marí J, Camps-Valls G (2012) Remote sensing image segmentation by active queries. Pattern Recognit 45(6):2180–2192
41. Vavougios GD, Natsios G, Pastaka C, Zarogiannis SG, Gourgoulianis KI (2016) Phenotypes of comorbidity in OSAS patients: combining categorical principal component analysis with cluster analysis. J Sleep Res 25(1):31–38
42. Voevodski K, Balcan MF, Röglin H, Teng SH, Xia Y (2012) Active clustering of biological sequences. J Mach Learn Res 13:203–225
43. Xiong S, Azimi J, Fern XZ (2014) Active learning of constraints for semi-supervised clustering. IEEE Trans Knowl Data Eng 26(1):43–54
44. Ye L, Pien GW, Ratcliffe SJ, Björnsdottir E, Arnardottir ES, Pack AI, Benediktsdottir B, Gislason T (2014) The different clinical faces of obstructive sleep apnoea: a cluster analysis. Eur Respir J 44(6):1600–1607
45. Zhao W, He Q, Ma H, Shi Z (2012) Effective semi-supervised document clustering via active learning with instance-level constraints. Knowl Inf Syst 30(3):569–587
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.