Online updating of active function cross-entropy clustering
Abstract
Gaussian mixture models have many applications in density estimation and data clustering. However, the model does not adapt well to curved and strongly nonlinear data, since many Gaussian components are typically needed to appropriately fit data that lie around a nonlinear manifold. To solve this problem, the active function cross-entropy clustering (afCEC) method was constructed. In this article, we present an online afCEC algorithm. Thanks to this modification, we obtain a method which removes unnecessary clusters very quickly and, consequently, has lower computational complexity. Moreover, we obtain a better minimum (with a lower value of the cost function). The modification also allows the processing of data streams.
Keywords
Clustering · Active function cross-entropy clustering · Gaussian mixture models · Data streams

1 Introduction
In [39], the authors constructed the afCEC (active function cross-entropy clustering) algorithm, which allows the clustering of data on submanifolds of \({\mathbb {R}}^d\). The motivation comes from the observation that it is often profitable to describe nonlinear data by a smaller number of components with more complicated, curved shapes in order to obtain a better fit of the data, see Fig. 1b. The afCEC method automatically reduces unnecessary clusters and accommodates nonlinear structures.
In this paper, the online version of the afCEC^{1} algorithm using Hartigan’s approach is presented. When a new point appears, we are able to update the parameters of all clusters without recomputing all variables. Because we have to approximate complicated structures in each step, we need a numerically efficient model. Therefore, we have chosen an approach that allows the use of an explicit formula in each step.
Thanks to this modification, unnecessary clusters are removed efficiently [40], usually in the first three or four iterations. In consequence, fewer steps are needed in each iteration to find a local minimum. Moreover, Hartigan’s method finds essentially better minima (with lower cost function values). In Fig. 2, we present the convergence of Hartigan’s afCEC with the initial number of clusters \(k=10\), which is reduced to \(k=5\).
The modification also allows the processing of data streams [33], in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). Such algorithms have limited memory available (much less than the input size) and limited processing time per item. The intrinsic nature of stream data requires algorithms capable of fast, incremental processing of data objects. Therefore, the Hartigan version of the afCEC algorithm can be applied to data stream clustering.
The paper is organized as follows. In the next section, related work is presented. In Sect. 3, we introduce the afCEC algorithm. In Sect. 4, we present the Hartigan modification of the method; in particular, we discuss how to update parameters online. In the last section, a comparison between our approach and classical algorithms is made.
2 Related works
Clustering is the classical problem of dividing a dataset \(X \subset {\mathbb {R}}^N\) into a collection of disjoint groups \(X_1, \ldots, X_k\). Several of the most popular clustering methods are based on the k-means approach [1]. In the context of this algorithm, two basic heuristics for minimizing the cost function were introduced: Lloyd’s and Hartigan’s. Both have become standards in general clustering theory.
The first heuristic for k-means (or general clustering methods) is Lloyd’s approach: given some initial clustering, we assign each point to the closest cluster [9, 25, 26]. This scheme is intuitive, and empirical support is favorable: the technique generally seems to find a good solution in a small number of iterations. The alternative heuristic was presented by Hartigan [14, 15]: repeatedly pick a point and determine its optimal cluster assignment. The obvious distinction from Lloyd’s is that the algorithm proceeds point by point. A comparison of the methods is presented in [41]. Roughly speaking, in the context of k-means, Hartigan’s approach converges to the minimum faster and generally finds better minima of the cost function. On the other hand, Lloyd’s approach is more resistant to outliers.
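The point-by-point distinction can be made concrete with a small sketch of one Hartigan sweep for squared-Euclidean k-means. It uses the exact cost change of moving a single point (including the \(n/(n\pm 1)\) correction terms absent from Lloyd’s nearest-mean rule); function and variable names are illustrative, not the paper’s implementation.

```python
import numpy as np

def hartigan_step(X, labels, k):
    """One sweep of Hartigan's heuristic for k-means: visit each point and
    move it to the cluster whose total within-cluster cost decreases the most.
    Returns True if any point changed its assignment."""
    changed = False
    for i, x in enumerate(X):
        best_j, best_delta = labels[i], 0.0
        for j in range(k):
            if j == labels[i]:
                continue
            src = X[labels == labels[i]]   # current cluster of x (contains x)
            dst = X[labels == j]           # candidate cluster
            if len(src) <= 1 or len(dst) == 0:
                continue  # singleton clusters are handled by removal, not moves
            n_s, n_d = len(src), len(dst)
            # Exact cost change of the switch: gain of adding x to dst minus
            # the saving of removing x from src (Hartigan's n/(n±1) factors).
            delta = (n_d / (n_d + 1)) * np.sum((x - dst.mean(0)) ** 2) \
                  - (n_s / (n_s - 1)) * np.sum((x - src.mean(0)) ** 2)
            if delta < best_delta:
                best_j, best_delta = j, delta
        if best_j != labels[i]:
            labels[i] = best_j
            changed = True
    return changed
```

Repeating `hartigan_step` until it returns `False` yields a local minimum; each individual move is guaranteed not to increase the cost.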
The basic drawback of the k-means algorithm was addressed by density-based techniques, which use the expectation maximization (EM) method [27]. The Gaussian mixture model (GMM) is probably the most popular [28, 29]. Thanks to this approach, clusters can be described by more general shapes, such as ellipses.
The cross-entropy clustering (CEC) approach [40] joins the clustering advantages of k-means and EM. It turns out that CEC inherits the speed and scalability of k-means, while retaining EM’s ability to use mixture models. CEC allows an automatic reduction in “unnecessary” clusters since, contrary to classical k-means and EM, there is a cost of using each cluster. One of the most important properties of CEC, in relation to GMM, is that, as with k-means, we can use Hartigan’s approach.
Since data typically lie around curved structures (the manifold hypothesis), algorithms which can approximate curves or manifolds are important. Principal curves and principal surfaces [16, 18, 22] have been defined as self-consistent smooth curves (or surfaces) which pass through the middle of a d-dimensional probability distribution or data cloud. They give a summary of the data and also serve as an efficient feature-extraction tool.
Another method that attempts to solve the problem of fitting nonlinear manifolds is that of self-organizing maps (SOM) [20], or self-organizing feature maps (SOFM) [19]. These methods are types of artificial neural networks which are trained using unsupervised learning to produce a low-dimensional (typically two-dimensional) discretized representation of the input space of the training samples, called a map.
Kernel methods provide a powerful way of capturing nonlinear relations. One of the most common, kernel PCA (KPCA) [32], is a nonlinear version of principal component analysis (PCA) [17] that gives an explicit low-dimensional space such that the data variance in the feature space is preserved as much as possible.
The above approaches focus on finding only a single complex manifold. In general, they do not focus on clustering, and it is difficult to use them for clustering problems directly. Kernel methods and self-organizing maps can, however, be used as preprocessing for classical clustering methods; in this way, spectral clustering methods were constructed [24]. The classical kernel k-means [24] is equivalent to performing KPCA prior to the conventional k-means algorithm. Spectral clustering is a large family of grouping methods which partition data using eigenvectors of an affinity matrix derived from the data [7, 43, 44, 45, 48].
The active curve axis Gaussian mixture model (AcaGMM) [47] is an adaptation of the Gaussian mixture model which uses a nonlinear, curved Gaussian probability model in clustering. AcaGMM works well in practice; however, it has major limitations. First of all, the AcaGMM cost function does not necessarily decrease with iterations, which causes problems with the stop condition, see [39]. Since the method uses orthogonal projections and arc lengths, it is very hard to use AcaGMM for more complicated curves in higher-dimensional spaces.
The active function cross-entropy clustering [39] (afCEC) method (see Fig. 1b), which is based on the cross-entropy clustering (CEC) model, solves all the above limitations. The method has a few advantages in relation to AcaGMM: it enables easy adaptation to clustering of complicated datasets along a predefined family of functions and does not need external methods to determine the number of clusters, as it automatically reduces the number of groups.
In practice, afCEC gives essentially better results than linear models like GMM or CEC, since we obtain a similar level of the log-likelihood function by using a smaller number of parameters to describe the model. On the other hand, the results are similar to those of AcaGMM when we restrict the data to two dimensions and use the quadratic function as the baseline. For a more detailed comparison between the methods, see [39].
None of the above approaches has a Hartigan version. In this article, we present an online afCEC algorithm. In the case of Lloyd’s approach, the authors use a regression method in each step. In this paper, we present how to apply Hartigan’s heuristic to minimize the afCEC cost function. Thanks to this modification, we obtain a method which removes unnecessary clusters very quickly and, consequently, has lower computational complexity. Moreover, we obtain a better minimum (with a lower value of the cost function).
3 AfCEC algorithm
Definition 1
Optimization Problem 3.1
If \({\mathcal {F}}\) is a set of functions which is invariant under the operations \(f \rightarrow a+f\) for any a, we have the following theorem.
Theorem 1
CEC allows an automatic reduction in “unnecessary” clusters since, contrary to classical k-means and EM, there is a cost of using each cluster. (A step-by-step view of this process is shown in Fig. 2.) There are also several probabilistic approaches which try to estimate the correct number of clusters. For example, [11] uses a generalized distance between Gaussian mixture models with different numbers of components based on the Kullback–Leibler divergence, see [6, 21]. A similar idea is presented in [46] (competitive expectation maximization), which uses the minimum message length criterion of [8]. In practice, MDLP can also be used directly in clustering, see [42]. However, most of the above-mentioned methods typically proceed through all the consecutive cluster counts and do not reduce the number of clusters online during the clustering process.
The classical afCEC algorithm presented in [39] uses Lloyd’s method. The alternative heuristic was presented by Hartigan [14, 15]: repeatedly pick a point and determine its optimal (from the cost function’s point of view) cluster assignment. Observe that in the crucial step of Hartigan’s approach we compare the cross-entropy after and before a switch, where the switch removes a given point from one cluster and adds it to another. It means that to apply Hartigan’s approach efficiently in clustering, it is essential to update parameters (7) when we add a point to a cluster and downdate parameters (7) when we delete a point from a group. In the next section, we present how we can update and downdate all parameters of afCEC online.
4 Updating the value of the cost function
(a) The update procedure:
$$\begin{aligned} \begin{array}{l} {\mathrm {mean}}\left( X_{\hat{d}} \cup \left\{ {x}_{\hat{d}}\right\} \right) = p_{1}\, {\mathrm {mean}}\left( X_{\hat{d}}\right) + p_{2}\,{x}_{\hat{d}}, \\ {\mathrm {cov}}\left( X_{\hat{d}} \cup \left\{ {x}_{\hat{d}}\right\} \right) = p_{1}\,{\mathrm {cov}}\left( X_{\hat{d}}\right) + p_{1}p_{2}\left( {\mathrm {mean}}\left( X_{\hat{d}}\right) - {x}_{\hat{d}}\right) \left( {\mathrm {mean}}\left( X_{\hat{d}}\right) - {x}_{\hat{d}}\right) ^T, \end{array} \end{aligned}$$
where \(p_1=\frac{|X|}{|X| + 1}\), \(p_2=\frac{1}{|X|+1}\), and \({x}\notin X\).
(b) The downdate procedure:
$$\begin{aligned} \begin{array}{l} {\mathrm {mean}}\left( X_{\hat{d}} \setminus \left\{ {x}_{\hat{d}}\right\} \right) = q_1\, {\mathrm {mean}}\left( X_{\hat{d}}\right) - q_2\,{x}_{\hat{d}}, \\ {\mathrm {cov}}\left( X_{\hat{d}} \setminus \left\{ {x}_{\hat{d}}\right\} \right) = q_1\,{\mathrm {cov}}\left( X_{\hat{d}}\right) - q_1 q_2 \left( {\mathrm {mean}}\left( X_{\hat{d}}\right) - {x}_{\hat{d}}\right) \left( {\mathrm {mean}}\left( X_{\hat{d}}\right) - {x}_{\hat{d}}\right) ^T, \end{array} \end{aligned}$$
where \(q_1=\frac{|X|}{|X|-1}\), \(q_2=\frac{1}{|X|-1}\), and \({x}\in X\).
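The update and downdate formulas translate directly into code. The sketch below assumes the biased (maximum-likelihood) covariance estimator, under which both identities are exact; each operation costs O(d²) and never touches the stored points.

```python
import numpy as np

def update_mean_cov(mean, cov, n, x):
    """Add point x to a cluster of n points (the update procedure):
    mean and biased covariance are adjusted in O(d^2),
    with no pass over the data."""
    p1, p2 = n / (n + 1), 1 / (n + 1)
    d = mean - x
    return p1 * mean + p2 * x, p1 * cov + p1 * p2 * np.outer(d, d), n + 1

def downdate_mean_cov(mean, cov, n, x):
    """Remove point x from a cluster of n points (the downdate procedure):
    the exact inverse of update_mean_cov."""
    q1, q2 = n / (n - 1), 1 / (n - 1)
    d = mean - x
    return q1 * mean - q2 * x, q1 * cov - q1 * q2 * np.outer(d, d), n - 1
```

Applying `update_mean_cov` and then `downdate_mean_cov` with the same point returns the original parameters, which is exactly what a Hartigan switch between two clusters requires.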
Theorem 2
Theorem 3
(a) The update procedure:
$$\begin{aligned} {\mathrm {A}}_{X \cup \{{x}\}}={\mathrm {A}}_X + \begin{bmatrix} f_1\left( {x}_{\hat{d}}\right) \\ \vdots \\ f_m\left( {x}_{\hat{d}}\right) \end{bmatrix} \begin{bmatrix} f_1\left( {x}_{\hat{d}}\right) \\ \vdots \\ f_m\left( {x}_{\hat{d}}\right) \end{bmatrix}^T, \end{aligned} \quad (16)$$
$$\begin{aligned} {b}_{X \cup \{{x}\}}={b}_X + \begin{bmatrix} f_1\left( {x}_{\hat{d}}\right) \\ \vdots \\ f_m\left( {x}_{\hat{d}}\right) \end{bmatrix} x_d. \end{aligned} \quad (17)$$
(b) The downdate procedure:
$$\begin{aligned} {\mathrm {A}}_{X \setminus \{{x}\}}={\mathrm {A}}_X - \begin{bmatrix} f_1\left( {x}_{\hat{d}}\right) \\ \vdots \\ f_m\left( {x}_{\hat{d}}\right) \end{bmatrix} \begin{bmatrix} f_1\left( {x}_{\hat{d}}\right) \\ \vdots \\ f_m\left( {x}_{\hat{d}}\right) \end{bmatrix}^T, \end{aligned} \quad (18)$$
$$\begin{aligned} {b}_{X \setminus \{{x}\}}={b}_X - \begin{bmatrix} f_1\left( {x}_{\hat{d}}\right) \\ \vdots \\ f_m\left( {x}_{\hat{d}}\right) \end{bmatrix} x_d. \end{aligned} \quad (19)$$
Proof
Thanks to Theorem 3, we can update the parameters \(A_X\) and \({b}_X\). Then we solve the system of linear equations \( A_X \upalpha ={b}_{X} \), obtaining \(\upalpha \) for \(X\cup \{{x}\}\) or \(X\setminus \{{x}\}\), respectively. In the last step, we can update or downdate the mean squared error (MSE) on the d-th coordinate by using the new values of \(A, {b}, \upalpha \).
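As a sketch of this step: maintaining \(A_X\) and \(b_X\) under the rank-one updates (16)–(19) and solving \(A_X\upalpha = b_X\) recovers the least-squares coefficients of the active function. The monomial basis \(f_i(x)=x^{i-1}\) below is only an illustrative choice of the predefined function family; the names are hypothetical.

```python
import numpy as np

def features(x, m):
    """Basis functions f_1..f_m evaluated at scalar x; a monomial basis is
    assumed here purely for illustration."""
    return np.array([x ** i for i in range(m)])

def add_point(A, b, x_hat, x_d, m=3):
    """Rank-one update of the normal equations (16)-(17) after adding a
    point with independent coordinates x_hat and dependent value x_d."""
    f = features(x_hat, m)
    return A + np.outer(f, f), b + f * x_d

def remove_point(A, b, x_hat, x_d, m=3):
    """Downdate (18)-(19): subtract the same rank-one terms."""
    f = features(x_hat, m)
    return A - np.outer(f, f), b - f * x_d

def coefficients(A, b):
    """Solve A @ alpha = b for the regression coefficients alpha
    (A must be nonsingular, i.e., enough distinct points)."""
    return np.linalg.solve(A, b)
```

Because every point contributes a single outer product, the cost of a Hartigan switch is O(m²) per cluster plus one m×m solve, independent of the cluster size.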
Theorem 4
Proof
This can always be done, provided that the matrix \(\varSigma _X\) or \(\varSigma _{X \setminus \{{x}\}}\) (depending on whether we add or remove the point \({x}\) from the dataset X) is nonsingular. Having the values of \(\{\alpha _1,\ldots ,\alpha _k\}\), one can immediately obtain the desired value.
5 Algorithm
In this section, we present our algorithm. The aim of Hartigan’s method is to find a partition \(X_1,\ldots ,X_k\) of X for which the cost function (4) is as close as possible to the minimum, by subsequently reassigning the membership of elements of X.
To explain Hartigan’s approach more precisely, we need the notion of a group membership function \( {\mathrm{gr}} : \{1,\ldots ,n\} \rightarrow \{ 0,\ldots ,k\}, \) which describes the membership of the ith element, where the value 0 is a special symbol denoting that \({x}_i\) is as yet unassigned. In other words, if \({\mathrm{gr}}(i) = l > 0\), then \({x}_i\) is a part of the lth group, and if \({\mathrm{gr}}(i) = 0\), then \({x}_i\) is unassigned.
In Algorithm 1, we present pseudocode of the method. The algorithm starts from an initial clustering, which can be obtained randomly or with the use of k-means++. In our case, we assume that an initial clustering is given by \({\mathrm {cl}}\) (the number of clusters is given by k). At the beginning, the algorithm calculates the initial values of the parameters which describe each cluster.
The main loop then repeats the following steps:
1. If the chosen element \(x_i\) is unassigned, assign it to the first nonempty group;
2. Reassign \(x_i\) to the group for which the decrease in cross-entropy is maximal;
3. Check whether any group needs to be removed/unassigned; if so, unassign all of its elements.
To implement Hartigan’s approach, we still have to add a condition specifying when to unassign a given group. For example, in the case of afCEC clustering in \({\mathbb {R}}^d\), to avoid overfitting we cannot consider clusters which contain fewer than \(d + 1\) points. In practice, while applying Hartigan’s approach to discrete data, we usually remove clusters which contain less than five percent of the dataset.
Observe that in the crucial step of Hartigan’s approach, we compare the cross-entropy after and before a switch, where the switch removes a given point from one cluster and adds it to another. It means that to apply Hartigan’s approach efficiently in clustering, it is essential to update/downdate the parameters when we add/delete a point from a group by using the formulas from Sect. 4.
6 Experiments
In this section, we present a comparison of the Hartigan version of afCEC with density-based methods: GMM, CEC, and Lloyd’s afCEC. It is difficult to compare methods which use different numbers of parameters to approximate the data. In general, a more complex model can fit the data better. Therefore, we use indexes which measure the goodness of fit while penalizing more complicated models.
Let us analyze the two components of the AIC. The first component, \(-2\mathrm{LL}\), is based on the value of the likelihood function, which is the probability of obtaining the data given the candidate model; it measures how well the model fits the data. Since the likelihood function’s value is multiplied by \(-2\), ignoring the second component, the model with the minimum AIC is the one with the highest value of the likelihood function.
However, to this first component we add an adjustment based on the number of estimated parameters. The more parameters, the greater the amount added to the first component, increasing the value of the AIC and penalizing the model. Hence, there is a trade-off: the better fit obtained by making a model more complex, i.e., by adding parameters, must be weighed against the penalty imposed for those parameters. This is why the second component of the AIC is thought of as a penalty.
The Bayesian information criterion (BIC) is another model selection criterion based on information theory but set within a Bayesian context. The difference between the BIC and the AIC is the greater penalty imposed for the number of parameters by the former than the latter.
Consequently, we need the number of parameters used by each model. In the case of \({\mathbb {R}}^2\), afCEC uses two scalars for the mean, three for the covariance matrix, and three for the parabola. It should be emphasized that afCEC must also remember which coordinate is the dependent one; this parameter is discrete, so we do not count it in our investigation.
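For reference, the two criteria can be computed as follows. The per-cluster count is the one stated above for afCEC in \({\mathbb {R}}^2\) (2 + 3 + 3 = 8 continuous scalars); whether mixing weights are also counted is not specified here, so the helper is illustrative only.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion: -2LL plus a 2-per-parameter penalty;
    smaller is better."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_points):
    """Bayesian information criterion: the penalty grows with log(n), so it
    punishes extra parameters harder than AIC once n exceeds e^2 ~ 7.4."""
    return n_params * math.log(n_points) - 2 * loglik

def afcec_params_2d(k):
    """Continuous per-cluster count stated in the text for afCEC in R^2:
    2 (mean) + 3 (covariance) + 3 (parabola) = 8 scalars per cluster."""
    return 8 * k
```

For a fixed log-likelihood, BIC therefore always exceeds AIC on datasets with more than about eight points, which is why BIC selects smaller models in Tables 1 and 2.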
6.1 The computational times
One can observe that in the case of higher dimensions, both afCEC methods give slightly worse results, since the regression function must be fitted with respect to every possible dependent variable. It should be highlighted that the application of the afCEC method in high-dimensional spaces is rather limited. The CEC and GMM methods give comparable results for large datasets.
In the case of data with an increasing number of elements, we can observe that the afCEC method gives results comparable to the GMM approach. The method can be applied even to reasonably large datasets. We can observe that Lloyd’s approach gives slightly better times than Hartigan’s algorithm, since we do not have to update parameters in each step. However, the online version of the method obtains a better minimum of the cost function and, consequently, a better clustering, see Tables 1 and 2.
6.2 2D dataset
Chinese characters consist of straight-line strokes (horizontal, vertical) and curved strokes (slash, backslash, and many types of hooks). GMM has already been employed for analyzing the structure of Chinese characters and has achieved commendable performance [46]. However, some lines extracted by GMM may be too short, and it is quite difficult to join these short lines into semantic strokes due to the ambiguity of joining them. This problem becomes more serious when analyzing handwritten characters with GMM, which was the motivation to use afCEC to represent Chinese characters, see Fig. 5.
Comparison of afCEC, GMM, CEC, and Lloyd’s afCEC in the case of the 2D datasets

Number of clusters  GMM (MLE / BIC / AIC)  CEC (MLE / BIC / AIC)  afCEC Lloyd (MLE / BIC / AIC)  afCEC Hartigan (MLE / BIC / AIC)
 4  − 1911.63  3954.45  3869.26  − 1875.23  3881.66  3796.47  − 1808.89  3794.60  3679.78  − 1800.30  3777.42  3662.60 
5  − 1903.03  3971.46  3864.05  − 1846.76  3858.94  3751.53  − 1757.44  3737.33  3592.88  − 1756.63  3735.70  3591.25  
6  − 1901.57  4002.77  3873.14  − 1832.37  3864.38  3734.75  − 1756.91  3781.90  3607.82  − 1750.71  3769.50  3595.42  
7  − 1887.10  4008.05  3856.20  − 1802.71  3839.28  3687.43  − 1754.28  3822.26  3618.55  − 1744.65  3803.00  3599.30  
8  − 1874.74  4017.55  3843.47  − 1782.80  3833.67  3659.59  − 1755.03  3869.41  3636.07  − 1746.00  3851.34  3618.01  
 6  − 2078.75  4358.51  4227.51  − 1983.19  4167.39  4036.39  − 1966.84  4203.60  4027.68  − 1874.89  4019.71  3843.79 
9  − 2060.61  4425.61  4227.23  − 1932.41  4169.20  3970.82  − 1807.87  4023.50  3757.74  − 1744.43  3896.61  3630.86  
12  − 2033.28  4474.31  4208.55  − 1852.89  4113.52  3847.77  − 1706.21  3958.01  3602.42  − 1503.52  3552.63  3197.04  
15  − 1996.56  4504.24  4171.11  − 1817.99  4147.11  3813.98  − 1566.34  3816.10  3370.68  − 1247.54  3178.51  2733.09  
18  − 1986.85  4588.20  4187.70  − 1743.26  4101.02  3700.52  − 1385.55  3592.34  3057.09  − 1117.97  3057.18  2521.93  
 15  548.64  − 453.51  − 919.29  1215.48  − 1787.17  − 2252.95  1219.47  − 1578.17  − 2200.95  1321.22  − 1781.65  − 2404.44 
20  611.74  − 362.69  − 985.47  1574.88  − 2288.97  − 2911.76  1487.02  − 1823.92  − 2656.04  1629.39  − 2108.67  − 2940.79  
25  711.73  − 345.68  − 1125.47  1605.02  − 2132.25  − 2912.03  1709.81  − 1980.17  − 3021.63  1692.28  − 1945.11  − 2986.56  
30  845.66  − 396.54  − 1333.32  1673.03  − 2051.28  − 2988.07  1739.32  − 1749.84  − 3000.64  1731.93  − 1735.06  − 2985.86  
35  933.43  − 355.07  − 1448.86  1703.08  − 1894.36  − 2988.15  1709.58  − 1401.02  − 2861.15  1755.73  − 1493.33  − 2953.46  
 2  108.58  − 146.39  − 195.17  255.27  − 439.77  − 488.55  361.76  − 627.00  − 693.52  361.76  − 627.00  − 693.52 
4  265.92  − 383.85  − 485.84  533.22  − 918.45  − 1020.45  951.59  − 1703.72  − 1841.19  1031.43  − 1863.38  − 2000.85  
6  358.94  − 492.67  − 647.88  953.03  − 1680.86  −1836.07  1168.02  − 2033.61  − 2242.03  1197.57  − 2092.72  − 2301.15  
8  486.40  − 670.38  − 878.80  1115.54  − 1928.67  − 2137.09  1234.26  − 2063.14  − 2342.52  1239.19  − 2073.00  − 2352.37  
10  519.47  − 659.30  − 920.94  1186.66  − 1993.68  − 2255.32  1267.15  − 2025.97  − 2376.30  1253.97  − 1999.62  − 2349.95  
 2  164.26  − 257.99  − 306.52  325.41  − 580.30  − 628.83  406.28  − 716.37  − 782.55  406.28  − 716.37  − 782.55 
4  265.27  − 383.07  − 484.55  835.15  − 1522.82  − 1624.29  928.93  − 1659.10  − 1795.87  955.98  − 1713.19  − 1849.96  
6  455.91  − 687.40  − 841.81  1018.12  − 1811.83  − 1966.24  1167.58  − 2033.80  − 2241.15  1175.12  − 2048.89  − 2256.24  
8  549.31  − 797.27  − 1004.63  1115.11  − 1928.87  − 2136.22  1199.06  − 1994.17  − 2272.12  1202.64  − 2001.33  − 2279.27  
10  701.89  − 1025.48  − 1285.78  1164.62  − 1950.93  − 2211.23  1220.49  − 1934.45  − 2282.99  1222.57  − 1938.61  − 2287.15 
6.3 3D scans of objects
In this subsection, we present how our method works in the case of segmentation of 3D objects. As before, we report the results of afCEC, GMM, and CEC, see Table 2. We show how the log-likelihood, BIC, and AIC functions change as the number of clusters increases. As we can see, for similar values of the log-likelihood function, afCEC needs fewer clusters than GMM and CEC. Moreover, we also obtain better values of BIC and AIC, see the last three examples in Table 2.
The effect of afCEC on 3D objects [2, 3] is shown in Fig. 7. Since afCEC is able to cluster data on submanifolds of \({\mathbb {R}}^d\), it can fit the strongly nonlinear structures of 3D scans of objects. Moreover, the afCEC method automatically reduces unnecessary clusters, which removes components that are too small.
Comparison of afCEC, GMM, CEC, and Lloyd’s afCEC in the case of the 3D datasets

Number of clusters  GMM (MLE / BIC / AIC)  CEC (MLE / BIC / AIC)  afCEC Lloyd (MLE / BIC / AIC)  afCEC Hartigan (MLE / BIC / AIC)
 7  − 96695  194025  193528  − 91273  183183  182685  − 90330  181490  180841  − 90310  181448  180800 
12  − 94030  189156  188298  − 87799  176695  175837  − 85866  173160  172042  − 85399  172225  171108  
17  − 92406  186368  185150  − 85172  171900  170682  − 84234  170494  168908  − 84293  170612  169026  
22  − 91422  184861  183282  − 82912  167842  166262  − 83148  168921  166866  − 82882  168390  166335  
27  − 90100  182679  180739  − 81909  166297  164357  − 80870  164964  162440  − 80859  164941  162418  
 7  − 100604  201844  201346  − 96919  194474  193976  − 94846  190522  189873  − 93827  188483  187834 
12  − 98592  198281  197423  − 94623  190343  189485  − 90181  181791  180673  − 91202  183831  182714  
17  − 97041  195639  194420  − 93494  188544  187326  − 89501  181029  179443  − 89109  180245  178659  
22  − 96087  194191  192612  − 92705  187428  185849  − 87648  177921  175866  − 88383  179391  177336  
27  − 95588  193654  191714  − 91714  185905  183966  − 86946  177117  174593  − 86081  175385  172862  
 7  61719  − 122804  − 123301  66911  − 133187  − 133685  71048  − 141268  − 141917  71276  − 141723  − 142372 
12  63780  − 126464  − 127323  71130  − 141164  − 142022  75468  − 149509  − 150627  75542  − 149656  − 150774  
17  64645  − 127734  − 128953  72913  − 144270  − 145489  78344  − 154662  − 156248  77792  − 153557  − 155144  
22  65849  − 129681  − 131260  74732  − 147447  − 149026  80055  − 157485  − 159540  80558  − 158491  − 160546  
27  66635  − 130792  − 132732  75575  − 148673  − 150613  81276  − 159330  −161853  81561  − 159898  − 162422  
 7  − 222564  445765  445267  − 216654  433944  433447  − 216213  433255  432606  − 215943  432715  432066 
12  − 221373  443842  442984  − 214305  429707  428849  − 211936  425301  424183  − 210847  423121  422004  
17  − 220538  442633  441414  − 212444  426444  425226  − 208329  418685  417099  − 207095  416217  414631  
22  − 219191  440400  438820  − 211014  424046  422467  − 206140  414905  412850  − 204798  412222  410167  
27  − 218504  439485  437546  − 209790  422058  420118  − 205804  414831  412308  − 202849  408923  406399  
 7  − 284456  569549  569051  − 278484  557603  557106  − 277287  555403  554754  − 276928  554686  554037 
12  − 282304  565705  564847  − 272508  546112  545254  − 271694  544815  543698  − 271117  543662  542544  
17  − 281074  563705  562486  − 269244  540046  538827  − 268625  539277  537691  − 267457  536940  535354  
22  − 280188  562394  560815  − 267147  536312  534733  − 267345  537316  535261  − 265938  534502  532447  
27  − 279772  562021  560082  − 266101  534679  532740  − 266111  535445  532922  − 264731  532686  530162 
6.4 Comparison with nondensitybased methods
Now we present a comparison between afCEC and classical approaches dedicated to clustering of nonlinear datasets: kernel k-means [24] and spectral clustering [31] (see Fig. 8). We also use recent modifications of the classical methods dedicated to nonlinear data: STSC [45], SMMC [43], and SSC [7, 44, 48]. In this subsection, we compare the algorithms with respect to the Rand and Jaccard indexes, see Fig. 9.
Kernel methods can be used as preprocessing for classical clustering methods; in this way, spectral clustering methods were constructed [5, 24, 31]. The classical kernel k-means [24] is equivalent to performing KPCA prior to the conventional k-means algorithm. Most kernel methods consist of two steps: an embedding into a feature space, followed by a classical clustering method applied to the transformed data. Therefore, spectral methods are typically time-consuming and use a large number of parameters.
6.5 Data streams
Typical statistical and data mining methods, including clustering, work with “static” datasets, meaning that the complete dataset is available as a whole to perform all necessary computations. However, in recent years more and more applications need to work with data which is not static but is the result of a continuous data generating process which is likely to evolve over time. This type of data is called a data stream, and dealing with data streams has become an increasingly important area of research.
Data stream clustering algorithms are subject to the following requirements:
- Bounded storage: the algorithm can only store a very limited amount of data to summarize the data stream;
- Single pass: the incoming data points cannot be permanently stored and need to be processed at once, in arrival order;
- Real-time: the algorithm has to process data points, on average, at least as fast as the data arrive;
- Concept drift: the algorithm has to be able to deal with a data-generating process which evolves over time (e.g., distributions change or a new structure in the data appears).
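These requirements are easier to see in code. The sketch below is not the afCEC stream algorithm, only the bounded-storage, single-pass pattern: each cluster is summarized by its count, mean, and covariance (updated with the formulas of Sect. 4, never storing the points), and the distance threshold `radius` is a hypothetical parameter for deciding when to open a new cluster.

```python
import numpy as np

class StreamCluster:
    """Bounded-memory summary of one cluster: only n, mean, and covariance
    are stored, never the points themselves (O(d^2) per update)."""
    def __init__(self, x):
        self.n = 1
        self.mean = np.array(x, float)
        self.cov = np.zeros((len(x), len(x)))

    def add(self, x):
        # Online mean/covariance update (see Sect. 4).
        p1, p2 = self.n / (self.n + 1), 1 / (self.n + 1)
        d = self.mean - x
        self.mean = p1 * self.mean + p2 * x
        self.cov = p1 * self.cov + p1 * p2 * np.outer(d, d)
        self.n += 1

def cluster_stream(points, radius=1.0):
    """Single pass over the stream: each arriving point joins the nearest
    summary within `radius`, otherwise it opens a new cluster."""
    clusters = []
    for x in points:
        x = np.asarray(x, float)
        best = min(clusters, key=lambda c: np.linalg.norm(c.mean - x),
                   default=None)
        if best is not None and np.linalg.norm(best.mean - x) <= radius:
            best.add(x)
        else:
            clusters.append(StreamCluster(x))
    return clusters
```

Storage is proportional to the number of clusters, not the number of points, and each item is touched exactly once, which satisfies the bounded-storage and single-pass requirements above; handling concept drift would additionally require forgetting or split-and-merge mechanisms.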
Figure 10c shows the Rand index for the four data stream clustering algorithms and the afCEC method over the evolving data stream. All algorithms show that separating the two clusters is impossible around position 3000, when the two clusters overlap. It should be highlighted that the afCEC method has a problem with reconstructing the model after the merge: the number of clusters is reduced and cannot be restored. It is possible to add a split-and-merge strategy [13], which would allow refitting the afCEC model. The second possible strategy is to add an additional dimension with a time component. Since afCEC is affine invariant, this does not change the clustering structure and allows keeping the two clusters without reduction.
In general, afCEC works well when a dataset contains curve-type structures. In the second example, we present how the methods work on such data. Similarly to the previous examples, we consider two curve-type clusters, where the first moves from top left to bottom right, and the other moves from bottom left to top right. Figure 10b shows plots of the clusters moving over time; arrows are added to highlight the direction of cluster movement. Figure 10d shows the Rand index. As we can see, afCEC is able to almost perfectly recover the original clustering.
7 Conclusions
In this paper, the Hartigan approach to the afCEC method for clustering curved data, which uses generalized Gaussian distributions in curvilinear coordinate systems, was presented. The afCEC method has a strong theoretical background. Moreover, afCEC can be used as a density estimation model. Since afCEC is an implementation of the cross-entropy clustering approach, the method reduces unnecessary clusters online.
In practice, the algorithm gives essentially better results than linear models, like GMM or CEC, and the classical Lloyd approach to afCEC, since we obtain a similar level of the log-likelihood function by using a smaller number of parameters to describe the model. Moreover, the online version of the afCEC method can be used in the case of stream data.
In the future, we want to extend our algorithm to allow the use of closed curves. Thanks to such a modification, we will be able either to find more complicated shapes in data or to better adapt to the data structure.
Footnotes
1. Both Hartigan’s and the classical Lloyd’s approach are included in the R package afCEC: https://cran.r-project.org/web/packages/afCEC/index.html.
Acknowledgements
The work of P. Spurek was supported by the National Centre of Science (Poland) Grant No. 2015/19/D/ST6/01472. The work of K. Byrski was supported by the National Centre of Science (Poland) Grant No. 2015/19/D/ST6/01472. The work of J. Tabor was supported by the National Centre of Science (Poland) Grant No. 2017/25/B/ST6/01271.
References
 1. Bock HH (2007) Clustering methods: a history of k-means algorithms. In: Bock HH (ed) Selected contributions in data analysis and classification. Springer, Berlin, pp 161–172
 2. Bronstein AM, Bronstein MM, Kimmel R (2006) Efficient computation of isometry-invariant distances between surfaces. SIAM J Sci Comput 28(5):1812–1836
 3. Bronstein AM, Bronstein MM, Kimmel R (2008) Numerical geometry of non-rigid shapes. Springer, Berlin
 4. Cayton L (2005) Algorithms for manifold learning. Univ Calif San Diego Tech Rep 12:1–17
 5. Chi SC, Yang CC (2006) Integration of ant colony SOM and k-means for clustering analysis. In: International conference on knowledge-based and intelligent information and engineering systems. Springer, Berlin, pp 1–8
 6. Cover TM, Thomas JA (2012) Elements of information theory. Wiley, Hoboken
 7. Elhamifar E, Vidal R (2013) Sparse subspace clustering: algorithm, theory, and applications. IEEE Trans Pattern Anal Mach Intell 35(11):2765–2781
 8. Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396
 9. Forgy EW (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21:768–769
 10. Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41(8):578–588
 11. Goldberger J, Roweis ST (2004) Hierarchical clustering of a mixture model. In: Proceedings of advances in neural information processing systems, pp 505–512
 12. Hahsler M, Bolanos M, Forrest J (2017) Introduction to stream: an extensible framework for data stream clustering research with R. J Stat Softw 76(14):1–50
 13. Hajto K, Kamieniecki K, Misztal K, Spurek P (2017) Split-and-merge tweak in cross entropy clustering. In: IFIP international conference on computer information systems and industrial management. Springer, Berlin, pp 193–204
 14. Hartigan JA (1975) Clustering algorithms. Wiley, New York
 15. Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. Appl Stat 28:100–108
 16. Hastie T, Stuetzle W (1989) Principal curves. J Am Stat Assoc 84(406):502–516
 17. Jolliffe I (2002) Principal component analysis. Encycl Stat Behav Sci 30:487
 18. Kegl B (1999) Principal curves: learning, design, and applications. Ph.D. thesis, Citeseer
 19. Kohonen T (1989) Self-organizing feature maps. Springer, Berlin
 20. Kohonen T (2001) Self-organizing maps, vol 30. Springer, Berlin
 21. Kullback S (1997) Information theory and statistics. Dover, Mineola
 22. LeBlanc M, Tibshirani R (1994) Adaptive principal surfaces. J Am Stat Assoc 89(425):53–64
 23. Lebret R, Iovleff S, Langrognet F, Biernacki C, Celeux G, Govaert G (2015) Rmixmod: the R package of the model-based unsupervised, supervised and semi-supervised classification mixmod library. J Stat Softw 67:241–270
 24. Li J, Li X, Tao D (2008) KPCA for semantic object extraction in images. Pattern Recognit 41(10):3244–3250
 25. Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137
 26. MacQueen J et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol 1. Oakland, pp 281–297
 27. McLachlan G, Krishnan T (1997) The EM algorithm and extensions, vol 274. Wiley, Hoboken
 28. McLachlan G, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley, Hoboken
 29. McLachlan G, Peel D (2004) Finite mixture models. Wiley, Hoboken
 30. Narayanan H, Mitter S (2010) Sample complexity of testing the manifold hypothesis. In: Proceedings of advances in neural information processing systems, pp 1786–1794
 31. Ng AY, Jordan MI, Weiss Y et al (2002) On spectral clustering: analysis and an algorithm. Adv Neural Inf Process Syst 2:849–856
 32. Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10(5):1299–1319
 33. Silva JA, Faria ER, Barros RC, Hruschka ER, de Carvalho AC, Gama J (2013) Data stream clustering: a survey. ACM Comput Surv (CSUR) 46(1):13
 34. Śmieja M, Geiger BC (2017) Semi-supervised cross-entropy clustering with information bottleneck constraint. Inf Sci 421:254–271
 35. Śmieja M, Wiercioch M (2016) Constrained clustering with a complex cluster structure. Adv Data Anal Classif 11:1–26
 36. Spurek P (2017) General split Gaussian cross-entropy clustering. Expert Syst Appl 68:58–68
 37. Spurek P, Kamieniecki K, Tabor J, Misztal K, Śmieja M (2017) R package CEC. Neurocomputing 237:410–413
 38. Spurek P, Pałka W (2016) Clustering of Gaussian distributions. In: 2016 IEEE international joint conference on neural networks (IJCNN), pp 3346–3353
 39. Spurek P, Tabor J, Byrski K (2017) Active function cross-entropy clustering. Expert Syst Appl 72:49–66
 40. Tabor J, Spurek P (2014) Cross-entropy clustering. Pattern Recognit 47(9):3046–3059
 41. Telgarsky M, Vattani A (2010) Hartigan's method: k-means clustering without Voronoi. In: International conference on artificial intelligence and statistics, pp 820–827
 42. Wallace RS, Kanade T (1990) Finding natural clusters having minimum description length. In: 10th IEEE international conference on pattern recognition, 1990, vol 1, pp 438–442
 43. Wang Y, Jiang Y, Wu Y, Zhou ZH (2011) Spectral clustering on multiple manifolds. IEEE Trans Neural Netw 22(7):1149–1161
 44. Yan Q, Ding Y, Xia Y, Chong Y, Zheng C (2017) Class-probability propagation of supervised information based on sparse subspace clustering for hyperspectral images. Remote Sens 9(10):1017
 45. Zelnik-Manor L, Perona P (2005) Self-tuning spectral clustering. In: Advances in neural information processing systems, pp 1601–1608
 46. Zhang B, Zhang C, Yi X (2004) Competitive EM algorithm for finite mixture models. Pattern Recognit 37(1):131–144
 47. Zhang B, Zhang C, Yi X (2005) Active curve axis Gaussian mixture models. Pattern Recognit 38(12):2351–2362
 48. Zhang H, Zhai H, Zhang L, Li P (2016) Spectral-spatial sparse subspace clustering for hyperspectral remote sensing images. IEEE Trans Geosci Remote Sens 54(6):3672–3684
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.