
1 Introduction

The introduction of fuzzy logic into c-means clustering opened new horizons toward fine data partitioning, but also raised several obstacles that proved difficult to handle simultaneously. The early probabilistic clustering models introduced by Dunn [6] and generalized by Bezdek [4] are strongly influenced by outlier data and unevenly sized clusters. A solution to the problem of outliers was given by Dave [5], who relaxed the probabilistic constraint by introducing an extra cluster that attracts noisy data, but this method still creates clusters of equal diameter. The possibilistic c-means clustering algorithm [8] addresses unevenly sized clusters as well, but frequently creates coincident clusters [3]. Several solutions have been proposed to enable probabilistic clustering models to correctly handle clusters of different weight and/or diameter (e.g. [10, 11]), but these remain sensitive to outlier data due to their probabilistic constraints. Leski [9] recently proposed the so-called fuzzy c-ordered means algorithm, which achieves high robustness but drops the classical alternating optimization scheme of the fuzzy c-means algorithm.

In this paper we propose a possibilistic fuzzy c-means clustering approach, which employs an extra noise cluster whose prototype is situated at a constant distance from every input vector, and an adaptation mechanism that enables the algorithm to handle different cluster diameters.

2 Background

The fuzzy c-means algorithm. The conventional fuzzy c-means (FCM) algorithm partitions a set of object data \(\mathbf {X}=\{\mathbf {x}_1,\mathbf {x}_2,\dots ,\mathbf {x}_n\}\) into c clusters by minimizing a quadratic objective function, defined as:

$$\begin{aligned} J_{\mathrm {FCM}}=\sum \limits _{i=1}^{c}\sum \limits _{k=1}^{n} u_{ik}^m||\mathbf {x}_k-\mathbf {v}_i||_\mathbf {A}^2 = \sum \limits _{i=1}^{c}\sum \limits _{k=1}^{n}u_{ik}^md_{ik}^2, \end{aligned}$$
(1)

where \(\mathbf {v}_i\) represents the prototype or centroid of cluster i (\(i=1\dots c\)), \(u_{ik}\in [0,1]\) is the fuzzy membership function showing the degree to which vector \(\mathbf {x}_k\) belongs to cluster \(i\), \(m>1\) is the fuzzification parameter, and \(d_{ik}\) represents the distance (any inner product norm defined by a symmetric positive definite matrix \(\mathbf {A}\)) between \(\mathbf {x}_k\) and \(\mathbf {v}_i\). FCM uses a probabilistic partition, meaning that the fuzzy memberships assigned to any input vector \(\mathbf {x}_k\) with respect to the clusters satisfy the probability constraint \(\sum _{i=1}^{c} u_{ik}=1\). The minimization of the objective function \(J_{\mathrm {FCM}}\) is achieved by alternately optimizing \(J_{\mathrm {FCM}}\) over \(\{u_{ik}\}\) with \(\mathbf {v}_i\) fixed, \(i=1\dots c\), and over \(\{\mathbf {v}_{i}\}\) with \(u_{ik}\) fixed, \(i=1\dots c, k=1\dots n\) [4]. The optimization formulas are obtained from the zero gradient conditions of \(J_{\mathrm {FCM}}\) using Lagrange multipliers. Iterative optimization is applied until the cluster prototypes \(\mathbf {v}_i\) (\(i=1\dots c\)) converge.
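For reference, the alternating updates that result from these conditions can be sketched in a few lines of Python; this is a minimal illustration assuming the Euclidean norm (\(\mathbf {A}=\mathbf {I}\)), random initialization, and a fixed number of iterations instead of a convergence test:

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means sketch: X is an (n, dim) array; returns memberships U and prototypes V."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V = X[rng.choice(n, c, replace=False)]                    # initial prototypes
    for _ in range(n_iter):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)   # (c, n) squared distances
        U = np.fmax(d2, 1e-12) ** (-1.0 / (m - 1.0))
        U /= U.sum(axis=0, keepdims=True)                      # probabilistic constraint: columns sum to 1
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)             # prototype update
    return U, V
```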

Relaxing the probabilistic constraint. The relaxation of the probabilistic constraint was made necessary by the outlier sensitivity of the FCM algorithm. Two different ways of eliminating the constraint need to be mentioned here. Krishnapuram and Keller [8] introduced the possibilistic c-means (PCM) algorithm, which optimizes

$$\begin{aligned} J_{\mathrm {PCM}}=\sum \limits _{i=1}^{c}\sum \limits _{k=1}^{n}\left[ t_{ik}^pd_{ik}^2 + (1-t_{ik})^p \eta _i \right] , \end{aligned}$$
(2)

constrained by \(0 \le t_{ik} \le 1\ \forall i=1\dots c, \forall k=1\dots n\), and \(0< \sum _{i=1}^{c} t_{ik} < c\ \forall k=1\dots n\), where \(p>1\) represents the possibilistic exponent, and the parameters \(\eta _i\) are penalty terms that control the diameter of the clusters. The iterative optimization algorithm of the PCM objective function is derived from the zero gradient conditions of \(J_{\mathrm {PCM}}\). In a probabilistic fuzzy partition, the degree of membership assigned to an input vector \(\mathbf {x}_k\) with respect to cluster i depends on the distances of the given vector to all cluster prototypes: \(d_{1k}, d_{2k}, \dots , d_{ck}\). In a possibilistic partition, on the other hand, the typicality value \(t_{ik}\) assigned to input vector \(\mathbf {x}_k\) with respect to any cluster i depends on a single distance: \(d_{ik}\). PCM efficiently suppresses the effects of outlier data, at the price of frequently producing coincident cluster prototypes. The latter is a consequence of the strong independence of the cluster prototypes [3].
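To make this independence explicit, one can set the partial derivative of \(J_{\mathrm {PCM}}\) with respect to \(t_{ik}\) to zero; a short derivation (sketched here for illustration) gives

$$\begin{aligned} \frac{\partial J_{\mathrm {PCM}}}{\partial t_{ik}} = 0 \,\, \Rightarrow \,\, t_{ik} = \left[ 1 + \left( \frac{d_{ik}^2}{\eta _i}\right) ^{1/(p-1)}\right] ^{-1} , \end{aligned}$$

which indeed involves no distance other than \(d_{ik}\).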

On the other hand, Dave [5] introduced a noise cluster into the FCM algorithm, whose prototype is by definition situated at a constant distance \(d_0\) from any input vector \(\mathbf {x}_k\) (\(k=1\dots n\)). Thus, the objective function becomes

$$\begin{aligned} J_{\mathrm {Dave}}= \sum \limits _{i=0}^{c}\sum \limits _{k=1}^{n}u_{ik}^md_{ik}^2 = J_{\mathrm {FCM}} + \sum \limits _{k=1}^{n}u_{0k}^m d_0^2 , \end{aligned}$$
(3)

where the noise cluster is the one with index 0. The probabilistic constraint becomes \(\sum _{i=0}^{c} u_{ik}=1\ \forall k=1\dots n\), so the degrees of membership of any input vector with respect to the real clusters no longer sum up to 1. Outliers receive high degrees of membership towards the noise cluster, making the algorithm insensitive to outlier data without producing coincident clusters. However, this approach cannot handle clusters of different size or diameter.
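Since the noise cluster enters Eq. (3) exactly like an ordinary cluster with the constant distance \(d_{0k}=d_0\), the same Lagrangian argument as for FCM yields the membership update (a derived sketch):

$$\begin{aligned} u_{ik} = \frac{d_{ik}^{-2/(m-1)}}{\sum \limits _{j=0}^{c} d_{jk}^{-2/(m-1)}} , \quad i=0\dots c,\ k=1\dots n , \end{aligned}$$

so a distant outlier, whose distances to all real prototypes exceed \(d_0\), receives its largest membership in the noise cluster.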

Several fuzzy-possibilistic mixture partition models have been proposed to deal with the coincident clusters of PCM (e.g. [12, 13]). The most recent additive mixture model proposed by Pal et al. [13] called possibilistic-fuzzy c-means (PFCM) clustering minimizes

$$\begin{aligned} J_{\mathrm {PFCM}}=\sum \limits _{i=1}^{c}\sum \limits _{k=1}^{n} [a u_{ik}^m + b t_{ik}^p] d_{ik}^2 + \sum \limits _{i=1}^{c} \eta _i \sum \limits _{k=1}^{n} (1-t_{ik})^p , \end{aligned}$$
(4)

constrained by the conventional probabilistic and possibilistic conditions of FCM and PCM, respectively. Here a and b are two tradeoff parameters that control the strength of the probabilistic and possibilistic terms, respectively, in the mixed partition. All other parameters are the same as in FCM and PCM. This clustering model was found to be accurate and robust, but still sensitive to outlier data [14].
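As an illustration of how the two terms are blended, the zero gradient condition of \(J_{\mathrm {PFCM}}\) with respect to \(\mathbf {v}_i\) leads to a prototype update of the form (derived here from Eq. (4) as a sketch):

$$\begin{aligned} \mathbf {v}_i = \frac{\sum \limits _{k=1}^{n} \left( a u_{ik}^m + b t_{ik}^p\right) \mathbf {x}_k}{\sum \limits _{k=1}^{n} \left( a u_{ik}^m + b t_{ik}^p\right) } . \end{aligned}$$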

Fuzzy c-means with various cluster diameters. Komazaki and Miyamoto [7] present a collection of solutions for adapting the FCM algorithm to different cluster sizes and diameters. From the point of view of this paper, it is relevant to mention the FCMA algorithm of Miyamoto and Kurosawa [11], which minimizes

$$\begin{aligned} J_\mathrm {FCMA} = \sum _{i=1}^c\sum _{k=1}^n \alpha _i^{1-m} u_{ik}^m d_{ik}^2 , \end{aligned}$$
(5)

subject to the probabilistic constraint on the fuzzy memberships \(u_{ik}\) (\(i=1\dots c, k=1\dots n\)) and a similar constraint on the extra terms \(\alpha _i\) (\(i=1\dots c\)): \(\sum _{i=1}^c \alpha _i=1\). The optimization algorithm of \(J_\mathrm {FCMA}\) can be derived from zero gradient conditions using Lagrange multipliers. Each iteration updates the probabilistic memberships \(u_{ik}\) (\(i=1\dots c, k=1\dots n\)), the cluster prototypes \(\mathbf {v}_i\) (\(i=1\dots c\)), and the cluster diameter terms \(\alpha _i\) (\(i=1\dots c\)). The algorithm stops when the cluster prototypes stabilize.
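Applying the same Lagrangian procedure as in the next section (but without a noise cluster), the diameter terms of FCMA follow an update of the form (a derived sketch, not quoted from [11]):

$$\begin{aligned} \alpha _i = \frac{\left[ \sum \limits _{k=1}^{n} u_{ik}^m d_{ik}^2 \right] ^{1/m}}{\sum \limits _{j=1}^{c}\left[ \sum \limits _{k=1}^{n} u_{jk}^m d_{jk}^2 \right] ^{1/m}} , \quad i=1\dots c . \end{aligned}$$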

3 Methods

In the following, we propose a fuzzy c-means clustering model with a relaxed probabilistic constraint, which employs an extra noise cluster whose prototype is situated at a constant distance from every input vector, and an adaptation mechanism to different cluster diameters. The proposed objective function is:

$$\begin{aligned} J = \sum _{i=0}^c \sum _{k=1}^n \alpha _i^{1-m}u_{ik}^m d_{ik}^2 , \end{aligned}$$
(6)

where \(d_{ik} = ||\mathbf {x}_k - \mathbf {v}_i||\) (\(\forall {i=1\dots c} , \forall {k=1\dots n} \)), \(m>1\) is the fuzzy exponent, and \(d_{0k} = d_0\ \forall {k=1\dots n} \), where \(d_0\) is a predefined constant. The variables \(\alpha _i\) (\({i=0\dots c} \)) satisfy the probabilistic constraint \(\sum _{i=0}^c \alpha _i= 1\), while the fuzzy memberships obey the same probabilistic rule as in the FCM algorithm: \(\sum _{i=0}^c u_{ik}= 1\) for any \({k=1\dots n} \). The minimization formulas of the objective function given in Eq. (6) are obtained using zero gradient conditions and Lagrange multipliers. Let us consider the functional

$$\begin{aligned} \mathcal{L} = J + \sum _{k=1}^n \lambda _k \left( 1 - \sum _{i=0}^c u_{ik}\right) + \lambda _\alpha \left( 1 - \sum _{i=0}^c \alpha _i\right) , \end{aligned}$$
(7)

where \(\lambda _1 \dots \lambda _n\) and \(\lambda _\alpha \) are the Lagrange multipliers. The zero gradient conditions with respect to the fuzzy memberships \(u_{ik}\) (\(\forall i=0\dots c, \forall {k=1\dots n} \)) imply

$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial u_{ik}} = 0 \,\, \Rightarrow \alpha _i^{1-m}mu_{ik}^{m-1} d_{ik}^2 = \lambda _k , \end{aligned}$$
(8)

and so

$$\begin{aligned} u_{ik} = \left( \frac{\lambda _k}{m}\right) ^{1/(m-1)} \alpha _i d_{ik}^{-2/(m-1)} . \end{aligned}$$
(9)

According to the probabilistic constraint of fuzzy memberships, we have:

$$\begin{aligned} \sum _{j=0}^c u_{jk} = 1 \Rightarrow 1 = \left( \frac{\lambda _k}{m}\right) ^{1/(m-1)} \sum _{j=0}^c \alpha _j d_{jk}^{-2/(m-1)} . \end{aligned}$$
(10)

Equations (9) and (10) allow us to eliminate the Lagrange multiplier \(\lambda _k\) from the formula of \(u_{ik}\):

$$\begin{aligned} u_{ik} = \frac{u_{ik}}{1} = \frac{\left( \frac{\lambda _k}{m}\right) ^{1/(m-1)} \alpha _i d_{ik}^{-2/(m-1)}}{\left( \frac{\lambda _k}{m}\right) ^{1/(m-1)} \sum \limits _{j=0}^c \alpha _j d_{jk}^{-2/(m-1)}} = \frac{\alpha _i d_{ik}^{-2/(m-1)}}{\sum \limits _{j=0}^c \alpha _j d_{jk}^{-2/(m-1)}} . \end{aligned}$$
(11)
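A minimal Python sketch of this membership update, assuming the squared distances are stored in a \((c+1)\times n\) array whose row 0 holds the constant \(d_0^2\) (the variable names are illustrative only):

```python
import numpy as np

def update_memberships(d2, alpha, m=2.0):
    """Eq. (11): u_ik proportional to alpha_i * d_ik^(-2/(m-1)), with columns normalized to sum to 1."""
    U = alpha[:, None] * np.fmax(d2, 1e-12) ** (-1.0 / (m - 1.0))
    return U / U.sum(axis=0, keepdims=True)
```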

The optimization formula for \(\alpha _i\) (\(i=0\dots c\)) is obtained similarly. We start from the zero gradient condition

$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial \alpha _i} = 0 \Rightarrow \sum _{k=1}^n\left[ \alpha _i^{-m}(1-m) u_{ik}^m d_{ik}^2 \right] = \lambda _\alpha , \end{aligned}$$
(12)

which implies

$$\begin{aligned} (1-m)\alpha _i^{-m} \sum _{k=1}^n u_{ik}^m d_{ik}^2= \lambda _\alpha \Rightarrow \alpha _i^m = \left( \frac{1-m}{\lambda _\alpha }\right) \left[ \sum _{k=1}^n u_{ik}^m d_{ik}^2 \right] , \end{aligned}$$
(13)

and so we get

$$\begin{aligned} \alpha _i = \left( \frac{1-m}{\lambda _\alpha }\right) ^{1/m}\left[ \sum _{k=1}^n u_{ik}^m d_{ik}^2 \right] ^{1/m}. \end{aligned}$$
(14)
Algorithm 1. Pseudocode of the proposed clustering algorithm (figure).

On the other hand, the probabilistic constraint \(\sum _{j=0}^c \alpha _j=1\) implies:

$$\begin{aligned} \sum _{j=0}^c \alpha _j = 1 \Rightarrow 1 = \left( \frac{1-m}{\lambda _\alpha }\right) ^{1/m} \sum _{j=0}^c\left[ \sum _{k=1}^n u_{jk}^m d_{jk}^2 \right] ^{1/m} . \end{aligned}$$
(15)

Equations (14) and (15) allow us to eliminate the Lagrange multiplier \(\lambda _\alpha \) from the formula of \(\alpha _i\):

$$\begin{aligned} \alpha _i = \frac{\alpha _i}{1} = \frac{\left( \frac{1-m}{\lambda _\alpha }\right) ^{1/m}\left[ \sum \limits _{k=1}^n u_{ik}^m d_{ik}^2 \right] ^{1/m}}{\left( \frac{1-m}{\lambda _\alpha }\right) ^{1/m}\sum \limits _{j=0}^c\left[ \sum \limits _{k=1}^n u_{jk}^m d_{jk}^2 \right] ^{1/m}} = \frac{\left[ \sum \limits _{k=1}^n u_{ik}^m d_{ik}^2 \right] ^{1/m}}{\sum \limits _{j=0}^c\left[ \sum \limits _{k=1}^n u_{jk}^m d_{jk}^2 \right] ^{1/m}} . \end{aligned}$$
(16)
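The corresponding sketch of Eq. (16), with the same array conventions as above:

```python
def update_alphas(U, d2, m=2.0):
    """Eq. (16): alpha_i proportional to [sum_k u_ik^m d_ik^2]^(1/m), normalized to sum to 1."""
    s = ((U ** m) * d2).sum(axis=1) ** (1.0 / m)
    return s / s.sum()
```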

The update formula of cluster prototypes \(\mathbf {v}_i\) is obtained as:

$$\begin{aligned} \begin{array}{rcl}\frac{\partial \mathcal{L}}{\partial \mathbf {v}_{i}} = 0 &{} \Rightarrow &{} \sum \limits _{k=1}^n (-2) \alpha _i^{1-m} u_{ik}^m (\mathbf {x}_k - \mathbf {v}_i) = 0 \\ &{} \Rightarrow &{} \sum \limits _{k=1}^n u_{ik}^m \mathbf {x}_k = \mathbf {v}_i \sum \limits _{k=1}^n u_{ik}^m \\ &{} \Rightarrow &{} \mathbf {v}_i = \frac{\sum \limits _{k=1}^n u_{ik}^m \mathbf {x}_k}{\sum \limits _{k=1}^n u_{ik}^m} . \end{array} \end{aligned}$$
(17)
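And a sketch of the prototype update in Eq. (17); row 0 of the membership array belongs to the noise cluster, which has no prototype and is therefore skipped:

```python
def update_prototypes(U, X, m=2.0):
    """Eq. (17): each real prototype is the weighted mean of the data, with weights u_ik^m."""
    W = U[1:] ** m                         # drop the noise cluster (row 0)
    return (W @ X) / W.sum(axis=1, keepdims=True)
```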

If a defuzzified partition is desired, any input vector \(\mathbf {x}_k\) can be assigned to cluster number \(\arg \max \limits _i\{u_{ik},i=0\dots c\}\). Vectors assigned to cluster number 0 are detected outliers. The proposed algorithm is summarized in Algorithm 1.
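Putting the three updates together gives a compact sketch of the alternating optimization summarized in Algorithm 1, assuming Euclidean distances, random prototype initialization, a fixed iteration count instead of a convergence test, and the illustrative helper functions sketched above:

```python
import numpy as np

def noise_fcm(X, c, d0, m=2.0, n_iter=100, seed=0):
    """Sketch of the proposed algorithm: returns memberships, prototypes, diameter terms, and labels."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V = X[rng.choice(n, c, replace=False)]                    # initial prototypes
    alpha = np.full(c + 1, 1.0 / (c + 1))                     # uniform initial diameter terms
    for _ in range(n_iter):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)   # (c, n) squared distances
        d2 = np.vstack([np.full((1, n), d0 ** 2), d2])        # row 0: constant noise distance
        U = update_memberships(d2, alpha, m)                   # Eq. (11)
        alpha = update_alphas(U, d2, m)                        # Eq. (16)
        V = update_prototypes(U, X, m)                         # Eq. (17)
    labels = U.argmax(axis=0)                                  # label 0 marks detected outliers
    return U, V, alpha, labels
```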

4 Results and Discussion

The proposed method was evaluated on three different data sets, and its behavior was compared to FCM [4] and PFCM [13]. The first data set consisted of two groups of randomly generated two-dimensional input vectors, situated inside the circle with center at (0, 1) and radius 1.2, and the circle with center at \((0,-1)\) and radius 0.6, respectively. Each group contained 100 vectors. The input vectors are exhibited in Fig. 1, together with the partitions created by the proposed algorithm and its competitors. Adding an extra input vector situated at \((\delta ,0)\) and treating \(\delta \) as a parameter allowed us to evaluate the sensitivity of the clustering to outliers. FCM and PFCM were able to provide two valid clusters up to \(\delta = 132\) and \(\delta = 158\), respectively. The proposed method was not influenced by large values of \(\delta \); it assigned the extra vector to the third (outlier) cluster.
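For reproducibility, a synthetic configuration of this kind can be generated for instance as follows (uniform sampling inside each circle is our assumption; the exact sampling scheme is not specified above):

```python
import numpy as np

def sample_disc(center, radius, n, rng):
    """Uniform samples inside a disc, obtained by rejection sampling."""
    pts = []
    while len(pts) < n:
        p = rng.uniform(-radius, radius, size=2)
        if p @ p <= radius ** 2:
            pts.append(center + p)
    return np.array(pts)

rng = np.random.default_rng(0)
X = np.vstack([sample_disc(np.array([0.0, 1.0]), 1.2, 100, rng),
               sample_disc(np.array([0.0, -1.0]), 0.6, 100, rng)])
delta = 50.0                                    # outlier position parameter
X_out = np.vstack([X, [[delta, 0.0]]])          # extra vector at (delta, 0)
```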

The second data set employed in the numerical evaluation of the algorithms was the IRIS data set [1], which consists of 150 labeled four-dimensional feature vectors, organized in three groups of fifty vectors each. It is a reported fact that conventional clustering models like FCM produce 133–134 correct decisions when classifying the IRIS data. PFCM produced the best reported accuracy with 140 correct decisions using \(a=b=1\), \(m=p=3\), and initializing \(\mathbf {v}_i\) with terminal FCM prototypes [13]. Under less advantageous circumstances, PFCM reportedly produced 136–137 correct decisions. We first normalized the IRIS data set and included an outlier situated at \((\delta , \delta , \delta , \delta )\), where \(\delta \) was a variable parameter. We tested the clustering models for various values of the algorithm parameters and outlier positions \(\delta \). Figure 2 shows the outcome of these tests. FCM produces three valid clusters in case of \(\delta < 9\), PFCM can correctly handle the outlier for values of \(\delta \) up to 12, while the proposed method can deal with the outlier situated at any distance.

The third numerical test employed the WINE data set [2], which consists of 178 labeled feature vectors of 13 dimensions, organized in three groups of uneven cardinality. The WINE data set was initially normalized, and an outlier was included at the position \((\delta , \delta , \dots , \delta )\), where \(\delta \) represents a variable parameter. We tested the clustering models for various values of the algorithm parameters and outlier positions \(\delta \). Figure 3 exhibits the test results. FCM and PFCM can produce three valid clusters in case of \(\delta < 6\) and \(\delta < 7\), respectively, while the proposed method has no difficulty in correctly handling an outlier situated at any distance.

Fig. 1. The case of two groups of same cardinality but different diameter

Fig. 2. Evaluation using the IRIS data set: (left) correct decisions out of 150 vs. \(\delta \), (middle) CVI vs. \(\delta \), (right) final \(\alpha _i\) values of the proposed algorithm vs. \(\delta \)

Fig. 3. Evaluation using the WINE data set: (left) correct decisions out of 178 vs. \(\delta \), (middle) CVI vs. \(\delta \), (right) final \(\alpha _i\) values of the proposed algorithm vs. \(\delta \)

Based on all numerical tests, we can assert that the proposed clustering model performs better than its competitors, as: (1) it creates valid clusters according to cluster validity indexes proposed for the characterisation of such partitions; (2) it produces more correct decisions than previous algorithms; (3) it can correctly handle severe outlier vectors; (4) it is able to adapt the diameter of clusters to the input data to a considerable extent; (5) it does not produce coincident clusters; (6) it uses a reduced number of parameters (2 instead of the \(c+1\) parameters of PCM). Results reported in Figs. 1, 2 and 3 were obtained using the following parameter settings: \(m=2\) for FCM and the proposed method, and \(m=p=2\), \(a=b=1\) for PFCM. The possibilistic penalty terms \(\eta _i\) were not established based on a terminal FCM partition, as recommended by the authors of PFCM, because in many cases the terminal FCM partition was severely damaged by the presence of the outlier. Parameters \(\eta _i\) and \(d_0\) were used as constants whose values were set empirically to favor correct and fine partitions.

5 Conclusions

This paper proposed a possibilistic c-means clustering model with a reduced number of parameters (two instead of \(c+1\)) that can robustly handle distant outlier data. The advantageous properties of the algorithm were numerically validated using synthetic and standard test data sets. A formula for an optimal \(d_0\) distance may further enhance the robustness of the algorithm.