1 Introduction

Clustering is the task of finding meaningful groups within a collection of patterns or data points based on some pre-specified similarity/dissimilarity measure. Although there is no universally accepted criterion for such grouping, it is most often done in such a manner that the patterns in the same group (called a cluster) are as homogeneous as possible, while patterns from different clusters are maximally heterogeneous. Clustering plays a pivotal role in efficient data analysis by extracting essential and granular information from raw datasets. Unlike supervised learning or discriminant analysis, it involves no labeled data or training set. Clustering techniques find wide application in diverse areas of science and technology such as financial data analysis, web mining, spatial data processing, land cover study, medical image understanding, and social network analysis (Anderberg 2014).

In a broad sense, data clustering algorithms can be of two types: hierarchical or partitional (Jain et al. 1999). Hierarchical clustering seeks to build a hierarchy of clusters (using a tree-like structure called the dendrogram) following either the agglomerative or the divisive approach. Partitional clustering algorithms, on the other hand, attempt to partition the dataset directly into a given number of clusters, where each cluster is characterized by a vector prototype or cluster representative. These algorithms try to optimize a certain objective function involving the data points and the cluster representatives (e.g., a squared-error function based on the intra-cluster spread). They typically come in two variations. One is hard clustering, where each pattern is assigned to exactly one cluster. The other is fuzzy clustering, where each pattern can belong to all the clusters with a certain membership degree (in [0, 1]) for each of them.

In order to quantify the extent of homogeneity and/or heterogeneity between two data points, a dissimilarity measure is required. Dissimilarity measures are a generalized version of distance functions in the sense that the former need not obey the triangle inequality. However, the terms dissimilarity and distance are used almost synonymously in the literature. For sound clustering, it is important to choose a dissimilarity measure that is able to explore the inherent structure of the input data. It is equally important to put different degrees of emphasis on different features (the variables describing each data point) while computing the dissimilarity, since not all features contribute equally to the clustering task, and some even influence it negatively. Feature weighting is a technique for approximating the optimal degree of influence of the individual features. Each feature may be considered to have a relative degree of importance, called the feature weight, with value lying in the interval [0, 1]. The corresponding algorithm should have a learning mechanism that adapts the weights in such a manner that a noisy or derogatory feature ends up with a very small weight value, thus contributing insignificantly to the dissimilarity computation. If the weight values are confined to either 0 or 1 (i.e., bad features are completely eliminated), then feature weighting reduces to feature selection; this can be achieved simply by thresholding the numerical weights, as illustrated below. Feature selection can significantly reduce the computational complexity of the learning process with negligible loss of vital information.
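As a toy illustration of this reduction (a minimal sketch, not part of any algorithm developed later in the paper), thresholding a learned weight vector turns feature weighting into feature selection; the weight values and the threshold below are arbitrary choices made only for the example:

```python
import numpy as np

# Hypothetical feature weights learned by some clustering algorithm (they sum to 1).
w = np.array([0.42, 0.31, 0.24, 0.02, 0.01])

# Thresholding turns feature weighting into feature selection:
# features whose weight falls below the threshold are discarded.
threshold = 0.05
selected = np.where(w >= threshold)[0]

X = np.random.rand(100, 5)          # toy data with 5 features
X_reduced = X[:, selected]          # keep only the informative features
print(selected, X_reduced.shape)    # e.g. [0 1 2] (100, 3)
```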

Due to the use of the Euclidean distance, both k-means and FCM work best when the clusters present in the data are nearly hyper-spherical in shape. FCM has been formulated with other dissimilarity measures such as the Mahalanobis distance (Gustafson and Kessel 1978; Gath and Geva 1989) and the exponential distance (D’Urso et al. 2016). The same is true for the k-means algorithm; see, for example, works like (Hung et al. 2011; Mao and Jain 1996). The general class of inner product induced norm based Consistent Dissimilarity (IPINCD) measures (Saha and Das 2016b) is fairly broad and includes the Euclidean, Mahalanobis, and exponential distances as special cases. Several other dissimilarity measures, which can adapt themselves by estimating some kind of covariance measure of the clusters, can be derived from the definition of this class. In a recent work (Saha and Das 2016b), we discussed the convergence properties of the (hard) k-means and Fuzzy C-Means (FCM) clustering algorithms with the un-weighted IPINCD measures. In this article, we take one step forward and introduce a generalized feature weighting scheme for the IPINCD measures. We then develop the generalized k-means and FCM clustering algorithms with automated feature weight learning and present a complete convergence analysis of the algorithms thus derived. The main contributions of the paper are summarized as follows:

  1. We introduce a general feature weighting scheme for clustering methods with the generalized IPIN-based dissimilarity measures (IPINCWD). We treat the feature weights as optimizable parameters of the clustering problem. We develop a Lloyd heuristic and an Alternating Optimization (AO) algorithm to solve the automated feature-weighted k-means and FCM clustering problems respectively.

  2. We perform an in-depth analysis of the characteristics of the optimization problems underlying the newly developed algorithms. We address the issues of existence and uniqueness of solutions of the sub-optimization problems that form the basic structure of the general clustering algorithms.

  3. We theoretically prove the convergence of the newly developed feature-weighted k-means and FCM algorithms to a stationary point. We explore the nature of the stationary points under all possible situations.

  4. With a suitable choice of the dissimilarity measure, an automated feature-weighted version of the classical fuzzy covariance matrix based clustering algorithm (Gustafson and Kessel 1978) is derived as a special case of the proposed general clustering algorithm.

The rest of the paper is organized as follows. Section 2 provides a brief overview of the background of clustering, the use of dissimilarity measures for clustering, and feature weighting from the perspective of unsupervised learning. In Sect. 3, we define a general class of IPIN-based weighted dissimilarity measures along with a novel generalized feature weighting scheme. We also present a Lloyd-type and an AO algorithm to solve the weighted k-means and FCM clustering problems respectively. Mathematical analysis of the convergence properties of the algorithms, as well as the characteristics of the underlying optimization problems, is provided in Sect. 4. In Sect. 5, the relationship of the proposed algorithm with the existing feature weighting schemes and clustering algorithms is discussed. Section 6 presents illustrative experimental results to highlight the effectiveness of the proposed feature-weighted dissimilarity measures. In Sect. 7, we present a theoretical discussion of the asymptotic runtime of the proposed algorithm. Finally, Sect. 8 concludes the paper and outlines a few interesting avenues for future research.

2 Background

2.1 Partitional clustering

Partitional clustering algorithms learn to divide a dataset directly into a predefined number of clusters. They either move the data points iteratively between possible subsets or try to detect areas of high concentration of data as clusters (Berkhin 2006). This paper addresses two very popular partitional algorithms of the first kind. These are the k-means algorithm for hard clustering and FCM for fuzzy clustering. Both the algorithms, in their classical forms, attempt to fit the data points most appropriately into their clusters and are likely to yield convex clusters.

2.1.1 k-means clustering algorithm

The k-means algorithm iteratively assigns each data point to the cluster whose representative point or centroid is nearest to the data point with respect to some distance function. An optimal k-means clustering can be identified with a Voronoi diagram whose seeds are the cluster centroids. The classical k-means algorithm (MacQueen et al. 1967) minimizes the intra-cluster spread (measured by using the squared Euclidean distance as a dissimilarity measure) by using some heuristic to converge quickly to a local optimum. Lloyd’s heuristic (Lloyd 1982) is a popular choice among practitioners for optimizing the k-means objective function, and recently a performance guarantee of this method for well-clusterable situations has been presented (Ostrovsky et al. 2012). There have been several attempts (Modha and Spangler 2003; Teboulle et al. 2006) to extend the conventional hard k-means algorithm by considering objective functions involving dissimilarity measures other than the usual squared Euclidean distance. The general Bregman divergence (Banerjee et al. 2005) unified the set of all divergence measures for which using the arithmetic mean as the cluster representative guarantees a progressive decrease of the objective function over the iterations.
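For reference, the following is a bare-bones numpy sketch of Lloyd’s heuristic with the usual squared Euclidean distance; the random initialization and the simple stopping rule are illustrative choices only, not part of any specific algorithm discussed above:

```python
import numpy as np

def lloyd_kmeans(X, c, n_iter=100, seed=0):
    """Minimal Lloyd iteration: alternate the assignment and centroid steps."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), size=c, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # assignment step: nearest centroid under squared Euclidean distance
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each centroid becomes the mean of its cluster
        Z_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else Z[j]
                          for j in range(c)])
        if np.allclose(Z_new, Z):                      # stop when centroids stabilize
            break
        Z = Z_new
    return labels, Z

labels, Z = lloyd_kmeans(np.random.rand(200, 2), c=3)
```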

2.1.2 Fuzzy clustering algorithms

In fuzzy clustering, each cluster is treated as a fuzzy set and all data points can belong to it with a varying degree of membership. Perhaps the most popular algorithm in this area is the fuzzy ISODATA or Fuzzy C-means (FCM) (Dunn 1973) and its generalized version (Bezdek 1981). In FCM, each data point is assigned a membership degree to each cluster (quantifying how truly the data point belongs to that cluster). The numerical membership degree is inversely related to the relative distance of the data point from the corresponding cluster representative. FCM uses the Euclidean distance as the dissimilarity measure and an AO heuristic (Bezdek and Hathaway 2003) to locally minimize a cost function involving the cluster representatives and the membership values. Since the Euclidean distance has a bias towards forming hyper-spherical clusters, FCM has undergone significant changes in terms of the dissimilarity measures used. Gustafson and Kessel (1978) modified FCM by using the Mahalanobis distance, resulting in the well-known GK (Gustafson–Kessel) algorithm, which can capture hyper-ellipsoidal clusters of equal volume (Krishnapuram and Kim 1999) by estimating a fuzzy cluster covariance matrix. This algorithm has found several applications in pattern recognition and computer vision and is still a subject of active research (Chaomurilige et al. 2015). Gath and Geva (1989) extended the GK algorithm by considering the size and density of the clusters while computing the distance function.

A series of modifications of the Mahalanobis distance-based clustering algorithms was proposed by adding restrictions to the covariance matrix (Liu et al. 2007a, b) or replacing the cluster-specific covariance matrices by a single common covariance matrix (Liu et al. 2009a, b). Wu et al. (2012) showed that any distance function that preserves the local convergence of FCM (when the cluster representative is derived as an arithmetic mean) can be obtained from a class of continuously differentiable convex functions, called Point-to-Centroid Distance (P2C-D) by the authors. This class comprises the Bregman divergence and a few other divergences. Teboulle (2007) presented a generic algorithm for center-based soft and hard clustering methods with a broad class of distance-like functions. Recently, Saha and Das (2016a) designed an FCM algorithm with the separable geometric distance and demonstrated its robustness towards noise feature perturbations.

2.2 Feature weighting

Representing data with a minimal number of truly discriminative features is a fundamental challenge in machine learning, and it greatly alleviates the computational overhead of the learning process. We can project the data to a lower dimensional space by selecting only the relevant features from the entire set of available features (feature selection) or by generating a new set of features using a combination of all the existing ones. Feature weighting may be seen as a generalization of the feature selection process. Here the relative degree of importance of each feature is quantified as the feature weight, with value lying in the interval [0, 1]. Preliminary approaches to feature weighting for clustering can be found in works like Sneath et al. (1973) and Lumelsky (1982). In the SYNCLUS (SYNthesized CLUStering) algorithm (DeSarbo et al. 1984), first a k-means algorithm partitions the data and then a new group of weights for the various features is determined by optimizing a weighted mean-squared cost function. The algorithm executes these two steps iteratively until convergence to a set of optimal weights is achieved. De Soete (1988) proposed a feature weighting method for hierarchical clustering by using two objective functions to determine the weights for trees in ultrametric and additive forms. Makarenkov and Legendre (2001) adapted De Soete’s algorithm for k-means clustering and reduced the computation time by using the Polak–Ribiere optimization procedure to minimize the objective function involving the weights.

A few well-known fuzzy clustering algorithms have also been modified to accommodate the feature weighting strategy. Keller and Klawonn (2000) adapted the Euclidean distance metric of FCM by using cluster-specific weights for each feature. Modha and Spangler (2003) presented the convex k-means algorithm where the feature weights are determined by minimizing a ratio of the average within-cluster distortion to the average between-cluster distortion. Huang et al. (2005) introduced a new step in the k-means algorithm to refine the feature weights iteratively based on the current clustering state. This particular notion of automated weighting was later integrated with FCM (Nazari et al. 2013). Hung et al. (2011) proposed an exponential distance-based clustering algorithm with similar automated feature weight learning and spatial constraints for image segmentation. Recently Saha and Das (2015b) extended the weight learning strategy to fuzzy k-modes clustering of categorical data.

Optimization with respect to a symmetric, positive definite matrix ensures scaling with respect to the variances of the different variables. The main drawback of Mahalanobis clustering is that the variance-normalized squared distances are summed up with equal weights (Wölfel and Ekenel 2005). In the absence of noise variables, i.e. when each variable contributes to determining the underlying cluster structure, clustering with the Mahalanobis distance provides perfect results. But in the presence of a noise feature with extremely high values, the equal weighting subdues the importance of the other variables, which leads to undesirable results (Wölfel and Ekenel 2005). For a more detailed discussion of these issues, see the “Appendix”. In order to obtain a clustering method that is robust to noise variables even after variance normalization, we need a distance measure which assigns less weight to the noise variables and more weight to the features that actually contribute to determining the cluster structure.

2.3 Notations

A few words about the notation used: bold faced letters, e.g., \({\mathbf {x}},{\mathbf {y}}\) represent vectors. Calligraphic upper-case alphabets denote sets, e.g., \({\mathcal {X}}, {\mathcal {Y}}\). Matrices are expressed with upper-case bold faced letters, e.g., \({\mathbf {X}},{\mathbf {Y}}\). The symbols \({\mathbb {R}}, {\mathbb {N}}\), and \({\mathbb {R}}^{d}\) denote the set of real numbers, the set of natural numbers, and the d-dimensional real vector space respectively. Further, \({\mathbb {R}}_{+}\) denotes the set of non-negative real numbers.

3 Clustering with the IPIN-based weighted dissimilarity measures

In this section, following the philosophy of Klawonn and Höppner (2003), we introduce the general class of IPIN-based weighted dissimilarity measures in an axiomatic approach. It is a fairly general and large class of weighted dissimilarity measures and deals with a general weight function.

3.1 The IPIN-based weighted dissimilarity measures

Let \({\mathcal {M}}^{d}\) denote the class of all symmetric, positive definite matrices \({\mathbf {A}}\) with finite Frobenius norm (Hilbert–Schmidt norm) (Golub and Van Loan 2012), i.e. \(||{\mathbf {A}} ||_{2}^{2} < \infty \). The standard \((d-1)\)-dimensional simplex \({\mathcal {H}}^{d}\) is defined as follows:

$$\begin{aligned} {\mathcal {H}}^{d}= \left\{ (w_{1},w_{2},\ldots ,w_{d}) \in {\mathbb {R}}^{d} \left| \sum _{l=1}^{d} w_{l}=1, \, w_{l} \geqslant 0,\, 1 \leqslant l \leqslant d \right. \right\} \end{aligned}$$
(1)

Let,

$$\begin{aligned} {\mathbf {M}} \in {\mathcal {M}}^{d},\quad g : \left[ 0,1\right] \rightarrow {\mathbb {R}}_{+}, \; {\mathbf {w}} \in {\mathcal {H}}^{d}. \end{aligned}$$

Then we define \({\mathbf {M}}_{{\mathbf {w}},g}\) as follows:

$$\begin{aligned} {\mathbf {M}}_{{\mathbf {w}},g}= \begin{pmatrix} g(w_{1}) &{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} g(w_{d}) \end{pmatrix} {\mathbf {M}} \begin{pmatrix} g(w_{1}) &{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} g(w_{d}) \end{pmatrix}, \end{aligned}$$
(2)
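In matrix form, (2) simply scales the rows and columns of \({\mathbf {M}}\) by \(g(w_{l})\). The following is a minimal numpy sketch of this construction, with \(g(x)=x^{m}\) chosen purely for illustration:

```python
import numpy as np

def weighted_matrix(M, w, g):
    """M_{w,g} = diag(g(w)) @ M @ diag(g(w)), as in Eq. (2)."""
    G = np.diag(g(w))
    return G @ M @ G

M = np.array([[2.0, 0.3], [0.3, 1.0]])   # any symmetric positive definite matrix
w = np.array([0.7, 0.3])                 # a point of the simplex H^2
g = lambda x: x ** 2                     # one admissible choice, g(x) = x^m with m = 2
print(weighted_matrix(M, w, g))
```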

Definition 1

For any \(g : \left[ 0,1 \right] \rightarrow {\mathbb {R}}_{+}\), a function \(dist_{g}: {\mathcal {M}}^{d} \times {\mathbb {R}}^{d} \times {\mathbb {R}}^{d} \times {\mathcal {H}}^{d} \rightarrow {\mathbb {R}}_{+}\) is called an IPIN-based Consistent Weighted Dissimilarity (IPINCWD) measure with respect to some convex sets \({\mathcal {C}}_{1} \subseteq {\mathbb {R}}^{d}\) and \({\mathcal {C}}_{2} \subseteq {\mathcal {M}}^{d}\) if for some function \(h: {\mathbb {R}} \rightarrow {\mathbb {R}}_{+}\), \(dist_{g}({\mathbf {M}}, {\mathbf {y}}, {\mathbf {x}},{\mathbf {w}})=d_{{\mathbf {M}}_{{\mathbf {w}},g}}({\mathbf {y}},{\mathbf {x}})=h(({\mathbf {x}}-{\mathbf {y}})^{T}{\mathbf {M}}_{{\mathbf {w}},g}({\mathbf {x}}-{\mathbf {y}}))\), where the following assumptions hold:

  1. h is differentiable on \({\mathbb {R}}_{+}\).

  2. \(g : \left[ 0,1 \right] \rightarrow {\mathbb {R}}_{+}\) is a strictly increasing, differentiable function.

  3. \({\mathbf {M}} \rightarrow dist_{g}({\mathbf {M}}, {\mathbf {y}}, {\mathbf {x}},{\mathbf {w}})\) is a strictly convex or linear function on \({\mathcal {C}}_{2}, \forall {\mathbf {y}} \in {\mathcal {C}}_{1}, \forall {\mathbf {w}} \in {\mathcal {H}}^{d}\),

     \({\mathbf {y}} \rightarrow dist_{g}({\mathbf {M}}, {\mathbf {y}}, {\mathbf {x}},{\mathbf {w}})\) is a strictly convex function on \({\mathcal {C}}_{1}, \forall {\mathbf {M}} \in {\mathcal {C}}_{2}, \, \forall {\mathbf {w}} \in {\mathcal {H}}^{d}\), and

     \({\mathbf {w}} \rightarrow dist_{g}({\mathbf {M}}, {\mathbf {y}}, {\mathbf {x}},{\mathbf {w}})\) is a strictly convex function on \({\mathcal {H}}^{d}, \forall {\mathbf {M}} \in {\mathcal {C}}_{2},\; \forall {\mathbf {y}} \in {\mathcal {C}}_{1}.\)

We denote the family of functions \(dist_{g}\) satisfying the premises of Definition 1 by \({\mathcal {D}}_{g}({\mathcal {C}}_{1},{\mathcal {C}}_{2})\). The very definition of \(dist_{g}\) ensures that it is symmetric. Note that the definition of dist does not require the triangle inequality to hold and hence this particular class of dissimilarity measures need not be a metric or distance function in the true sense. The motivation behind the technical assumptions in Definition 1 will be evident from the mathematical development in Sect. 4. It should be noted that a sufficient condition for assumption 3 to hold is the increasing and convex nature of h and g at every point on their respective domains.

3.2 Examples of IPINCWD measures

We provide examples of two IPINCWD measures. The unweighted version of the first one of these has been extensively used for clustering (Saha and Das 2016b), while the other one has been observed to yield robust clustering being resistant to the presence of noise (Hung et al. 2011).

3.2.1 Weighted IPIN

The common IPIN is a popular distance measure for clustering (Gustafson and Kessel 1978). It is defined (corresponding to \({\mathbf {M}} \in {\mathcal {M}}^{d}\)) as follows,

$$\begin{aligned} ({\mathbf {x}}-{\mathbf {y}})^{T} {\mathbf {M}} ({\mathbf {x}}-{\mathbf {y}}),\, \, \, \forall {\mathbf {M}} \in {\mathcal {M}}^{d}, {\mathbf {y}}, {\mathbf {x}} \in {\mathbb {R}}^{d}. \end{aligned}$$

Hence, the weighted IPIN (corresponding to \({\mathbf {M}} \in {\mathcal {M}}^{d}\)) can be defined as follows \((g(x)=x^{m};\,m>1)\):

$$\begin{aligned} dist_{g}({\mathbf {M}},{\mathbf {y}},{\mathbf {x}},{\mathbf {w}})=({\mathbf {x}}-{\mathbf {y}})^{T} {\mathbf {M}}_{{\mathbf {w}},g} ({\mathbf {x}}-{\mathbf {y}}),\, \, \, \forall {\mathbf {M}} \in {\mathcal {M}}^{d}, \forall {\mathbf {y}}, {\mathbf {x}} \in {\mathbb {R}}^{d}, \forall {\mathbf {w}} \in {\mathcal {H}}^{d}. \end{aligned}$$

Hence, in this case, h is the identity function on non-negative real line, i.e. \(h(x)=x, \forall x \in {\mathbb {R}}_{+}\).

Now, h is trivially differentiable everywhere. \(({\mathbf {x}}-{\mathbf {y}})^{T} {\mathbf {M}}_{{\mathbf {w}},g} ({\mathbf {x}}-{\mathbf {y}})\) is strictly convex with respect to \({\mathbf {y}}\), hence so is \(dist_{g}({\mathbf {M}}, {\mathbf {y}}, {\mathbf {x}},{\mathbf {w}})\). Moreover, \(dist_{g}({\mathbf {M}}, {\mathbf {y}}, {\mathbf {x}},{\mathbf {w}})\) is a linear function with respect to \({\mathbf {M}}\). From the strict convexity of g, it follows that \(dist_{g}({\mathbf {M}}, {\mathbf {y}}, {\mathbf {x}},{\mathbf {w}})\) is a strictly convex function of \({\mathbf {w}}\). Hence this dissimilarity measure is a valid member of the IPINCWD class.

3.2.2 Weighted exponential IPIN

The exponential IPIN-based dissimilarity measure can be very helpful in achieving robust clustering since it has been shown to provide natural resistance against noise (Hung et al. 2011). Under the realistic assumption that the concerned \({\mathcal {C}}_{1}\) and \({\mathcal {C}}_{2}\) in Definition 1 are bounded, it can be defined as follows (\(g(x)=x^{m}\; m > 1\)):

$$\begin{aligned} dist_{g}({\mathbf {M}}, {\mathbf {y}}, {\mathbf {x}},{\mathbf {w}})=\exp \big \{({\mathbf {x}}-{\mathbf {y}})^{T} {\mathbf {M}}_{{\mathbf {w}},g} ({\mathbf {x}}-{\mathbf {y}})\big \},\, \, \, \forall {\mathbf {M}} \in {\mathcal {M}}^{d}, \forall {\mathbf {y}}, {\mathbf {x}} \in {\mathbb {R}}^{d};\, \forall {\mathbf {w}} \in {\mathcal {H}}^{d}. \end{aligned}$$

Hence, in this case, h is the exponential function on non-negative real line, i.e. \(h(x)=\exp (x), \forall x \in {\mathbb {R}}_{+}\). Here also h is trivially differentiable everywhere. Since, the composition of a strictly convex increasing function (exponential function in this case) with a convex function is again strictly convex, assumption 3 in Definition 1 is satisfied.

On the other hand, as far as g is concerned, some common choices (under the simplifying assumption that \(h(x)=x\)) are the following:

$$\begin{aligned} g(x)= & {} x^{\beta }, \beta>1,\\ g(x)= & {} \exp ({x}),\\ g(x)= & {} \exp ({x^{\beta }}), \beta > 1. \end{aligned}$$

Hence this measure also fits well into the IPINCWD class.
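The following is a minimal numerical sketch of the two IPINCWD measures discussed above; the helper weighted_matrix implements Eq. (2), and both dissimilarity functions are illustrative sketches rather than part of the formal development:

```python
import numpy as np

def weighted_matrix(M, w, g):
    """M_{w,g} = diag(g(w)) @ M @ diag(g(w)), Eq. (2)."""
    G = np.diag(g(w))
    return G @ M @ G

def weighted_ipin(M, y, x, w, g=lambda t: t ** 2):
    """Weighted IPIN: (x - y)^T M_{w,g} (x - y), i.e. h is the identity."""
    diff = x - y
    return diff @ weighted_matrix(M, w, g) @ diff

def weighted_exp_ipin(M, y, x, w, g=lambda t: t ** 2):
    """Weighted exponential IPIN: exp{(x - y)^T M_{w,g} (x - y)}, i.e. h = exp."""
    return np.exp(weighted_ipin(M, y, x, w, g))

M = np.eye(3)
x, y = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 1.0])
w = np.array([0.5, 0.3, 0.2])            # a point of the simplex H^3
print(weighted_ipin(M, y, x, w), weighted_exp_ipin(M, y, x, w))
```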

3.3 Problem formulation and algorithm development

In this section, we present the general class of clustering problems with IPIN-based weighted dissimilarity measures. We also develop a Lloyd heuristic and an AO algorithm to solve the k-means and FCM clustering problems respectively.

Let \({\mathcal {X}}=\{{\mathbf {x}}_{1}, {\mathbf {x}}_{2},\ldots , {\mathbf {x}}_{n}\}, {\mathbf {x}}_{i} \in {\mathbb {R}}^{d}, \forall i=1,2,\ldots ,n\) be the given set of patterns, which we want to partition into c (prefixed) clusters with \(2 \leqslant c \leqslant n\). Let \({\mathcal {B}} \subset {\mathbb {R}}^{d}\) be the convex hull of \({\mathcal {X}}\). The general clustering problem with any member of \({\mathcal {D}}_{g}({\mathcal {B}},{\mathcal {M}}^{d})\) (both \({\mathcal {B}}\) and \({\mathcal {M}}^{d}\) are convex, hence this class of IPINCWD measures is well-defined) is defined in the following way (fuzzifier \(m \geqslant 1, \rho _{j}>0, \forall j=1,2,\ldots ,c\)):

$$\begin{aligned} {\mathbf {P}}:\;\; { minimize}\, f_{m,{\varvec{\rho }},h,g} ({\mathbf {U}},{\mathcal {Z}},{\mathcal {S}},{\mathcal {W}})=\sum _{i=1}^{n} \sum _{j=1}^{c}\ u_{ij}^{m} d_{{\mathbf {\Sigma }_{j}}_{{\mathbf {w}}_{j},g}}({\mathbf {z}}_{j},{\mathbf {x}}_{i}), \end{aligned}$$
(3)

subject to

$$\begin{aligned}&\sum _{j=1}^{c}\ u_{ij}=1, \quad \forall i=1,2,\ldots , n, \end{aligned}$$
(4a)
$$\begin{aligned}&0<\sum _{i=1}^{n}\ u_{ij}<n, \quad \forall j=1,2,\ldots ,c, \end{aligned}$$
(4b)
$$\begin{aligned}&u_{ij} \in [0,1], \quad \forall i=1,2,\ldots ,n; \;\; \forall j=1,2,\ldots , c, \end{aligned}$$
(4c)
$$\begin{aligned}&{\mathcal {Z}}=\{{\mathbf {z}}_{1},{\mathbf {z}}_{2}, \ldots , {\mathbf {z}}_{c}\},\quad {\mathbf {z}}_{j} \in {\mathcal {B}} \subseteq {\mathbb {R}}^{d}, \;\; \forall j=1,2,\ldots , c; {\mathcal {Z}} \in {\mathcal {B}}^{c} \subset {\mathbb {R}}^{d \times c}, \end{aligned}$$
(4d)
$$\begin{aligned}&{\mathcal {S}}=\{ \mathbf {\Sigma }_{1},\mathbf {\Sigma }_{2}, \ldots , \mathbf {\Sigma }_{c}\},\quad \mathbf {\Sigma }_{j} \in {\mathcal {M}}^{d}, \;\;\forall j=1,2,\ldots ,c; {\mathcal {S}} \in {\mathcal {M}}^{d \times c}, \end{aligned}$$
(4e)
$$\begin{aligned}&|\mathbf {\Sigma }_{j} |= \rho _{j}, \quad \forall j=1,2,\ldots ,c, \end{aligned}$$
(4f)
$$\begin{aligned}&{\mathbf {w}}_{j} \in {\mathcal {H}}^{d}, \quad \forall j \in \{1,2,\ldots ,c\}; \, {\mathcal {W}} \in {\mathcal {H}}^{d \times c}. \end{aligned}$$
(4g)

To solve the k-means and FCM problems, we present in this section a Lloyd heuristic and an AO procedure respectively. The general algorithm is schematically presented as Algorithm 1.

[Algorithm 1: The general automated feature-weighted IPINCWD-based clustering algorithm]

For hard clustering, we fix \(m=1\), whereas for fuzzy clustering we take \(m>1\). The general procedure for solving the automated feature-weighted IPINCWD-based clustering problems is provided in Algorithm 1, and a schematic code sketch is given below.
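At a high level, the alternating scheme of Algorithm 1 can be sketched as follows; dist_g and the three update_* callables are placeholders standing for a chosen IPINCWD measure and the sub-problem solvers characterized in Sect. 4, so the snippet is a schematic outline under those assumptions, not the actual implementation:

```python
import numpy as np

def ipincwd_clustering(X, c, m, rho, dist_g, update_centroids,
                       update_matrices, update_weights, tol=1e-6, max_iter=100):
    """Schematic alternating-optimization loop for problem P in Eq. (3).

    The four sub-steps mirror Algorithm 1: memberships U, centroids Z,
    norm-inducing matrices S and feature weights W are updated in turn.
    """
    n, d = X.shape
    rng = np.random.default_rng(0)
    Z = X[rng.choice(n, size=c, replace=False)]                 # initial centroids
    S = [rho[j] ** (1.0 / d) * np.eye(d) for j in range(c)]     # so that |Sigma_j| = rho_j
    W = np.full((c, d), 1.0 / d)                                # uniform initial weights
    prev_obj = np.inf
    for _ in range(max_iter):
        # dissimilarities D[i, j] = dist_g(Sigma_j, z_j, x_i, w_j)
        D = np.array([[dist_g(S[j], Z[j], X[i], W[j]) for j in range(c)]
                      for i in range(n)])
        if m == 1:                                              # hard assignment (k-means)
            U = np.eye(c)[D.argmin(axis=1)]
        else:                                                   # fuzzy memberships (FCM-style)
            # note: zero distances (the singular case of Theorem 5) are not handled here
            ratio = (D[:, :, None] / D[:, None, :]) ** (1.0 / (m - 1))
            U = 1.0 / ratio.sum(axis=2)
        Z = update_centroids(X, U, S, W, m)                     # sub-problem P1
        S = update_matrices(X, U, Z, W, m, rho)                 # sub-problem P2
        W = update_weights(X, U, Z, S, m)                       # sub-problem P4
        obj = (U ** m * D).sum()                                # objective after the membership step
        if abs(prev_obj - obj) < tol:                           # simple stopping rule
            break
        prev_obj = obj
    return U, Z, S, W
```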

4 Convergence analysis of clustering with IPIN-based consistent dissimilarity measures

We carry out a full-fledged convergence analysis of the generic IPINCWD-based clustering procedure shown in Algorithm 1. First, we address the existence and uniqueness of solutions of the partial optimization problems with respect to the cluster representatives, the norm-inducing matrices, and the weights.

For the sake of notational simplicity, we define the following:

$$\begin{aligned} {\mathcal {U}}_{c,n}= & {} \{ {\mathbf {U}} \mid {\mathbf {U}} \, \text {is an} \, n \times c \, \text {real matrix and} \,{\mathbf {U}} \, \text {satisfies}\, \hbox {(4a)--(4c)}\},\\ {\mathcal {M}}_{d,{\rho }_{j}}= & {} \{ {\mathbf {M}} \in {\mathcal {M}}^{d} \mid |{\mathbf {M}} |= \rho _{j}\},\\ {\mathcal {M}}_{d,{\varvec{\rho }}}= & {} \{ {\mathbf {M}}_{1}, {\mathbf {M}}_{2},\ldots , {\mathbf {M}}_{c} \mid {\mathbf {M}}_{j} \in {\mathcal {M}}_{d,{\rho }_{j}}, \forall j=1,2,\ldots ,c\};\quad \forall {\varvec{\rho }} \in {\mathbb {R}}^{c}_{+}. \end{aligned}$$

Theorem 1

For fixed \({\mathbf {U}}^{*} \in {\mathcal {U}}_{c,n}\), \({\mathcal {W}}^{*}=\{{\mathbf {w}}_{1}^{*},{\mathbf {w}}_{2}^{*},\ldots , {{\mathbf {w}}}_{c}^{*}\}, {\mathbf {w}}_{j}^{*} \in {\mathcal {H}}^{d}, \forall j=1,2,\ldots ,c;\) and \({\mathcal {S}}^{*}=\{\mathbf {\Sigma }_{1}^{*},\mathbf {\Sigma }_{2}^{*},\ldots , \mathbf {\Sigma }_{c}^{*}\}, \mathbf {\Sigma }_{j}^{*} \in {\mathcal {M}}_{d,\rho _{j}}, \forall j=1,2,\ldots ,c\), the problem \({\mathbf {P}}_{1} : \, minimize \, f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}},{\mathcal {S}}^{*},{\mathcal {W}}^{*}), {\mathcal {Z}} \in {\mathcal {B}}^{c}\), has a unique solution.

Proof

The function to be minimized in problem \({\mathbf {P}}_{1}\) is a strictly convex function with respect to \({\mathcal {Z}}\) (from the assumptions on the IPINCWD measures in Definition 1, together with constraint (4b), which guarantees a positive total membership for every cluster), and the optimization task is carried out on a convex set. Hence, there exists at most one solution.

Now, the function under consideration is also continuous with respect to \({\mathcal {Z}}\) (from the assumptions on the IPINCWD measures in Definition 1). Thus it attains its extreme values on any closed and bounded (compact) set, and the feasible region \({\mathcal {B}}^{c}\) is indeed compact. Hence, the minimization task under consideration has at least one solution in the feasible region.

Combining the two statements above, we obtain the existence of a unique solution of the optimization task \({\mathbf {P}}_{1}\) in the feasible region. \(\square \)

Theorem 2

Let \(J_{1} : \; {\mathcal {B}}^{c} \rightarrow {\mathbb {R}}, J_{1}({\mathcal {Z}})=f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}},{\mathcal {S}}^{*},{\mathcal {W}}^{*}); {\mathbf {z}}_{j} \in {\mathcal {B}}, \forall j=1,2,\ldots , c\), where \({\mathbf {U}}^{*} \in {\mathcal {U}}_{c,n},\;{\mathcal {W}}^{*} \in {\mathcal {H}}^{d \times c},\; {\mathcal {S}}^{*} \in {\mathcal {M}}_{d,{\varvec{\rho }}}\) are fixed. Then \({\mathcal {Z}}^{*}\) is a global minimum of \(J_{1}\) if and only if \({\mathbf {z}}_{j}^{*}\) satisfies the following equation

$$\begin{aligned} \sum _{i=1}^{n}\ (u_{ij}^{*})^{m}h^{'}(({\mathbf {x}}_{i}-{\mathbf {z}}_{j}^{*})^{T} {\mathbf {\Sigma }_{j}^{*}}_{{\mathbf {w}}_{j},g} ({\mathbf {x}}_{i}-{\mathbf {z}}_{j}^{*})){\mathbf {\Sigma }_{j}^{*}}_{{\mathbf {w}}_{j},g}({\mathbf {x}}_{i}-{\mathbf {z}}_{j}^{*})=0;\;j=1,2,\ldots ,c. \end{aligned}$$
(5)

Proof

The condition in the theorem is derived as a first order necessary condition by setting the partial derivative of the objective function with respect to \({\mathbf {z}}_{j}\) equal to zero. From the strict convexity of the objective function with respect to \({\mathbf {z}}_{j}\), it follows that those conditions are also sufficient conditions to uniquely determine the minimizer. \(\square \)
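For the plain weighted IPIN of Sect. 3.2.1 (\(h(x)=x\), so \(h^{'}\equiv 1\)), and assuming \({\mathbf {\Sigma }_{j}^{*}}_{{\mathbf {w}}_{j},g}\) is non-singular, Eq. (5) reduces to the familiar membership-weighted mean. A short numpy sketch of this special case (an illustration under the stated assumptions, not the general update):

```python
import numpy as np

def centroid_update_identity_h(X, U, m):
    """Closed-form solution of Eq. (5) when h(x) = x and M_{w,g} is invertible:
    each z_j is the membership-weighted mean of the data (special case only)."""
    Um = U ** m                                    # (n, c) memberships raised to the fuzzifier
    return (Um.T @ X) / Um.sum(axis=0)[:, None]    # (c, d) centroids

X = np.random.rand(50, 4)
U = np.random.dirichlet(np.ones(3), size=50)       # rows sum to 1, as in constraint (4a)
print(centroid_update_identity_h(X, U, m=2).shape)  # (3, 4)
```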

Theorem 3

For fixed \({\mathbf {U}}^{*} \in {\mathcal {U}}_{c,n}\), \({\mathcal {Z}}^{*}=\{{\mathbf {z}}_{1}^{*},{\mathbf {z}}_{2}^{*},\ldots , {{\mathbf {z}}}_{c}^{*}\}, {\mathbf {z}}_{j}^{*} \in {\mathcal {B}}, \forall j=1,2,\ldots ,c, {\mathcal {W}}^{*}=\{{\mathbf {w}}_{1}^{*},{\mathbf {w}}_{2}^{*},\ldots , {{\mathbf {w}}}_{c}^{*}\}, {\mathbf {w}}_{j}^{*} \in {\mathcal {H}}^{d}, \forall j=1,2,\ldots ,c; {\varvec{\rho }} \in {\mathbb {R}}_{+}^{c},\) the problem \({\mathbf {P}}_{2} : \, minimize \,f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*},{\mathcal {S}},{\mathcal {W}}^{*}), {\mathcal {S}}=\{\mathbf {\Sigma }_{1},\mathbf {\Sigma }_{2},\ldots , \mathbf {\Sigma }_{c}\}, \mathbf {\Sigma }_{j} \in {\mathcal {M}}_{d,\rho _{j}}, \forall j=1,2,\ldots ,c\), has a unique solution.

Proof

Here, from the definition, it is clear that \({\mathcal {M}}_{d,\rho _{j}}\) is the inverse image of the compact set \(\{ \rho _{j}\}\) with respect to the continuous function \(\det \) (determinant function). Hence the feasible set \({\mathcal {M}}_{d,\rho _{j}}\) is a compact set. Thus, from the continuity of \(f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*},{\mathcal {S}},{\mathcal {W}}^{*})\) with respect to \(\mathbf {\Sigma }_{j}\), there exists at least one solution to the problem under consideration.

From the assumptions on the IPINCWD, we have that \(f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*},{\mathcal {S}},{\mathcal {W}}^{*})\) is a non-negative linear combination (\({\mathbf {U}}^{*}\) is not identically 0) of strictly convex functions in \(\mathbf {\Sigma }_{j}\) and hence is strictly convex with respect to \(\mathbf {\Sigma }_{j}\). Since the feasible set \({\mathcal {M}}_{d,\rho _{j}}\) itself is not convex, we proceed as follows. We consider the following relaxed optimization task:

$$\begin{aligned}&{\mathbf {EP}}_{2}\;\; { minimize} \; f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*},{\mathcal {S}},{\mathcal {W}}^{*}),\\&{\mathcal {S}}=\{\mathbf {\Sigma }_{1},\mathbf {\Sigma }_{2},\ldots , \mathbf {\Sigma }_{c}\}, \mathbf {\Sigma }_{j} \in {\mathcal {F}}_{d,\rho _{j}}, \forall j=1,2,\ldots ,c; \end{aligned}$$

where

$$\begin{aligned} {\mathcal {F}}_{d,{\rho }_{j}}=\bigg \{ {\mathbf {M}} \in {\mathcal {M}}^{d} \mid |{\mathbf {M}} |\geqslant \rho _{j}\bigg \}. \end{aligned}$$

\({\mathcal {F}}_{d,{\rho }_{j}}\) being a convex set and the objective being strictly convex, \({\mathbf {EP}}_{2}\) can have at most one solution. Now, we observe that any minimizer of the objective function in \({\mathcal {F}}_{d,{\rho }_{j}}\) has to be in \({\mathcal {M}}_{d,{\rho }_{j}}\) (if not, we can divide it by a suitable constant to obtain a minimizer in \({\mathcal {M}}_{d,{\rho }_{j}}\)). Hence, the objective function under consideration can have at most one minimizer in \({\mathcal {M}}_{d,{\rho }_{j}}\).

These two facts together guarantee that the optimization task with respect to the inner product inducing matrices (\({\mathbf {P}}_{2}\)) has a unique solution. \(\square \)

Theorem 4

Let \(J_{2} : \; {\mathcal {M}}_{d ,{\varvec{\rho }}} \rightarrow {\mathbb {R}}, J_{2}({\mathcal {S}})=f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*},{\mathcal {S}},{\mathcal {W}}^{*}); \mathbf {\Sigma }_{j} \in {\mathcal {M}}_{d,\rho _{j}}, \forall j=1,2,\ldots , c\), where \({\mathbf {U}}^{*} \in {\mathcal {U}}_{c,n},\; {\mathcal {Z}}^{*} \in {\mathcal {B}}^{c},\; {\mathcal {W}}^{*} \in {\mathcal {H}}^{d \times c}\) are fixed. Then \({\mathcal {S}}^{*}\) is a global minimum of \(J_{2}\) if and only if \(\mathbf {\Sigma }_{j}^{*}\) satisfies the following equation:

$$\begin{aligned} {{\mathbf {M}}_{j}^{*}}^{-1}(\rho _{j} |{{\mathbf {M}}_{j}^{*}}|)^{\frac{1}{d}}=\mathbf {\Sigma }_{j}^{*};\; j=1,2,\ldots ,c. \end{aligned}$$
(6)

where,

$$\begin{aligned} {{\mathbf {M}}_{j}^{*}}=\sum _{i=1}^{n}\ (u_{ij}^{*})^{m}h^{'}(({\mathbf {x}}_{i}-{\mathbf {z}}_{j}^{*})^{T} {\mathbf {\Sigma }_{j}^{*}}_{{\mathbf {w}}_{j}^{*},g} ({\mathbf {x}}_{i}-{\mathbf {z}}_{j}^{*}))\left[ ({\mathbf {x}}_{i}-{\mathbf {z}}_{j}^{*})({\mathbf {x}}_{i}-{\mathbf {z}}_{j}^{*})^{T}\right] _{{\mathbf {w}}_{j}^{*},g}. \end{aligned}$$

Proof

Follows from the proof of Theorem 2. Like Theorem 2, the condition is obtained as the first order necessary condition by setting the first derivative equal to zero. The uniqueness follows from the strict convexity. \(\square \)
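The following is a sketch of the \(\mathbf {\Sigma }_{j}\) update implied by Eq. (6), specialized to \(h^{'}\equiv 1\) and interpreting \(\left[ ({\mathbf {x}}_{i}-{\mathbf {z}}_{j})({\mathbf {x}}_{i}-{\mathbf {z}}_{j})^{T}\right] _{{\mathbf {w}}_{j},g}\) as \(\mathrm {diag}(g({\mathbf {w}}_{j}))\,({\mathbf {x}}_{i}-{\mathbf {z}}_{j})({\mathbf {x}}_{i}-{\mathbf {z}}_{j})^{T}\,\mathrm {diag}(g({\mathbf {w}}_{j}))\) by analogy with Eq. (2); both simplifications are assumptions made only for illustration:

```python
import numpy as np

def sigma_update(X, u_j, z_j, w_j, rho_j, m, g=lambda t: t ** 2):
    """Sigma_j = M_j^{-1} (rho_j |M_j|)^{1/d}, Eq. (6), with h'(x) = 1 assumed."""
    d = X.shape[1]
    G = np.diag(g(w_j))
    diffs = X - z_j                                     # (n, d) residuals x_i - z_j
    # weighted fuzzy scatter: M_j = sum_i u_ij^m G (x_i - z_j)(x_i - z_j)^T G
    M_j = (u_j[:, None, None] ** m *
           (G @ diffs[:, :, None] @ diffs[:, None, :] @ G)).sum(axis=0)
    return np.linalg.inv(M_j) * (rho_j * np.linalg.det(M_j)) ** (1.0 / d)

X = np.random.rand(60, 3)
u_j = np.random.rand(60)                                # memberships of cluster j
z_j = X.mean(axis=0)
w_j = np.array([0.5, 0.3, 0.2])
print(sigma_update(X, u_j, z_j, w_j, rho_j=1.0, m=2))
```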

Theorem 5

For fixed \({\mathcal {Z}}^{*}=\{{\mathbf {z}}_{1}^{*},{\mathbf {z}}_{2}^{*},\ldots , {\mathbf {z}}_{c}^{*}\}, {\mathbf {z}}_{j}^{*} \in {\mathcal {B}}\), \({\mathcal {W}}^{*}=\{{\mathbf {w}}_{1}^{*},{\mathbf {w}}_{2}^{*},\ldots , {{\mathbf {w}}}_{c}^{*}\}, {\mathbf {w}}_{j}^{*} \in {\mathcal {H}}^{d}, \forall j=1,2,\ldots ,c\) and \({\mathcal {S}}^{*}=\{\mathbf {\Sigma }_{1}^{*},\mathbf {\Sigma }_{2}^{*},\ldots , \mathbf {\Sigma }_{c}^{*}\}, \mathbf {\Sigma }_{j}^{*} \in {\mathcal {M}}_{d,\rho _{j}}, \forall j=1,2,\ldots ,c\), the problem \({\mathbf {P}}_{3} : \, { minimize} \; f_{m,{\varvec{\rho }},h,g}({\mathbf {U}},{\mathcal {Z}}^{*},{\mathcal {S}}^{*},{\mathcal {W}}^{*}), {\mathbf {U}} \in {\mathcal {U}}_{c,n}\), has the solution \({\mathbf {U}}^{*}\) given by the following:

For \(m=1\) (hard clustering)

$$\begin{aligned} \zeta _{i}^{*}= & {} \left\{ j \mid d_{{{\Sigma }_{j}^{*}}_{{\mathbf {w}}_{j}^{*},g}}({\mathbf {z}}_{j}^{*},{\mathbf {x}}_{i})=\min _{1 \leqslant k \leqslant c}\ d_{{{\Sigma }_{k}^{*}}_{{\mathbf {w}}_{k}^{*},g}}({\mathbf {z}}_{k}^{*},{\mathbf {x}}_{i}) \right\} ,\nonumber \\ u_{ij}^{*}= & {} {\left\{ \begin{array}{ll} 1\, { or }\, 0\, { with}\, \sum _{k \in \zeta _{i}^{*}}\ u_{ik}=1, &{} \hbox {if}\, j \in \zeta _{i}^{*} \\ 0, &{} { otherwise.}\end{array}\right. } \end{aligned}$$
(7a)

For \(m>1\) (fuzzy clustering)

$$\begin{aligned} \psi _{i}^{*}= & {} \left\{ j \mid d_{{{\Sigma }_{j}^{*}}_{{\mathbf {w}}_{j}^{*},g}}({\mathbf {z}}_{j}^{*},{\mathbf {x}}_{i})=0 \right\} ,\nonumber \\ u_{ij}^{*}= & {} {\left\{ \begin{array}{ll} \left[ \sum _{k=1}^{c}\ \left[ \frac{d_{{\mathbf {\Sigma }_{j}^{*}}_{{\mathbf {w}}_{j}^{*},g}}({\mathbf {z}}_{j}^{*},{\mathbf {x}}_{i})}{d_{{\mathbf {\Sigma }_{k}^{*}}_{{\mathbf {w}}_{k}^{*},g}}({\mathbf {z}}_{k}^{*},{\mathbf {x}}_{i})}\right] ^\frac{1}{m-1}\right] ^{-1}, &{} { if}\, \psi _{i}^{*}=\phi \\ \geqslant 0 \, { with}\, \sum _{{\mathbf {z}}_{k}^{*}={\mathbf {x}}_{i}}\ u_{ik}^{*}=1, &{} { if}\, j \in \psi _{i}^{*} \\ 0. &{} { if}\, j \notin \psi _{i}^{*} \,{ and}\, \psi _{i}^{*}\ne \phi .\end{array}\right. } \end{aligned}$$
(7b)

Proof

As far as the k-means algorithm is concerned, the proof follows from the concave nature of the problem with respect to the membership matrix, which ensures that the minimum is realized at a boundary point of \({\mathcal {U}}_{c,n}\) (Selim and Ismail 1984).

In the case of FCM, the membership matrix updating rule follows from the techniques employed to obtain the membership updating rule corresponding to conventional FCM with squared Euclidean distance (Bezdek 1981). \(\square \)
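Given a precomputed dissimilarity matrix D with entries \(D[i,j]=d_{{\mathbf {\Sigma }_{j}^{*}}_{{\mathbf {w}}_{j}^{*},g}}({\mathbf {z}}_{j}^{*},{\mathbf {x}}_{i})\), the membership update of Theorem 5 can be sketched as follows; ties and the singular case \(D[i,j]=0\) are resolved by simple conventions (first minimizer, equal mass on the zero-distance clusters), which is one of the choices permitted by (7a) and (7b):

```python
import numpy as np

def update_memberships(D, m):
    """Membership update of Theorem 5 from a dissimilarity matrix D of shape (n, c)."""
    n, c = D.shape
    if m == 1:                                     # hard clustering, Eq. (7a)
        return np.eye(c)[D.argmin(axis=1)]         # all mass on (one) nearest cluster
    U = np.zeros((n, c))                           # fuzzy clustering, Eq. (7b)
    for i in range(n):
        zero = D[i] == 0.0
        if zero.any():                             # singular case: psi_i is non-empty
            U[i, zero] = 1.0 / zero.sum()
        else:
            ratio = (D[i][:, None] / D[i][None, :]) ** (1.0 / (m - 1))
            U[i] = 1.0 / ratio.sum(axis=1)
    return U

D = np.random.rand(10, 3) + 0.1
print(update_memberships(D, m=2).sum(axis=1))      # each row sums to 1
```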

Theorem 6

For fixed \({\mathbf {U}}^{*} \in {\mathcal {U}}_{c,n}\), \({\mathcal {Z}}^{*}=\{{\mathbf {z}}_{1}^{*},{\mathbf {z}}_{2}^{*},\ldots , {{\mathbf {z}}}_{c}^{*}\}, {\mathbf {z}}_{j}^{*} \in {\mathcal {B}}, \forall j=1,2,\ldots ,c, {\mathcal {S}}^{*}=\{\mathbf {\Sigma }_{1}^{*},\mathbf {\Sigma }_{2}^{*},\ldots , {\mathbf {\Sigma }}_{c}^{*}\}, \mathbf {\Sigma }_{j}^{*} \in {\mathcal {M}}_{d,\rho _{j}}, \forall j=1,2,\ldots ,c,\) the problem \({\mathbf {P}}_{4} : \; { minimize}\, f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*},{\mathcal {S}}^{*},{\mathcal {W}}), {\mathcal {W}}=\{{\mathbf {w}}_{1},{\mathbf {w}}_{2},\ldots , {\mathbf {w}}_{c}\}, {\mathbf {w}}_{j} \in {\mathcal {H}}^{d}, \forall j=1,2,\ldots ,c\), has a unique solution.

Proof

Follows along the lines of the proof of Theorem 1. As in Theorem 1, the problem at hand is a strictly convex optimization over a convex set and hence has at most one solution. Moreover, the objective, being a continuous function optimized over a compact set, attains its extreme values. Hence, the existence and uniqueness of the solution follow. \(\square \)

Theorem 7

Let \(J_{3} : \; {\mathcal {H}}^{d \times c} \rightarrow {\mathbb {R}}, J_{3}({\mathcal {W}})=f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*},{\mathcal {S}}^{*},{\mathcal {W}}); {\mathbf {w}}_{j} \in {\mathcal {H}}^{d}, \forall j=1,2,\ldots , c\), where \({\mathbf {U}}^{*} \in {\mathcal {U}}_{c,n},\; {\mathcal {Z}}^{*} \in {\mathcal {B}}^{c},\; {\mathcal {S}}^{*} \in {\mathcal {M}}_{d,{\varvec{\rho }}}\) are fixed. Then \({\mathcal {W}}^{*}\) is a global minimum of \(J_{3}\) if and only if \({\mathbf {w}}_{j}^{*}\) satisfies the following equation

$$\begin{aligned}&{\mathbf {w}}_{j}^{*}={{\mathbf {v}}_{j}^{*}}^{2}= \left( {v_{j1}^{*}}^{2},{v_{j2}^{*}}^{2},\ldots ,{v_{jd}^{*}}^{2}\right) \quad \forall j=1,2,\ldots ,c. \end{aligned}$$
(8a)
$$\begin{aligned}&\sum _{i=1}^{n} u_{ij}^{m}\frac{\partial }{\partial {\mathbf {y}}_{j}} d_{{\mathbf {\Sigma }_{j}^{*}}_{{\mathbf {y}}_{j}^{2},g}}({\mathbf {z}}_{j}^{*},{\mathbf {x}}_{i})|_{{\mathbf {y}}_{j}={\mathbf {v}}_{j}^{*}}=-L\frac{\partial }{\partial {\mathbf {y}}_{j}}{\mathbf {y}}_{j}^{T}{\mathbf {y}}_{j}|_{{\mathbf {y}}_{j}={\mathbf {v}}_{j}^{*}}, \quad \forall j=1,2,\ldots ,c. \end{aligned}$$
(8b)
$$\begin{aligned}&{{\mathbf {v}}_{j}^{*}}^{T}{\mathbf {v}}_{j}^{*}=\rho _{j}\quad \forall j=1,2,\ldots ,c. \end{aligned}$$
(8c)

where L is any constant.

Proof

Follows from the proofs of Theorems 2 and 5 by replacing the non-negative weights with squares of unconstrained real numbers and using the Lagrange multiplier technique with the linear constraints on the sum of the weights. \(\square \)
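Since the stationarity system (8) has, in general, no closed-form solution for an arbitrary g, one simple way to carry out the weight sub-step numerically is to minimize \(J_{3}\) over each \({\mathbf {w}}_{j}\) directly on the simplex. The sketch below does this with scipy's SLSQP solver for the plain weighted IPIN; it is meant only as an illustration of the sub-problem, not as the update rule analyzed in this paper:

```python
import numpy as np
from scipy.optimize import minimize

def update_weights_j(X, u_j, z_j, Sigma_j, m, g=lambda t: t ** 2):
    """Numerically minimize sum_i u_ij^m (x_i - z_j)^T M_{w,g} (x_i - z_j) over w in H^d."""
    d = X.shape[1]
    diffs = X - z_j

    def objective(w):
        G = np.diag(g(w))
        Mw = G @ Sigma_j @ G                              # M_{w,g} as in Eq. (2)
        return float((u_j ** m * np.einsum('nd,de,ne->n', diffs, Mw, diffs)).sum())

    res = minimize(objective, x0=np.full(d, 1.0 / d), method='SLSQP',
                   bounds=[(0.0, 1.0)] * d,
                   constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1.0}])
    return res.x

X = np.random.rand(40, 3)
u_j = np.random.rand(40)
w_j = update_weights_j(X, u_j, X.mean(axis=0), np.eye(3), m=2)
print(w_j, w_j.sum())                                     # weights lie on the simplex
```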

Our proof of convergence for the proposed feature-weighted clustering algorithms is based on Zangwill’s global convergence theorem (Zangwill 1969). From the aforementioned theorems, the updating rules corresponding to the membership matrix, the cluster representatives, the matrices of the inner product inducing norms, and the feature weights corresponding to each of the clusters can be interpreted with the help of the following operators.

Definition 2

\(T_{memb} : {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c} \rightarrow {\mathcal {U}}_{c,n};\, \, \, T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}})={\mathbf {U}}=[u_{ij}]\), which is given by the following rule: For \(m=1\) (hard clustering)

$$\begin{aligned} \zeta _{i}= & {} \left\{ j \mid d_{{\mathbf {\Sigma }_{j}}_{{\mathbf {w}}_{j},g}}({\mathbf {z}}_{j},{\mathbf {x}}_{i})=\min _{1 \leqslant k \leqslant c}\ d_{{\mathbf {\Sigma }_{k}}_{{\mathbf {w}}_{k},g}}({\mathbf {z}}_{k},{\mathbf {x}}_{i}) \right\} ,\nonumber \\ u_{ij}= & {} {\left\{ \begin{array}{ll} 1\, { or }\, 0\, { with}\, \sum \nolimits _{k \in \zeta _{i}}\ u_{ik}=1, &{} { if}\, j \in \zeta _{i} \\ 0, &{} { otherwise.} \end{array}\right. } \end{aligned}$$
(9a)

For \(m>1\) (fuzzy clustering)

$$\begin{aligned} \psi _{i}= & {} \left\{ j \mid d_{{\mathbf {\Sigma }_{j}}_{{\mathbf {w}}_{j},g}}({\mathbf {z}}_{j},{\mathbf {x}}_{i})=0 \right\} ,\nonumber \\ u_{ij}= & {} {\left\{ \begin{array}{ll} \left[ \sum \nolimits _{k=1}^{c}\ \left[ \frac{d_{{\mathbf {\Sigma }_{j}}_{{\mathbf {w}}_{j},g}}({\mathbf {z}}_{j},{\mathbf {x}}_{i})}{d_{{\mathbf {\Sigma }_{k}}_{{\mathbf {w}}_{k},g}}({\mathbf {z}}_{k},{\mathbf {x}}_{i})}\right] ^\frac{1}{m-1}\right] ^{-1}, &{} { if}\, \psi _{i}=\phi \\ \geqslant 0 \, { with}\, \sum \limits _{{\mathbf {z}}_{k}={\mathbf {x}}_{i}}\ u_{ik}=1, &{} { if}\, j \in \psi _{i} \\ 0. &{} { if}\, j \notin \psi _{i} \,{ and}\, \psi _{i}\ne \phi .\end{array}\right. } \end{aligned}$$
(9b)

Definition 3

$$\begin{aligned} T_{cent} : {\mathcal {U}}_{c,n} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c} \rightarrow {\mathcal {B}}^{c};\,\,T_{cent} ({\mathbf {U}},{\mathcal {S}},{\mathcal {W}})={\mathcal {Z}}=( {\mathbf {z}}_{1},{\mathbf {z}}_{2},\ldots ,{\mathbf {z}}_{c}), \end{aligned}$$

where the vectors \(({\mathbf {z}}_{1},{\mathbf {z}}_{2},\ldots ,{\mathbf {z}}_{c}), {\mathbf {z}}_{j} \in {\mathcal {B}}, j=1,2,\ldots ,c\) are calculated such that they maintain the following condition

$$\begin{aligned} \sum _{i=1}^{n} (u_{ij})^{m} h^{'} \left[ ({\mathbf {x}}_{i}-{\mathbf {z}}_{j})^{T}{\mathbf {\Sigma }_{j}}_{{\mathbf {w}}_{j},g}({\mathbf {x}}_{i}-{\mathbf {z}}_{j}) \right] {\mathbf {\Sigma }_{j}}_{{\mathbf {w}}_{j},g}({\mathbf {x}}_{i}-{\mathbf {z}}_{j})=0;\quad \forall j=1,2,\ldots ,c. \end{aligned}$$
(10)

Definition 4

$$\begin{aligned} T_{matrix} : {\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {H}}^{d \times c} \rightarrow {\mathcal {M}}_{d,{\varvec{\rho }}}; \,\, T_{matrix}({\mathbf {U}},{\mathcal {Z}},{\mathcal {W}})={\mathcal {S}}=(\mathbf {\Sigma }_{1},\mathbf {\Sigma }_{2},\ldots , \mathbf {\Sigma }_{c}), \end{aligned}$$

where the matrices \((\mathbf {\Sigma }_{1},\mathbf {\Sigma }_{2},\ldots , \mathbf {\Sigma }_{c}), \mathbf {\Sigma }_{j} \in {\mathcal {M}}_{d,\rho _{j}}, j=1,2,\ldots , c\) are calculated such that they maintain the following condition:

$$\begin{aligned} {\mathbf {M}}_{j}^{-1}(\rho _{j} |{\mathbf {M}}_{j}|)^{\frac{1}{d}}=\mathbf {\Sigma }_{j};\quad j=1,2,\ldots ,c; \end{aligned}$$
(11)

where,

$$\begin{aligned} {\mathbf {M}}_{j}=\sum _{i=1}^{n}\ (u_{ij})^{m}h^{'}(({\mathbf {x}}_{i}-{\mathbf {z}}_{j})^{T} {\mathbf {\Sigma }_{j}}_{{\mathbf {w}}_{j},g} ({\mathbf {x}}_{i}-{\mathbf {z}}_{j}))\left[ ({\mathbf {x}}_{i}-{\mathbf {z}}_{j})({\mathbf {x}}_{i}-{\mathbf {z}}_{j})^{T}\right] _{{\mathbf {w}}_{j},g}. \end{aligned}$$

Definition 5

$$\begin{aligned} T_{weight} : {\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \rightarrow {\mathcal {H}}^{d \times c};\,\,T_{weight} ({\mathbf {U}},{\mathcal {Z}},{\mathcal {S}})={\mathcal {W}}=( {\mathbf {w}}_{1},{\mathbf {w}}_{2},\ldots ,{\mathbf {w}}_{c}), \end{aligned}$$

where the vectors \(({\mathbf {w}}_{1},{\mathbf {w}}_{2},\ldots ,{\mathbf {w}}_{c}), {\mathbf {w}}_{j} \in {\mathcal {H}}^{d}, j=1,2,\ldots ,c\) are calculated such that they maintain the following condition:

$$\begin{aligned}&{\mathbf {w}}_{j}={\mathbf {v}}_{j}^{2}=\left( v_{j1}^{2},v_{j2}^{2},\ldots ,v_{jd}^{2}\right) \quad \forall j=1,2,\ldots ,c. \end{aligned}$$
(12a)
$$\begin{aligned}&\sum _{i=1}^{n} u_{ij}^{m} \frac{\partial }{\partial {\mathbf {y}}_{j}} d_{{\mathbf {\Sigma }_{j}}_{{\mathbf {y}}_{j}^{2},g}}({\mathbf {z}}_{j},{\mathbf {x}}_{i})|_{{\mathbf {y}}_{j}={\mathbf {v}}_{j}}=-L\frac{\partial }{\partial {\mathbf {y}}_{j}} {\mathbf {y}}_{j}^{T}{\mathbf {y}}_{j}|_{{\mathbf {y}}_{j}={\mathbf {v}}_{j}}, \quad \forall j=1,2,\ldots ,c.\end{aligned}$$
(12b)
$$\begin{aligned}&{\mathbf {v}}_{j}^{T}{\mathbf {v}}_{j}=\rho _{j}\quad \forall j=1,2,\ldots ,c. \end{aligned}$$
(12c)

where L is any constant.

With the help of these newly defined operators, which provide the updating rules corresponding to the membership matrix, the cluster representatives, the matrices of IPINs, and the feature weights corresponding to each of the clusters, the automated feature-weighted clustering algorithm with IPINCWD (Algorithm 1) is restated in Algorithm 2.

With the help of these newly defined operators, the clustering operator can be presented as follows:

Definition 6

$$\begin{aligned}&J : \; ({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}) \rightarrow ({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c});\,\\&J=O_{weight} \circ O_{matrix} \circ O_{cent} \circ O_{memb}, \end{aligned}$$

where

$$\begin{aligned}&O_{memb} : \, ({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}) \rightarrow ({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}),\\&O_{memb} ({\mathbf {U}},{\mathcal {Z}}, {\mathcal {S}},{\mathcal {W}})=(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}).\\&O_{cent} : \; ({\mathcal {U}}_{c,n} \times {\mathcal {M}}_{d,{\varvec{\rho }}}\times {\mathcal {H}}^{d \times c}) \rightarrow ({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c}\times {\mathcal {H}}^{d \times c}), \\&O_{cent}({\mathbf {U}},{\mathcal {S}},{\mathcal {W}})=({\mathbf {U}},T_{cent}({\mathbf {U}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {W}}).\\&O_{matrix} : \; ({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c}\times {\mathcal {H}}^{d \times c}) \rightarrow ({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}}),\\&O_{matrix}({\mathbf {U}},{\mathcal {Z}},{\mathcal {W}})=({\mathbf {U}},{\mathcal {Z}},T_{matrix}({\mathbf {U}},{\mathcal {Z}},{\mathcal {W}})).\\&O_{weight} : \; ({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}}) \rightarrow ({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}) ,\, \\&O_{weight}({\mathbf {U}},{\mathcal {Z}},{\mathcal {S}})=({\mathbf {U}},{\mathcal {Z}},{\mathcal {S}},T_{weight}({\mathbf {U}},{\mathcal {Z}},{\mathcal {S}})). \end{aligned}$$
[Algorithm 2: The IPINCWD-based clustering algorithm of Algorithm 1 restated in terms of the operators \(T_{memb}\), \(T_{cent}\), \(T_{matrix}\), and \(T_{weight}\)]

Expanding the composition, for any \(({\mathbf {U}},{\mathcal {Z}},{\mathcal {S}},{\mathcal {W}})\), we have
$$\begin{aligned}&J({\mathbf {U}},{\mathcal {Z}},{\mathcal {S}},{\mathcal {W}})=O_{weight} \circ O_{matrix} \circ O_{cent} (T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}})\\&\quad =O_{weight} \circ O_{matrix}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),{\mathcal {W}})\\&\quad =O_{weight}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),\\&\quad \quad T_{matrix}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),{\mathcal {W}}))\\&\quad =(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),\\&\quad T_{matrix}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),{\mathcal {W}}),\\&\quad T_{weight}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),\\&\quad T_{matrix}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),{\mathcal {W}}))) \end{aligned}$$

In what follows we prove the convergence of the iterative algorithm (Algorithm 2) using Zangwill’s global convergence theorem (Zangwill 1969).

Lemma 1

\({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}\) is compact.

Proof

\({\mathcal {U}}_{c,n}\), being the closure of a set, is closed (Bezdek 1981), which, together with its boundedness, implies that it is a compact set. \({\mathcal {B}}\), being the convex hull of finitely many points, is compact, which ensures the compactness of \({\mathcal {B}}^{c}\). In Theorem 3 the compactness of \({\mathcal {M}}_{d,\rho _{j}}, \forall j=1,2,\ldots ,c,\) was proved, which in turn implies the compactness of \({\mathcal {M}}_{d,{\varvec{\rho }}}\). The simplex is a compact set by definition. Hence, the set under consideration, being a product of four compact sets, is again compact. \(\square \)

We define the set of optimal points, from the perspective of the IPINCWD-based automated feature-weighted clustering algorithm, in the following way.

Definition 7

\({\mathcal {T}}\) is a subset of \({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}\), where

$$\begin{aligned} {\mathcal {T}}= \left\{ ({\mathbf {U}}^{*},{\mathcal {Z}}^{*}, {\mathcal {S}}^{*},{\mathcal {W}}^{*}) \in {\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}}\times {\mathcal {H}}^{d \times c} \left| \begin{array}{l} f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*}, {\mathcal {S}}^{*},{\mathcal {W}}^{*}) \leqslant f_{m,{\varvec{\rho }},h,g}({\mathbf {U}},{\mathcal {Z}}^{*}, {\mathcal {S}}^{*},{\mathcal {W}}^{*}), \forall {\mathbf {U}} \in {\mathcal {U}}_{c,n}, {\mathbf {U}}\ne {\mathbf {U}}^{*},\\ f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*}, {\mathcal {S}}^{*},{\mathcal {W}}^{*})< f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}, {\mathcal {S}}^{*},{\mathcal {W}}^{*}),\forall {\mathcal {Z}} \in {\mathcal {B}}^{c}, {\mathcal {Z}}\ne {\mathcal {Z}}^{*},\\ f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*}, {\mathcal {S}}^{*},{\mathcal {W}}^{*})< f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*}, {\mathcal {S}},{\mathcal {W}}^{*}),\forall {\mathcal {S}} \in {\mathcal {M}}_{d,{\varvec{\rho }}}, {\mathcal {S}}\ne {\mathcal {S}}^{*},\\ f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*}, {\mathcal {S}}^{*},{\mathcal {W}}^{*}) < f_{m,{\varvec{\rho }},h,g}({\mathbf {U}}^{*},{\mathcal {Z}}^{*}, {\mathcal {S}}^{*},{\mathcal {W}}),\forall {\mathcal {W}} \in {\mathcal {H}}^{d \times c}, {\mathcal {W}}\ne {\mathcal {W}}^{*}. \end{array} \right. \right\} \end{aligned}$$
(13)

Lemma 2

The set defined in (13) satisfies the following two conditions:

  1. If \({\mathbf {g}} \notin {\mathcal {T}}\), then \(f_{m,{\varvec{\rho }},h,g}({\mathbf {g}}^{*})<f_{m,{\varvec{\rho }},h,g} ({\mathbf {g}}), \forall {\mathbf {g}}^{*} \in J({\mathbf {g}}), {\mathbf {g}},{\mathbf {g}}^{*} \in {\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}\).

  2. If \({\mathbf {g}} \in {\mathcal {T}}\), then \(f_{m,{\varvec{\rho }},h,g}({\mathbf {g}}^{*}) \leqslant f_{m,{\varvec{\rho }},h,g} ({\mathbf {g}}), \forall {\mathbf {g}}^{*} \in J({\mathbf {g}}), {\mathbf {g}},{\mathbf {g}}^{*} \in {\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}\).

Proof

For any point \(({\mathbf {U}},{\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}) \in {\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}\), the following relation holds in general:

$$\begin{aligned}&f_{m,{\varvec{\rho }},h,g}(J({\mathbf {U}},{\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}))\\&\quad =f_{m,{\varvec{\rho }},h,g}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),\\&\qquad T_{matrix}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),{\mathcal {W}}),\\&\qquad T_{weight}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),\\&\qquad T_{matrix}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),{\mathcal {W}})))\\&\quad \leqslant f_{m,{\varvec{\rho }},h,g}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),\\&\quad \quad T_{matrix}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),{\mathcal {W}}),{\mathcal {W}})\\&\quad \leqslant f_{m,{\varvec{\rho }},h,g}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),T_{cent}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}}),{\mathcal {S}},{\mathcal {W}})\\&\quad \leqslant f_{m,{\varvec{\rho }},h,g}(T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),{\mathcal {Z}},{\mathcal {S}},{\mathcal {W}})\\&\quad \leqslant f_{m,{\varvec{\rho }},h,g}({\mathbf {U}},{\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}). \end{aligned}$$

Equality holds if and only if the following conditions are satisfied:

$$\begin{aligned} {\mathbf {U}}\in & {} T_{memb}({\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}),\\ {\mathcal {Z}}= & {} T_{cent}({\mathbf {U}},{\mathcal {S}},{\mathcal {W}}),\\ {\mathcal {S}}= & {} T_{matrix}({\mathbf {U}},{\mathcal {Z}},{\mathcal {W}}),\\ {\mathcal {W}}= & {} T_{weight}({\mathbf {U}},{\mathcal {Z}},{\mathcal {S}}). \end{aligned}$$

which implies that, \(({\mathbf {U}},{\mathcal {Z}},{\mathcal {S}},{\mathcal {W}}) \in {\mathcal {T}}\). Hence, the lemma follows. \(\square \)

Lemma 3

The map \(T_{memb}\) is closed at \(({\mathcal {Z}}^{R_{1}^{*}},{\mathcal {S}}^{R_{1}^{*}},{\mathcal {W}}^{R_{1}^{*}})\) if \(({\mathbf {U}},{\mathcal {Z}}^{R_{1}^{*}},{\mathcal {S}}^{R_{1}^{*}},{\mathcal {W}}^{R_{1}^{*}}) \notin {\mathcal {T}}\) for some \({\mathbf {U}} \in {\mathcal {U}}_{c,n}\).

Proof

We have to prove the following: for every sequence \( \{({\mathcal {Z}}^{R_{1}(t)},{\mathcal {S}}^{R_{1}(t)},{\mathcal {W}}^{R_{1}(t)})\}_{t=0}^{\infty } \big \{\in {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}\big \}\) converging to \(({\mathcal {Z}}^{R_{1}^{*}},{\mathcal {S}}^{R_{1}^{*}},{\mathcal {W}}^{R_{1}^{*}})\) and every sequence \(\{ {\mathbf {U}}^{R_{2}(t)}\}_{t=0}^{\infty } \left[ \in T_{memb}({\mathcal {Z}}^{R_{1}^{(t)}},{\mathcal {S}}^{R_{1}(t)},{\mathcal {W}}^{R_{1}^{(t)}})\right] \) converging to \({\mathbf {U}}^{R_{2}^{*}}\), we have \({\mathbf {U}}^{R_{2}^{*}} \in T_{memb}({\mathcal {Z}}^{R_{1}^{*}},{\mathcal {S}}^{R_{1}^{*}},{\mathcal {W}}^{R_{1}^{*}})\).

We shall show that the closedness property holds individually for the membership vector corresponding to each of the patterns \({\mathbf {x}}_{i},\forall i=1,2,\ldots ,n\).

For hard clustering i.e. \(m=1\), we define the following:

$$\begin{aligned} E_{i}^{(t)}= & {} \Bigg \{ j \left| dist_{g}\left( \mathbf {\Sigma }_{j}^{R_{1}^{(t)}},{\mathbf {z}}_{j}^{R_{1}^{(t)}},{\mathbf {x}}_{i},{\mathbf {w}}_{j}^{R_{1}^{(t)}}\right) \right. =\min _{1 \leqslant k\leqslant c} dist_{g}\left( \mathbf {\Sigma }_{k}^{R_{1}^{(t)}},{\mathbf {z}}_{k}^{R_{1}^{(t)}},{\mathbf {x}}_{i},{\mathbf {w}}_{k}^{R_{1}^{(t)}}\right) \Bigg \},\\ E_{i}^{*}= & {} \Bigg \{ j \left| dist_{g}\left( \mathbf {\Sigma }_{j}^{R_{1}^{*}},{\mathbf {z}}_{j}^{R_{1}^{*}},{\mathbf {x}}_{i},{\mathbf {w}}_{j}^{R_{1}^{*}}\right) \right. =\min _{1 \leqslant k\leqslant c} dist_{g}\left( \mathbf {\Sigma }_{k}^{R_{1}^{*}},{\mathbf {z}}_{k}^{R_{1}^{*}},{\mathbf {x}}_{i},{\mathbf {w}}_{k}^{R_{1}^{*}}\right) \Bigg \}. \end{aligned}$$

Using the convergence of \(\{({\mathbf {z}}_{j}^{R_{1}^{(t)}},\mathbf {\Sigma }_{j}^{R_{1}^{(t)}},{\mathbf {w}}_{j}^{R_{1}^{(t)}})\}_{t=0}^{\infty }\) to \(({\mathbf {z}}_{j}^{R_{1}^{*}},\mathbf {\Sigma }_{j}^{R_{1}^{*}},{\mathbf {w}}_{j}^{R_{1}^{*}})\) and the continuity of \(dist_{g}\), we can find \(M_{hard}\) such that, \(\forall t>M_{hard}\), \(\max _{j \in E_{i}^{*}}dist_{g}(\mathbf {\Sigma }_{j}^{R_{1}^{(t)}},{\mathbf {z}}_{j}^{R_{1}^{(t)}},{\mathbf {x}}_{i},{\mathbf {w}}_{j}^{R_{1}^{(t)}})< \min _{j \notin E_{i}^{*}} \ dist_{g}(\mathbf {\Sigma }_{j}^{R_{1}^{(t)}},{\mathbf {z}}_{j}^{R_{1}^{(t)}},{\mathbf {x}}_{i},{\mathbf {w}}_{j}^{R_{1}^{(t)}})\), which proves the claim for the hard clustering case.

For fuzzy clustering, i.e. \(m>1\), we define the following:

$$\begin{aligned} \Psi _{i}^{(t)}= & {} \Bigg \{ j \left| dist_{g}\left( \mathbf {\Sigma }_{j}^{R_{1}^{(t)}},{\mathbf {z}}_{j}^{R_{1}^{(t)}},{\mathbf {x}}_{i},{\mathbf {w}}_{j}^{R_{1}^{(t)}}\right) \right. =0 \Bigg \},\\ \Psi _{i}^{*}= & {} \Bigg \{ j \left| dist_{g}\left( \mathbf {\Sigma }_{j}^{R_{1}^{*}},{\mathbf {z}}_{j}^{R_{1}^{*}},{\mathbf {x}}_{i},{\mathbf {w}}_{j}^{R_{1}^{*}}\right) \right. =0 \Bigg \}. \end{aligned}$$

If \(|\Psi _{i}^{*}|=0\), using the convergence of \(\{({\mathbf {z}}_{j}^{R_{1}^{(t)}},\mathbf {\Sigma }_{j}^{R_{1}^{(t)}},{\mathbf {w}}_{j}^{R_{1}^{(t)}})\}_{t=0}^{\infty }\) to \(({\mathbf {z}}_{j}^{R_{1}^{*}},\mathbf {\Sigma }_{j}^{R_{1}^{*}},{\mathbf {w}}_{j}^{R_{1}^{*}})\) and the continuity of the dissimilarity measure, we can find \(M_{fuz1}\) such that, \(\forall t>M_{fuz1}\), \(|\Psi _{i}^{(t)}|=0\), which implies the lemma. If \(|\Psi _{i}^{*}|>0\), then for any \(c>1\) we can find a constant \(\kappa >0\) and \(M_{fuz2}\) such that, \(\forall t>M_{fuz2}\), \(\max _{j \in \Psi _{i}^{*}}dist_{g}(\mathbf {\Sigma }_{j}^{R_{1}^{(t)}},{\mathbf {z}}_{j}^{R_{1}^{(t)}},{\mathbf {x}}_{i},{\mathbf {w}}_{j}^{R_{1}^{(t)}})< \kappa \min _{j \notin \Psi _{i}^{*}} \ dist_{g}(\mathbf {\Sigma }_{j}^{R_{1}^{(t)}},{\mathbf {z}}_{j}^{R_{1}^{(t)}},{\mathbf {x}}_{i},{\mathbf {w}}_{j}^{R_{1}^{(t)}})\), which implies the lemma. \(\square \)

Lemma 4

\(O_{cent}\) is a continuous function on \({\mathcal {U}}_{c,n} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}\).

Proof

$$\begin{aligned} T_{cent}= \left( T_{cent}^{1},T_{cent}^{2},\ldots , T_{cent}^{c}\right) : \, {\mathcal {U}}_{c,n} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c} \rightarrow {\mathcal {B}}^{c}, \end{aligned}$$

where \(T_{cent}^{j}({\mathbf {U}},{\mathcal {S}},{\mathcal {W}})={\mathbf {z}}_{j}\), \(j=1,2,\ldots , c\), is such that

$$\begin{aligned} \sum _{i=1}^{n}\ (u_{ij})^{m}\frac{\partial }{\partial {\mathbf {z}}_{j}}d_{{\mathbf {\Sigma }_{j}}_{{\mathbf {w}}_{j},g}}({\mathbf {z}}_{j},{\mathbf {x}}_{i})=0. \end{aligned}$$

In order to show the continuity of \(T_{cent}^{j}\), we proceed as follows. We define a function \(B_{j}\) in the following way:

$$\begin{aligned} B_{j}\,:\, {\mathcal {U}}_{c,n} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c} \times {\mathcal {B}} \rightarrow {\mathbb {R}},\quad j=1,2,\ldots ,c;\\ B_{j}({\mathbf {U}},{\mathcal {S}},{\mathcal {W}},{\mathbf {z}}_{j})=\sum _{i=1}^{n}\ (u_{ij})^{m}\frac{\partial }{\partial {\mathbf {z}}_{j}}d_{{\mathbf {\Sigma }_{j}}_{{\mathbf {w}}_{j},g}}({\mathbf {z}}_{j},{\mathbf {x}}_{i}). \end{aligned}$$

From the very definition of \(T_{cent}^{j}\), the set of zeroes of \(B_{j}\) can be written in the following form:

$$\begin{aligned} {\mathcal {L}}_{j}= \bigg \{ ({\mathbf {U}},{\mathcal {S}},{\mathcal {W}},T_{cent}^{j}({\mathbf {U}},{\mathcal {S}},{\mathcal {W}}))\bigg \}. \end{aligned}$$

As \(B_{j}\) is a continuous real-valued function, the set of zeroes of \(B_{j}\) is a closed set. From the very form of \({\mathcal {L}}_{j}\), we see that this set is also the graph of the function \(T_{cent}^{j}\). We now apply the closed graph theorem (Fitzpatrick 2006) to prove the continuity of \(T_{cent}^{j}\).

Closed graph theorem (Munkres 2000, p. 171) Define the graph of a function \(T : {\mathcal {P}} \rightarrow {\mathcal {Y}}\) to be the set \(\{ (x,y) \in {\mathcal {P}} \times {\mathcal {Y}} \mid T(x)=y\}\). If \({\mathcal {P}}\) is a topological space and \({\mathcal {Y}}\) is a compact Hausdorff space, then the graph of T is closed if and only if T is continuous.

Using the fact that \({\mathcal {B}}\) is a compact Hausdorff space, it follows from the closed graph theorem that each \(T_{cent}^{j}\), and hence \(T_{cent}\), is continuous. \(\square \)

Lemma 5

\(O_{matrix}\) is a continuous function on \({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {H}}^{d \times c}\).

Proof

Follows from the proof of Lemma 4. \(\square \)

Lemma 6

\(O_{weight}\) is a continuous function on \({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}}\).

Proof

Follows from the proof of Lemma 4. \(\square \)

Lemma 7

The map J is closed at \(({\mathbf {U}}^{R_{2}},{\mathcal {Z}}^{R_{2}},{\mathcal {S}}^{R_{2}},{\mathcal {W}}^{R_{2}})\) if \(({\mathbf {U}}^{R_{2}},{\mathcal {Z}}^{R_{2}},{\mathcal {S}}^{R_{2}},{\mathcal {W}}^{R_{2}}) \notin {\mathcal {T}}\).

Proof

From the very definition of J, we have the following:

$$\begin{aligned} J=O_{weight} \circ O_{matrix} \circ O_{cent} \circ O_{memb}. \end{aligned}$$

Now, \(O_{memb}\) is closed at \(({\mathbf {U}}^{R_{2}},{\mathcal {Z}}^{R_{2}},{\mathcal {S}}^{R_{2}},{\mathcal {W}}^{R_{2}})\) if \(({\mathbf {U}}^{R_{2}},{\mathcal {Z}}^{R_{2}},{\mathcal {S}}^{R_{2}},{\mathcal {W}}^{R_{2}}) \notin {\mathcal {T}}\) (Lemma 3), and \(O_{cent}\), \(O_{matrix}\), and \(O_{weight}\) are continuous on their respective domains (Lemmas 4, 5, and 6 respectively). The composition of a map that is closed at a particular point with continuous maps is again closed at that point. Hence, J is closed at \(({\mathbf {U}}^{R_{2}},{\mathcal {Z}}^{R_{2}},{\mathcal {S}}^{R_{2}},{\mathcal {W}}^{R_{2}})\) if \(({\mathbf {U}}^{R_{2}},{\mathcal {Z}}^{R_{2}},{\mathcal {S}}^{R_{2}},{\mathcal {W}}^{R_{2}}) \notin {\mathcal {T}}\). \(\square \)

Theorem 8

\(\forall ({\mathcal {Z}}^{(0)},{\mathcal {S}}^{(0)},{\mathcal {W}}^{(0)}) \in {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}\), the sequence \(\{ J^{(t)}(T_{memb}({\mathcal {Z}}^{(0)},{\mathcal {S}}^{(0)},{\mathcal {W}}^{(0)}), {\mathcal {Z}}^{(0)}, {\mathcal {S}}^{(0)},{\mathcal {W}}^{(0)})\}_{t=1}^{\infty }\) either terminates at a point in \({\mathcal {T}}\) [as defined in (13)] or has a subsequence that converges to a point in \({\mathcal {T}}\).

Proof

We begin by restating Zangwill’s global convergence theorem, from which the present proof is derived.

Zangwill’s global convergence theorem (Zangwill 1969) Let \({\mathcal {R}}\) be a set of minimizers of a continuous objective function \(\Omega \) on \({\mathcal {E}}\). Let \(A:\, {\mathcal {E}} \rightarrow {\mathcal {E}}\) be a point-to-set map which determines an algorithm that, given a point \({\mathbf {s}}_{0} \in {\mathcal {E}}\), generates a sequence \(\{ {\mathbf {s}}_t\}_{t=0}^{\infty }\) through the iteration \({\mathbf {s}}_{t+1} \in A({\mathbf {s}}_{t})\). We further assume:

  1. The sequence \(\{ {\mathbf {s}}_t\}_{t=0}^{\infty } \subseteq {\mathcal {C}} \subseteq {\mathcal {E}}\), where \({\mathcal {C}}\) is a compact set.

  2. The continuous objective function \(\Omega \) on \({\mathcal {E}}\) satisfies the following:

     (a) if \({\mathbf {s}} \notin {\mathcal {R}}\), then \(\Omega ({\mathbf {s}}^{'})<\Omega ({\mathbf {s}}), \forall {\mathbf {s}}^{'} \in A({\mathbf {s}})\);

     (b) if \({\mathbf {s}} \in {\mathcal {R}}\), then \(\Omega ({\mathbf {s}}^{'})\leqslant \Omega ({\mathbf {s}}), \forall {\mathbf {s}}^{'} \in A({\mathbf {s}})\).

  3. The map A is closed at \({\mathbf {s}}\) if \({\mathbf {s}} \notin {\mathcal {R}}\) (if A is actually a point-to-point map instead of a point-to-set map, condition 3 reduces to the continuity of A).

Then the limit of any convergent subsequence of \(\{ {\mathbf {s}}_t\}_{t=0}^{\infty }\) is in \({\mathcal {R}}\).

We take A to be J; \({\mathcal {E}}\) to be \({\mathcal {U}}_{c,n} \times {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}\); \({\mathbf {s}}_{0}\) to be \(({\mathbf {U}}^{(0)},{\mathcal {Z}}^{(0)},{\mathcal {S}}^{(0)},{\mathcal {W}}^{(0)})\) with \({\mathbf {U}}^{(0)} \in T_{memb}({\mathcal {Z}}^{(0)},{\mathcal {S}}^{(0)},{\mathcal {W}}^{(0)})\); \({\mathcal {C}}\) to be the whole of \({\mathcal {E}}\) (whose compactness is guaranteed by Lemma 1); \(\Omega \) to be \(f_{m,{\varvec{\rho }},h,g}\) (being a sum of continuous functions, \(f_{m,{\varvec{\rho }},h,g}\) is continuous); and \({\mathcal {R}}\) to be \({\mathcal {T}}\) of (13) (Lemma 2 justifies this choice). By Lemma 7, J is closed at every point outside \({\mathcal {T}}\). Thus, from Zangwill’s convergence theorem, the limit of any convergent subsequence of \(\{({\mathbf {U}}^{(t)},{\mathcal {Z}}^{(t)},{\mathcal {S}}^{(t)},{\mathcal {W}}^{(t)}) \}_{t=0}^{\infty }\) lies in \({\mathcal {T}}\). Next, \(\forall ({\mathcal {Z}}^{(0)},{\mathcal {S}}^{(0)},{\mathcal {W}}^{(0)}) \in {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}\), we consider the sequence \(\{ J^{(t)}(T_{memb}({\mathcal {Z}}^{(0)},{\mathcal {S}}^{(0)},{\mathcal {W}}^{(0)}),{\mathcal {Z}}^{(0)},{\mathcal {S}}^{(0)},{\mathcal {W}}^{(0)})\}_{t=0}^{\infty }\), which is contained in the compact set \({\mathcal {C}}\). Hence, by the Bolzano–Weierstrass theorem (Olmsted 1961), it has a convergent subsequence. Together, these statements imply that the sequence \(\{ J^{(t)}(T_{memb}({\mathcal {Z}}^{(0)},{\mathcal {S}}^{(0)},{\mathcal {W}}^{(0)}),{\mathcal {Z}}^{(0)},{\mathcal {S}}^{(0)},{\mathcal {W}}^{(0)})\}_{t=0}^{\infty }\), \(\forall ({\mathcal {Z}}^{(0)},{\mathcal {S}}^{(0)},{\mathcal {W}}^{(0)}) \in {\mathcal {B}}^{c} \times {\mathcal {M}}_{d,{\varvec{\rho }}} \times {\mathcal {H}}^{d \times c}\), either terminates at a point in \({\mathcal {T}}\) given by Eq. (13) or has a subsequence that converges to a point in \({\mathcal {T}}\). \(\square \)

Theorem 8 provides a complete convergence result for the newly developed IPINCWD-based automated feature-weighted clustering algorithms. This class of clustering algorithms and the squared Euclidean distance based clustering algorithms thus share the same type of global convergence characteristics.

5 Relationship with the existing feature weighting schemes and clustering algorithms

In this section, we discuss how the proposed feature-weighted clustering algorithm with IPINCWD is related to the existing clustering algorithms and feature weighting schemes. We start with a comparative discussion of the convergence analysis of the IPINCWD-based clustering algorithms and that of the classical FCM with the squared Euclidean distance.

5.1 A comparative discussion on the convergence analysis of IPINCWD-based clustering algorithm

The fundamental difference between the convergence analysis of FCM with the squared Euclidean distance and that of the general IPINCWD-based automated feature-weighted clustering hinges on the following fact. For the feature-weighted clustering algorithm with IPINCWD measures, although a unique update rule is known to exist for the cluster representatives, the norm-inducing matrices, and the feature weights corresponding to each cluster, the continuity of these update rules is not known a priori, and continuity plays a key role in proving the convergence of FCM with the squared Euclidean distance. Hence, given the convergence analysis in Sect. 3, the convergence analysis of the clustering algorithm with IPINCWD boils down to proving the closedness of the clustering operator J, which (by Definition 6) is essentially equivalent to proving the continuity of \(O_{matrix}\), \(O_{cent}\), and \(O_{weight}\). In this article, we develop a novel proof of the fact that, even in the absence of closed-form update rules for the cluster representatives, the inner product inducing matrices, and the feature weights corresponding to each cluster, a unique update rule exists and is continuous. This enables us to carry out the convergence analysis for a much broader class of IPINCWD-based automated feature-weighted clustering algorithms.

5.2 Relation with Gustafson–Kessel like algorithms

The conventional fuzzy covariance matrix based GK algorithm (Gustafson and Kessel 1978) can be obtained as a special case of the proposed algorithm with specific choices of h and g (Definition 1): take h to be the identity and g to be identically 1. This choice of g removes the effect of feature weighting, and the weight update becomes meaningless. If, instead, we choose h to be the identity and a non-constant g, a feature-weighted version of the conventional Gustafson–Kessel algorithm is obtained as a special case of the proposed general class of automated feature-weighted IPINCWD-based clustering algorithms. This generalized feature weighting scheme, introduced here in the framework of the GK algorithm, can also be extended to various similar kinds of clustering algorithms (Liu et al. 2007a, b, 2009a, b). A general feature weighting scheme can thus be derived to match the dissimilarity measures inspired by IPIN; it can also be integrated with IPIN-based clustering that uses a fixed non-singular matrix (Bezdek 1981) as the inner product inducing matrix.
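
As an illustrative sketch only (the exact weighted form is fixed by Definition 1, which we do not restate here), suppose the feature weights enter the dissimilarity through g as a diagonal scaling of the residual,

$$\begin{aligned} dist_{g}\left( \mathbf {\Sigma }_{j},{\mathbf {z}}_{j},{\mathbf {x}}_{i},{\mathbf {w}}_{j}\right) =\left( {\mathbf {x}}_{i}-{\mathbf {z}}_{j}\right) ^{T}\mathrm {diag}\left( g({\mathbf {w}}_{j})\right) \,\mathbf {\Sigma }_{j}\,\mathrm {diag}\left( g({\mathbf {w}}_{j})\right) \left( {\mathbf {x}}_{i}-{\mathbf {z}}_{j}\right) . \end{aligned}$$

Setting \(g\equiv 1\) then collapses this expression to the familiar GK distance \(({\mathbf {x}}_{i}-{\mathbf {z}}_{j})^{T}\mathbf {\Sigma }_{j}({\mathbf {x}}_{i}-{\mathbf {z}}_{j})\), with \(\mathbf {\Sigma }_{j}\) playing the role of the norm-inducing matrix.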

5.3 Relation with existing feature weighting scheme

If we choose h to be the identity and consider \(g(x)=x^{m}\), a special feature-weighted version of the conventional GK algorithm (with the conventional weight of the form \(w_{ij}^{m}\)) is obtained. This feature weighting scheme coincides with the weighting scheme introduced in Saha and Das (2015a) for IPIN. Hence, the feature weighting scheme corresponding to IPIN introduced in Saha and Das (2015a) is a special case of the generalized feature weighting scheme introduced in this article.

5.4 Extension in general divergence setup

To the best of our knowledge, the novel feature weighting scheme introduced in this article is the first of its kind. The existing literature on feature weighting (mentioned in Sect. 2) generally deals with a specific choice of the form \(w_{ij}^{m}\), \(\forall i=1,2,\ldots ,n;\; j=1,2,\ldots ,c\), which is just a special case of the weight function g introduced in this article. The theoretical study on automated feature weighting presented in Saha and Das (2015a), corresponding to clustering with separable distance functions, can also be generalized to the broad class of feature weighting schemes presented here.

6 Experimental results

In this section, we present a sample performance comparison (on several simulated and real-life datasets) of the proposed algorithm with 5 other pertinent clustering algorithms, just to highlight the usefulness of our proposal.

6.1 Benchmark datasets

Here, we consider a total of 10 datasets of which 8 are synthetic and 2 from the real world. Table 1 provides a brief description of the datasets.

Table 1 Summary of used datasets

6.2 Performance measures

The performance of the clustering algorithms is evaluated using both fuzzy and hard partition-based validity functions. To obtain a hard partition from a soft partition (where required), we assign each point to the cluster with the highest membership degree. In the case of a tie, the point is assigned to one of the tied clusters uniformly at random.
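
A minimal sketch of this hardening step (the function and variable names are ours, not the paper's):

```python
import numpy as np

def harden_partition(U, rng=None):
    """Convert an (n x c) fuzzy membership matrix U into hard labels.

    Each point is assigned to the cluster with the highest membership;
    ties are broken uniformly at random, as described in the text."""
    rng = np.random.default_rng() if rng is None else rng
    n = U.shape[0]
    labels = np.empty(n, dtype=int)
    for i in range(n):
        tied = np.flatnonzero(U[i] == U[i].max())   # clusters sharing the maximum membership
        labels[i] = rng.choice(tied)                # uniform random tie-break
    return labels
```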

Let \({\mathcal {T}}=\big \{t_{1},t_{2},\ldots ,t_{R} \big \}\) and \({\mathcal {S}}=\big \{s_{1},s_{2},\ldots ,s_{c} \big \}\) be two valid partitions of the given data, where \({\mathcal {T}}\) is the actual partition and \({\mathcal {S}}\) is the partition obtained by some clustering algorithm. We wish to evaluate the goodness of \({\mathcal {S}}\). Here \(n_{ij}\) is the number of objects present in both cluster \(t_{i}\) and cluster \(s_{j}\); \(n_{i.}\) is the number of objects present in cluster \(t_{i}\); and \(n_{.j}\) is the number of objects present in cluster \(s_{j}\).
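
These counts can be collected in a single contingency table; a short numpy sketch (our own helper, not part of the paper):

```python
import numpy as np

def contingency_counts(true_labels, pred_labels):
    """Return the table N with N[i, j] = n_ij, together with the row sums n_i.
    and the column sums n_.j, for two flat arrays of cluster labels."""
    t_ids = np.unique(true_labels)
    s_ids = np.unique(pred_labels)
    N = np.zeros((t_ids.size, s_ids.size), dtype=int)
    for a, t in enumerate(t_ids):
        for b, s in enumerate(s_ids):
            N[a, b] = np.sum((true_labels == t) & (pred_labels == s))
    return N, N.sum(axis=1), N.sum(axis=0)
```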

In order to compare the clustering performance of the different algorithms, we use the Modified Partition Coefficient \((V_{MPC})\) (Dave 1996), Partition Entropy \((V_{PE})\) (Bezdek 1973), Xie–Beni Function \((V_{XB})\) (Xie and Beni 1991), and Adjusted Rand Index (ARI) (Yeung and Ruzzo 2001; Hubert and Arabie 1985). Table 2 provides a brief description of the cluster validity functions under consideration.

Table 2 Description of the cluster validity functions
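
For completeness, a sketch of these indices under their standard textbook definitions (one common squared-Euclidean form is used for Xie–Beni; the exact formulas used here are those of the references cited in Table 2). In the sketch, U is the \(n \times c\) membership matrix, X the data matrix, and Z the matrix of cluster representatives; ARI is taken from scikit-learn.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score  # ARI (Hubert and Arabie 1985)

def v_mpc(U):
    """Modified Partition Coefficient: 1 - c/(c-1) * (1 - PC), PC = (1/n) sum u_ij^2."""
    n, c = U.shape
    pc = np.sum(U ** 2) / n
    return 1.0 - (c / (c - 1.0)) * (1.0 - pc)

def v_pe(U, eps=1e-12):
    """Partition Entropy: -(1/n) sum u_ij log u_ij; lower values mean crisper partitions."""
    n = U.shape[0]
    return -np.sum(U * np.log(U + eps)) / n

def v_xb(U, X, Z, m=2.0):
    """Xie-Beni index (squared Euclidean form): compactness divided by separation."""
    n, c = U.shape
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # (n, c) squared distances
    compactness = np.sum((U ** m) * d2)
    dz = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # centroid-to-centroid distances
    separation = dz[~np.eye(c, dtype=bool)].min()
    return compactness / (n * separation)

# ARI between the ground truth and a hardened partition:
# adjusted_rand_score(true_labels, pred_labels)
```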

6.3 Simulation procedures

The simulation results were obtained by using the following computational protocols over all the datasets.

Choice of \({\varvec{h}}\) and \({\varvec{g}}\) From the very definition of h and g, it is obvious that there are infinitely many mathematical candidates for them. Since it is not possible to compare extensively all the weighted dissimilarity measures generated by the various choices of h and g, we use very simple forms of these functions for demonstration purposes only. In particular, we choose and fix h to be the identity function \((h(x)=x)\) and g to be the function given by \(g(x)=x^{\beta }\), \(\beta >1\).
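
In code, this fixed choice amounts to nothing more than the following (\(\beta \) being the user-set exponent discussed next):

```python
def h(x):
    """Identity transform, h(x) = x."""
    return x

def g(x, beta=2.0):
    """Weight transform g(x) = x**beta, with beta > 1."""
    return x ** beta
```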

Algorithms under consideration For a comparative analysis, we choose the following standard algorithms: FCM with the fuzzy covariance matrix (IPINCD) (A0) (Gustafson and Kessel 1978; Saha and Das 2016a), the proposed FCM with the specific choices of h and g mentioned earlier (IPINCWD) (A1), FCM with the squared Euclidean distance (FCM) (A2), FCM based on automated feature variable weighting (WFCM) (A3) (Nazari et al. 2013), the k-means type algorithm with the squared Euclidean distance (k-means) (A4) (MacQueen et al. 1967), and k-means clustering with automated feature weights (w-k-means) (A5) (Huang et al. 2005).

Choice of the exponent of weights in weighted FCM, weighted \({\varvec{k}}\)-means, and of \(\varvec{\beta }\) We take integer values of the exponent, varying it from 2 to 10, and consider the one with the best average value of ARI.
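
A sketch of this selection protocol; `run_clustering` is a placeholder standing for one run of the corresponding weighted algorithm and is not a function from the paper:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def select_exponent(X, true_labels, run_clustering, exponents=range(2, 11),
                    n_runs=30, seed=0):
    """Return the integer exponent in {2,...,10} with the best average ARI over
    n_runs restarts; run_clustering(X, exponent, seed) yields hard labels."""
    best_exp, best_ari = None, -np.inf
    for e in exponents:
        aris = [adjusted_rand_score(true_labels, run_clustering(X, e, seed + r))
                for r in range(n_runs)]
        if np.mean(aris) > best_ari:
            best_exp, best_ari = e, float(np.mean(aris))
    return best_exp, best_ari
```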

In the comparative study over different \(\beta \) values, setting \(\beta =0\) (i.e., removing the effect of the weights) with the specific choice of h reduces A1 to the conventional FCM with the fuzzy covariance matrix (Gustafson and Kessel 1978), i.e., A0.

6.4 Results and discussions

In what follows, we discuss the comparative performance of the algorithms on each of the datasets.

2elps_1gauss For this dataset, A1 achieves perfect clustering in all the 30 runs for \(\beta > 2\), as is evident from the plot of the points after clustering (Fig. 1a) and the ARI values (Fig. 1e). Algorithms A4 and A5 were far from accurate; A2 and A3 managed to maintain good performance in terms of ARI, but on average, the new algorithm outperforms them for most values of \(\beta \). Considering the fuzziness of the partitions, the best performance was shown by A3 with \(\beta = 2\) (Fig. 1b, c), yet A3 with \(\beta = 2\) actually performed rather poorly in terms of ARI (Fig. 1e). For the other values of \(\beta \), the values of MPC and PE are almost the same (Fig. 1b, c) for A1, A3, and A2. A0 performs the worst (Fig. 1b, c) in terms of MPC and PE. As far as the geometric structure of the clusters is concerned, the proposed algorithm performs better than A2 and A3 in terms of the XB Index for all values of \(\beta \) (Fig. 1d).

Fig. 1 a The best clustering performance of our algorithm on 2elps_1gauss, b average value of \(V_{MPC}\), c average value of \(V_{PE}\), d average value of \(V_{XB}\), e average value of Adjusted Rand Index

Face For this dataset, A1 yields perfect clustering in all the 30 runs for all non-zero values of \(\beta \) under consideration. As far as the ARI values (Fig. 2e) are concerned, A1 outperforms the other algorithms by a considerable margin. Among the other algorithms, A3 showed the worst performance for \(\beta = 2\). Taking fuzziness into consideration, according to MPC and PE (Fig. 2b, c), A1 shows considerable improvement over A2, A0, and A3 for all \(\beta \ne 0\). Though A2 and A1 with \(\beta = 0\) perform slightly better than A1 in terms of the XB Index (Fig. 2d), our algorithm improves over A3 for \(\beta \ne 0\).

Fig. 2 a The best clustering performance of our algorithm on the Face dataset, b average value of \(V_{MPC}\), c average value of \(V_{PE}\), d average value of \(V_{XB}\), e average value of Adjusted Rand Index

Spherical 5_2 On this dataset, the proposed algorithm A1 yields an almost perfect partitioning in all the 30 runs for all \(\beta \ne 0\). The slight deviation from perfect clustering [according to ARI (Fig. 3e)] is due to the inability to handle overlapping cluster structures. However, A1 performs much better than A0 [MPC and PE (Fig. 3b, c)], A4, and A5 [Minkowski Score and ARI (Fig. 3e, f)]. Due to the presence of perfectly spherical clusters, A2 and A3 performed well on this dataset (in terms of the XB Index too), but unlike A0, our algorithm (in spite of using the Mahalanobis distance) is adaptive enough to perform as well as them.

Fig. 3 a The best clustering performance of our algorithm on Spherical 5_2, b average value of \(V_{MPC}\), c average value of \(V_{PE}\), d average value of \(V_{XB}\), e average value of Adjusted Rand Index

Spherical 6_2 Here we find that A1 provides perfect clustering for all the 30 runs and for all the non-zero values of \(\beta \) under consideration. A2 and A3, due to the presence of well-separated spherical clusters, perform well. The overall performance of A4 and A5 is not satisfactory according to the ARI values (Fig. 4e). As A0 uses the Mahalanobis distance, it fails to recognize the spherical cluster pattern and its performance is not very impressive. On the other hand, despite using the Mahalanobis distance, A1 remains adaptive enough to produce the best overall performance according to all the performance measures (Fig. 4b–e).

Fig. 4 a The best clustering performance of our algorithm on Spherical 6_2, b average value of \(V_{MPC}\), c average value of \(V_{PE}\), d average value of \(V_{XB}\), e average value of Adjusted Rand Index

st900 In spite of the presence of overlapping clusters, this dataset is partitioned almost perfectly by A1 (Fig. 5a). Such clusters, however, make the job harder for A2 and A3, as can be observed from Fig. 5e. The proposed algorithm outperforms both A2 and A3 by a significant margin \(\forall \beta \ne 0\). Though the MPC and PE values for A3 corresponding to \(\beta = 2,3\) appear high (Fig. 5b, c), its clustering performance remains fairly poor (Fig. 5e, f). A0, A4, and A5 did not perform satisfactorily according to the ARI values (Fig. 5e). The best performance of our algorithm with respect to the XB Index is better than that of the other algorithms under consideration.

Fig. 5 a The best clustering performance of our algorithm on st900, b average value of \(V_{MPC}\), c average value of \(V_{PE}\), d average value of \(V_{XB}\), e average value of Adjusted Rand Index

elliptical_10_2 For this dataset, A1 produces perfect clustering in almost all runs for \(\beta = 5\). The presence of a single elliptical cluster is enough to affect the performance of A3 and A2, as even their best average ARI value remains significantly lower than that of A1 (Fig. 6e). The overall performances of A4 and A5 are not satisfactory according to the ARI values (Fig. 6e). A0 was unable to perform well on any of the measures under consideration. Coming to the fuzziness of the partitions, the MPC and PE values of A3 for \(\beta = 2,3\) are high (Fig. 6b, c), but its clustering performance remains considerably poorer than that of A1 (Fig. 6e). Here also, the best performance of our algorithm in terms of the XB Index is better than that of the other algorithms under consideration.

Fig. 6 a The best clustering performance of our algorithm on elliptical_10_2, b average value of \(V_{MPC}\), c average value of \(V_{PE}\), d average value of \(V_{XB}\), e average value of Adjusted Rand Index

Step3_blocks This dataset has two actual features and one noise feature. Our algorithm detects the noise feature nicely and gives a near perfect clustering for \(\beta > 4\); according to the ARI values, it performs really well for \(\beta > 4\) (Fig. 7e). The weight of the noise feature is driven to zero. On the other hand, the noise feature, as expected, deteriorates the performances of A2, A0, and A4 significantly. A3 and A5 perform much better than their unweighted counterparts. As far as fuzziness is concerned, A1 with \(\beta = 3,4,5,6\) performs much better than the rest of the algorithms (Fig. 7b, c). Among these, A1 with \(\beta = 5,6\) also shows greater ARI values (Fig. 7e).

Fig. 7 a The best clustering performance of our algorithm on Step3_blocks, b average value of \(V_{MPC}\), c average value of \(V_{PE}\), d average value of \(V_{XB}\), e average value of Adjusted Rand Index

Step60_blocks This 60-dimensional dataset has 30 noise features, each following an independent N(0, 1) distribution. There are three clusters, each consisting of 200 points. The 30 features that are instrumental in determining the clustering are simulated from \(N({\mathbf {m}}_{i},\Sigma _{i}), i=1,2,3\), where \({\mathbf {m}}_{1}={\mathbf {1}}, \, {\mathbf {m}}_{2}=2{\mathbf {m}}_{1}\), \({\mathbf {m}}_{3}=3{\mathbf {m}}_{1}\), and \(\Sigma _{i}=\frac{1}{4}{\mathbf {I}}\). On this dataset, we observe clustering performance similar to that on Step3_blocks. This provides an example showing that our algorithm works in higher dimensions as well.
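
The dataset described above can be simulated as follows (a numpy sketch; the helper name is ours):

```python
import numpy as np

def make_step60_blocks(points_per_cluster=200, seed=0):
    """Simulate the Step60_blocks-style data described above: 3 clusters of 200
    points, 30 informative features from N(m_i, (1/4) I) with m_1 = 1, m_2 = 2,
    m_3 = 3 (componentwise), plus 30 independent N(0, 1) noise features."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for label, mean in enumerate([1.0, 2.0, 3.0]):
        informative = rng.normal(loc=mean, scale=0.5,          # sd = sqrt(1/4)
                                 size=(points_per_cluster, 30))
        noise = rng.normal(loc=0.0, scale=1.0,
                           size=(points_per_cluster, 30))
        X.append(np.hstack([informative, noise]))
        y.append(np.full(points_per_cluster, label))
    return np.vstack(X), np.concatenate(y)
```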

Iris Dataset The parallel coordinate plot of the best clustering by A1 shows that the algorithm is as effective on this real-world dataset as on the synthetic ones. A1 performed much better than A2, A0, A4, and A5, and was almost as good as A3, which was the top performer among the 5 other algorithms in terms of ARI (Fig. 8e). A1 also performed almost as well as the other 3 algorithms in terms of the XB Index, except for a few values of \(\beta \) (2, 3, 4). As far as fuzziness (MPC and PE) is concerned, A1 performed better than the rest of the algorithms for \(\beta = 2,3,4,5,6\) (Fig. 8b, c).

Fig. 8 a The best clustering performance of our algorithm on the iris dataset, b average value of \(V_{MPC}\), c average value of \(V_{PE}\), d average value of \(V_{XB}\), e average value of Adjusted Rand Index

Seed Dataset The parallel coordinate plot of the best clustering by A1 shows that it gives a more or less accurate clustering on this real-world dataset. In terms of ARI (Fig. 9e), the A2 and A0 algorithms perform the best, while even the best average performances of A3, A4, and A5 are far from those values. In the same context, our algorithm with \(\beta = 2\) not only performs well but also outperforms them in terms of the mean value. With respect to the XB Index as well, A1 is better than A3 for all values of \(\beta \); and for \(\beta = 8,9,10\), it is quite close to the best performance shown jointly by A2 and A1 with \(\beta = 0\). As far as the fuzziness of the obtained partition is concerned (in terms of MPC and PE), A1 performs much better than the other three fuzzy clustering algorithms for all \(\beta > 1\) (Fig. 9b, c).

Fig. 9 a The best clustering performance of our algorithm on the seed dataset, b average value of \(V_{MPC}\), c average value of \(V_{PE}\), d average value of \(V_{XB}\), e average value of Adjusted Rand Index

6.5 Statistical comparison

For each of the 5 performance measures and the 10 datasets under consideration, we carry out Wilcoxon's rank-sum test (paired) between the best average performances of the algorithms under consideration, to check whether our algorithm offers a statistically significant improvement over the others. Here, for A1, A3, and A5, the \(\beta \) producing the best average over the set of 30 runs is considered. In Tables 3, 4, 5 and 6, the P values obtained from the rank-sum test are reported in each column beneath the main value, within parentheses.
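
A sketch of one such comparison for a single dataset and measure (SciPy; the variable names are placeholders, and the direction of the alternative follows the hypotheses listed in the subsections below):

```python
from scipy.stats import ranksums  # the `alternative` keyword requires SciPy >= 1.7

def compare_runs(scores_a1, scores_other, alternative="greater"):
    """One-sided Wilcoxon rank-sum test between two sets of 30 run-level scores
    (e.g., ARI of A1 vs. ARI of a competitor on one dataset).

    alternative="greater" tests whether A1's scores tend to be larger; use
    "less" for measures such as V_PE or V_XB where smaller values are better."""
    stat, p_value = ranksums(scores_a1, scores_other, alternative=alternative)
    return stat, p_value
```

For a strictly paired comparison, the Wilcoxon signed-rank test (scipy.stats.wilcoxon) applied to the per-run differences would be the drop-in alternative.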

6.5.1 Modified partition coefficient

For a specific dataset \({\mathcal {R}}\), let the median of \(V_{MPC}\) corresponding to algorithms A1 \((\beta > 1)\), A0, A2, and A3 be denoted by \(m_{11}, m_{12}, m_{13}\), and \(m_{14}\) respectively. We perform paired Wilcoxon’s rank-sum test on the following hypothesis testing setup:

$$\begin{aligned} {\mathbf {H}}_{11,N} : m_{11}=m_{12}\,\, \text {vs.}\,\,{\mathbf {H}}_{11,A} : m_{11}>m_{12}\\ {\mathbf {H}}_{12,N} : m_{11}=m_{13}\,\, \text {vs.}\,\,{\mathbf {H}}_{12,A} : m_{11}>m_{13}\\ \end{aligned}$$

and,

$$\begin{aligned} {\mathbf {H}}_{13,N} : m_{11}=m_{14}\,\, \text {vs.}\,\,{\mathbf {H}}_{13,A} : m_{11}>m_{14} \end{aligned}$$
Table 3 Average \(V_{MPC}\) along with the respective P values within parentheses

Comparing the best average performances of the 3 other fuzzy clustering algorithms under consideration with the best average performance of A1 \((\beta > 1)\), we see (Table 3) that for datasets 2–4 and 7–10 our algorithm shows a statistically significant improvement for the considered \(\beta \)s. For datasets 1 and 6, we have a significant improvement over A0. It outperforms both A2 and A0 on dataset 5. The mean of the best average performances over the 9 datasets shows that our algorithm is the best among all the fuzzy clustering methods under consideration according to \(V_{MPC}\).

6.5.2 Partition entropy

For a specific dataset \({\mathcal {R}}\), let the median of the \(V_{PE}\) corresponding to algorithms A1 \((\beta > 1)\), A0, A2, and A3 be denoted by \(m_{21}, m_{22}, m_{23}\), and \(m_{24}\) respectively. We perform paired Wilcoxon’s rank-sum test on the following hypothesis testing setup:

$$\begin{aligned} {\mathbf {H}}_{21,N} : m_{21}=m_{22}\,\, \text {versus}\,\,{\mathbf {H}}_{21,A} : m_{21}<m_{22}\\ {\mathbf {H}}_{22,N} : m_{21}=m_{23}\,\, \text {versus}\,\,{\mathbf {H}}_{22,A} : m_{21}<m_{23}\\ \end{aligned}$$

and,

$$\begin{aligned} {\mathbf {H}}_{23,N} : m_{21}=m_{24}\,\, \text {versus}\,\,{\mathbf {H}}_{23,A} : m_{21}<m_{24} \end{aligned}$$
Table 4 Average \(V_{PE}\) along with the respective P values within parentheses

Here also, we observe the same pattern as in the case of \(V_{MPC}\) (Table 4). A1 \((\beta > 1)\) shows a statistically significant improvement over the rest of the fuzzy clustering algorithms in most of the cases: it outperforms A0, A2, and A3 on 10, 7, and 6 datasets respectively. In terms of the mean of the best average performances, it appears to be the best among the algorithms considered.

6.5.3 Xie–Beni Index

For a specific dataset \({\mathcal {R}}\), let the median of the \(V_{XB}\) corresponding to algorithms A1 \((\beta > 1)\), A1\((\beta =0)\), A2, and A3 be denoted by \(m_{31}, m_{32}, m_{33}\), and \(m_{34}\) respectively. We perform paired Wilcoxon’s rank-sum test on the following hypothesis testing setup:

$$\begin{aligned} {\mathbf {H}}_{31,N} : m_{31}=m_{32}\,\, \text {versus}\,\,{\mathbf {H}}_{31,A} : m_{31}<m_{32}\\ {\mathbf {H}}_{32,N} : m_{31}=m_{33}\,\, \text {versus}\,\,{\mathbf {H}}_{32,A} : m_{31}<m_{33} \end{aligned}$$

and,

$$\begin{aligned} {\mathbf {H}}_{33,N} : m_{31}=m_{34}\,\, \text {versus}\,\,{\mathbf {H}}_{33,A} : m_{31}<m_{34} \end{aligned}$$
Table 5 Average \(V_{XB}\) along with the respective P values within parentheses

Here we see (Table 5) that the proposed algorithm appears statistically superior to the rest of the algorithms in several cases. It performs better than the 3 other algorithms on datasets 5 and 6; better than A3 and A0 on dataset 1; better than A3 on datasets 2, 7, and 10; and better than A2 and A0 on datasets 3 and 4. As far as the mean value of the best average performance over all the 9 datasets is concerned, our algorithm proves to be the best one.

6.5.4 Adjusted rand index

For a specific dataset \({\mathcal {R}}\), let the median of the ARI corresponding to algorithms A1 \((\beta > 1)\), A0, A2, A3, A4, and A5 be denoted by \(m_{41}, m_{42}, m_{43}\), \(m_{44}\), \(m_{45}\), and \(m_{46}\) respectively. We perform paired Wilcoxon’s rank-sum test on the following hypothesis testing setup:

$$\begin{aligned} {\mathbf {H}}_{41,N} : m_{41}=m_{42}\,\, \text {versus}\,\,{\mathbf {H}}_{41,A} : m_{41}>m_{42}\\ {\mathbf {H}}_{42,N} : m_{41}=m_{43}\,\, \text {versus}\,\,{\mathbf {H}}_{42,A} : m_{41}>m_{43}\\ {\mathbf {H}}_{43,N} : m_{41}=m_{44}\,\, \text {versus}\,\,{\mathbf {H}}_{43,A} : m_{41}>m_{44}\\ {\mathbf {H}}_{44,N} : m_{41}=m_{45}\,\, \text {versus}\,\,{\mathbf {H}}_{44,A} : m_{41}>m_{45} \end{aligned}$$

and,

$$\begin{aligned} {\mathbf {H}}_{45,N} : m_{41}=m_{46}\,\, \text {versus}\,\,{\mathbf {H}}_{45,A} : m_{41}>m_{46} \end{aligned}$$
Table 6 Average ARI along with the respective P values (within parentheses)

From Table 6, we observe that our algorithm achieves the best average ARI value on all the datasets, with an overall average of 0.935.

7 An estimation of the runtime complexity

The computational complexity of the proposed algorithm depends on the choice of g and h. Hence, it is not possible to find a general expression for the asymptotic complexity of the proposed algorithm. Here we develop a theoretical expression for the computational overhead of the proposed clustering algorithm with the specific choice of h and g mentioned in Sect. 6.3.

Membership matrix update Here we observe that, independent of the choices of h and g and assuming that each function evaluation costs O(1), the computational complexity of the membership matrix update is O(\(nc^3d^2\)). This follows from the fact that there are cd coordinates of the cluster representatives and the corresponding weights. Under the assumption that the weights corresponding to each coordinate are equal across the clusters, the computational cost reduces to \(O(nc^2d^2)\).

Cluster representative update With this specific choice of h, we have a closed-form update rule for the cluster representatives (which is nothing but the weighted average of the points, obtained by setting the derivative to zero and utilizing the inner product structure). Hence, in this case, the complexity of the cluster representative update is exactly the same as that of the other algorithms under consideration, namely O(ncd).
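
As a sketch, if the dissimilarity is a quadratic form \(({\mathbf {x}}_{i}-{\mathbf {z}}_{j})^{T}{\mathbf {M}}_{j}({\mathbf {x}}_{i}-{\mathbf {z}}_{j})\) with a non-singular \({\mathbf {M}}_{j}\) (as in the weighted Mahalanobis-type case), the first-order condition \(\sum _{i=1}^{n}(u_{ij})^{m}{\mathbf {M}}_{j}({\mathbf {x}}_{i}-{\mathbf {z}}_{j})=0\) immediately yields

$$\begin{aligned} {\mathbf {z}}_{j}=\frac{\sum _{i=1}^{n}(u_{ij})^{m}\,{\mathbf {x}}_{i}}{\sum _{i=1}^{n}(u_{ij})^{m}},\quad j=1,2,\ldots ,c, \end{aligned}$$

and evaluating this for all c clusters costs O(ncd).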

Inner product inducing matrix update In this scenario, due to the choice of h, we have a closed-form update rule for the inner product inducing matrix. It is obtained by multiplying the weight matrix (say, the one corresponding to the jth cluster) with the residuals \(({\mathbf {x}}_i-{\mathbf {z}}_{j}),i=1,2,\ldots ,n\), and then applying the update rule used in clustering with the fuzzy covariance matrix. The computational overhead of this update is \(O(nc^2d^2)\).
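
For reference, writing the weight-scaled residuals as \({\mathbf {r}}_{ij}=\mathrm {diag}\left( g({\mathbf {w}}_{j})\right) ({\mathbf {x}}_{i}-{\mathbf {z}}_{j})\), the Gustafson–Kessel style update referred to here reads (a sketch; the exact constraint handling follows the paper)

$$\begin{aligned} {\mathbf {C}}_{j}=\frac{\sum _{i=1}^{n}(u_{ij})^{m}\,{\mathbf {r}}_{ij}{\mathbf {r}}_{ij}^{T}}{\sum _{i=1}^{n}(u_{ij})^{m}},\qquad \mathbf {\Sigma }_{j}=\left( \rho _{j}\det {\mathbf {C}}_{j}\right) ^{1/d}{\mathbf {C}}_{j}^{-1}, \end{aligned}$$

where \(\rho _{j}\) is the fixed determinant (volume) constraint on the jth norm-inducing matrix.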

Weight vector update Here the optimization problem with respect to the weights is convex and the objective function is differentiable. Hence, we use gradient descent to solve the problem. For the sake of computational simplicity, we run a fixed number of iterations and use the result as a close approximation of the actual optimizer. If we use, say, r iterations to approximate the solution, the complexity of this step is O(ncdr). Note that more sophisticated optimization tools such as stochastic gradient descent could yield a better bound on the computational complexity (based on a user-determined error), but that is beyond the purview of the present article.
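
A minimal sketch of this fixed-budget approximation for the weight vector of one cluster; the gradient callback and the feasible set (here simply clipped to \([0,1]^{d}\)) are placeholders for the actual objective and constraints of the weight sub-problem:

```python
import numpy as np

def approx_weight_update(w0, grad_fn, step=1e-2, r=20):
    """Run r projected gradient-descent steps on the convex, differentiable
    weight sub-problem and return the approximate minimizer.

    w0      : initial (d,) weight vector of one cluster, entries in [0, 1]
    grad_fn : placeholder callable returning the objective's gradient at w"""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(r):                 # fixed iteration budget => O(d r) per cluster
        w = w - step * grad_fn(w)      # gradient step
        w = np.clip(w, 0.0, 1.0)       # project back onto the hypercube [0, 1]^d
    return w
```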

Hence, under the assumption that the feature weights do not vary across clusters (which is reasonable, since weighted FCM and weighted k-means also keep the weights common across clusters), the overall computational complexity of the procedure turns out to be \(O(nc^2d^2)\) (r being a constant, not of the order of n, c, or d). Table 7 provides a comparative view of the asymptotic complexities of the algorithms under consideration.

Table 7 Asymptotic computational overhead of the algorithms compared

Hence, the asymptotic complexity of the proposed algorithm is similar to that of WFCM.

8 Conclusion

In this article, we start with a general class of dissimilarity measures based on IPINs. We introduce a novel feature weighting scheme and define a new class of IPINCWD measures. Next, we develop the automated feature-weighted versions of the hard and fuzzy clustering algorithms with this class of IPINCWD measures, in terms of the Lloyd heuristic and the alternating optimization procedure respectively. We undertake a detailed discussion of the existence and uniqueness issues of the sub-optimization problems that form the basic structure of the proposed clustering algorithms. We demonstrate that, for any set of initial points, the sequence generated by the clustering operator converges (or has a converging subsequence) to a point in the set of optimal points.

The theoretical development of the generalized IPINCWD-based clustering algorithms provides a flexible class of dissimilarity measures which can be used to design data-specific clustering algorithms. Such algorithms can perform better clustering in situations involving non-spherical clusters with background noise, overlapping clusters, clusters of unequal size and density, etc. Spatial constraints may also be introduced, which may lead to a general class of image segmentation algorithms with better handling of outliers. Given a dataset, what are the bounds on the expected convergence time of this algorithm? How can we approximate the update rules for the cluster representatives, the inner product inducing matrices, and the feature weights corresponding to each cluster when a closed-form expression is not readily available? Even partial answers to such theoretical questions would have a significant practical impact and deserve further investigation. We wish to continue our future research in this direction.