1 Introduction

Partitional clustering involves the segmentation of a set of samples into clusters based on defined criteria such that samples in the same cluster exhibit greater similarity than those in different clusters. The k-means algorithm, a prominent unsupervised partitional clustering scheme [26], accomplishes this by categorizing data points into k disjoint clusters while minimizing the total distortion. Distortion, in this context, is the sum of distances between data points and their assigned cluster centers. The algorithm has gained widespread acclaim due to its simplicity, efficiency, ease of implementation, and empirical success in various fields such as data mining, pattern recognition, marketing research, and social network analysis [12, 20, 37]. Jain and Kar provided a comprehensive overview of clustering methods, addressing major challenges and key issues in algorithm design [22]. The most commonly used heuristic for k-means clustering is Lloyd's method. It commences with a set of randomly chosen initial cluster centers and, in each iteration, assigns each data point to the cluster with the closest center, as determined by a distance metric. The cluster representatives are then updated to the centers of the newly formed clusters. This iterative process is repeated until a predefined stopping rule is met.

One of the major decisions in the design of the k-means method is the choice of an appropriate distance metric. The sum of squared Euclidean distances is the conventional measure of dissimilarity between data points and has been widely adopted in studies involving k-means and deep-learning-based clustering [35]. In this paper, we propose a k-means clustering algorithm that uses a distance metric derived from the sub-one quasi-norm (the \(\ell _p\) quasi-norm with \(p\in (0, 1)\)). In contrast to the Euclidean distance, this metric leverages similar data-items more effectively while assigning less weight to dissimilar ones. To illustrate the desired properties of the proposed distance metric, we use the Iris dataset to exemplify its ability to yield less biased similarity values in classification tasks. Computational experiments on both synthetic and real-life datasets corroborate the enhanced performance of the proposed k-means method in achieving effective classification. Additional experiments were conducted on datasets with introduced random noise to underscore the resilience of the proposed k-means method to noise. Moreover, we also integrated our novel k-means method into a deep learning clustering model to further investigate its efficacy in handling more intricate tasks and real-world applications. These findings validate the context-dependent effectiveness of our proposed sub-one quasi-norm-based k-means algorithm over standard k-means algorithms using other common distance metrics.

The rest of this paper is organized as follows. Section 2 reviews the search for an appropriate distance metric in the k-means clustering literature. Section 3 introduces the sub-one quasi-norm-based distance and analyzes its desired properties. In Sect. 4, we incorporate the sub-one quasi-norm-based distance metric into a k-means type heuristic and provide a step-by-step iterative relocation scheme. Section 5 presents numerical results to compare the performance of k-means algorithms with the proposed distance against other commonly used distance metrics. Section 6 concludes the paper.

2 Literature Review

The widely recognized k-means algorithm seeks to minimize the sum of squared Euclidean distances from each data point to its nearest cluster center [14, 21, 26]. Nonetheless, the Euclidean distance metric may not consistently produce high-quality cluster centers, especially when the dataset contains noise or outliers. Numerous studies have proposed alternative mathematical metrics to improve distance calculations. In order to increase resistance to noise and outliers, Kersten suggested employing the Manhattan distance instead of the Euclidean distance in the k-means clustering algorithm [25]. Singh et al. explored the implementation of the k-means algorithm using the Minkowski distance metric [33]. Recently, there has been a trend towards exploring various nonlinear distance metrics as alternatives to the Euclidean distance [31]. Bobrowski and Bezdek extended the k-means clustering algorithm by exploring two similarity measures with nonhyperelliptical topologies: the square of the \(l_1\) norm and the square of the supremum norm \(l_\infty \) [5]. Hathaway et al. examined clustering with general \(l_p\) norm distances, with p varying from 0 to 1, and evaluated clustering performance with artificial datasets [18]. Aside from members of the Minkowski or p-norm family, Filippone et al. surveyed modified k-means clustering algorithms incorporating kernel-based similarity measures that produce nonlinear separating hypersurfaces between clusters [15]. de Amorim and Mirkin extended the existing k-means method with an automated feature weighting scheme using the corresponding Minkowski metric to improve resilience to irrelevant or noisy features [11].

Researchers have also suggested replacing the Euclidean distance with various divergence measures that violate some metric properties for the k-means clustering problem. Dhillon et al. proposed a divisive algorithm based on Kullback–Leibler divergences, which are not symmetric and do not obey the triangle inequality [13]. Banerjee et al. presented an alternative formulation of the k-means clustering scheme using the general Bregman divergence, which unifies a broad family of divergence measures [4]. Nielsen et al. investigated a parametric family of divergence measures for k-means clustering [29]. Chakraborty and Das derived the S-distance from the S-divergence defined on the open cone of positive definite matrices and proved the convergence of the induced k-means algorithm to a local optimum [7]. In fuzzy partitioning clustering, where a data sample can belong to multiple clusters with varying degrees of fuzzy membership, Saha and Das explored using geometric divergence to enhance the robustness of fuzzy k-means type algorithms to noisy features [30]. Seal et al. proposed a Jeffrey-divergence based distance metric and developed a fuzzy c-means (FCM) clustering algorithm that guarantees convergence to a local minimum [31].

Recognizing the sensitivity of the commonly used Euclidean distance to the distortion caused by heavy outliers and its biased impact on the overall similarity values [10], alternative measures of dissimilarity have been studied in the machine learning literature. The non-convex \(\ell _p\) quasi-norm with \(p\in (0,1)\) has gained research interest due to its capacity for promoting sparsity in model parameters. This sub-one quasi-norm has found applications in matrix completion [28], proximal support vector machines [9], and vector reconstruction [24] within machine learning and compressed sensing. It has been shown to be more robust than the classical sparsity-inducing \(\ell _1\) norm, which induces the Manhattan distance [34]. The advantages of the sub-one quasi-norm in handling approximately sparse data, noisy measurements, and heavy outliers have been demonstrated. In light of these advantages, Jiang et al. [23] first proposed a sub-one quasi-norm-based similarity measure for collaborative filtering in a recommender system. This similarity measure has proven superior to commonly used distance metrics, as it makes better use of rating values and avoids overweighing dissimilar ratings between users, whether in fully co-rated, sparsely co-rated or cold-start scenarios. Our work represents another step towards exploiting the desired properties of the \(\ell _p\) quasi-norm in learning models by extending the sub-one quasi-norm-based similarity measure to clustering analysis.

3 Sub-One Quasi-Norm-Based Distance Metric

This section describes the distance metric induced by the \(\ell _p\) quasi-norm with \(p\in (0, 1)\) and its desired properties.

For all \(\varvec{x}, \varvec{y}, \varvec{z} \in \mathbb {R}^m\), a well-defined function \(d:~\mathbb {R}^m \times \mathbb {R}^m\rightarrow \mathbb {R}_{+}\) that measures the distance between \(\varvec{x}\) and \(\varvec{y}\) has to satisfy the following three requirements:

  • Identity of indiscernibles: \(d(\varvec{x},\varvec{y})=0\iff \varvec{x}=\varvec{y}\);

  • Symmetry: \(d(\varvec{x},\varvec{y})=d(\varvec{y},\varvec{x})\);

  • Triangle inequality: \(d(\varvec{x},\varvec{z})\le d(\varvec{x},\varvec{y})+d(\varvec{y},\varvec{z})\).

The Minkowski distance \((\sum _{i=1}^m{|x_i-y_i|}^p)^{1/p}\) constitutes a family of classical distance metrics derived from the \(\ell _p\) norm \({\Vert \varvec{x} \Vert }_p=(\sum _{i=1}^m{|x_i|}^p)^{1/p}\) with \(p\ge 1\). Familiar examples within this family include the Euclidean distance (induced by \(p=2\)) and the Manhattan distance (induced by \(p=1\)). In this research, we focus on \(p\in (0,1)\), for which \({\Vert \cdot \Vert }_p\) is no longer subadditive and only forms a quasi-norm. In this case, it holds that

$$\begin{aligned} {\Vert \varvec{x}+\varvec{y} \Vert }_p^p\le {\Vert \varvec{x} \Vert }_p^p+{\Vert \varvec{y} \Vert }_p^p \end{aligned}$$

for any \(\varvec{x}, \varvec{y} \in \mathbb {R}^m\). Hence, the function \({\Vert \cdot \Vert }_p^p\) induced by the \(\ell _p\) quasi-norm becomes subadditive and defines a distance function expressed as

$$\begin{aligned} {d_p(\varvec{x},\varvec{y})} = \sum _{i=1}^m{|x_i-y_i|}^p. \end{aligned}$$
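As a concrete reference, the following is a minimal sketch of this distance function in Python (assuming NumPy is available; the helper name d_p is ours, not from the paper):

```python
import numpy as np

def d_p(x, y, p=0.5):
    """Sub-one quasi-norm-based distance: sum_i |x_i - y_i|^p with 0 < p < 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p)
```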
Fig. 1 Equal-distance contour plot of the \(\ell _{0.5}\)-based distance and the Euclidean distance

The Minkowski distance and the \(\ell _p\) quasi-norm-based distance metric exhibit distinct behaviors in similarity evaluation. As the value of p diminishes below 1, the \(\ell _p\) quasi-norm-based distance becomes increasingly responsive to dimensions exhibiting no change or only minor variation when assessing similarity. To illustrate this characteristic, we depict the contour of the \(\ell _{0.5}\) quasi-norm-based distance function alongside the Euclidean distance function in a two-dimensional space in Fig. 1. Notably, the point (0, 4) has the same Euclidean distance from the origin (0, 0) as the point (2.83, 2.83). However, in terms of the \(\ell _{0.5}\) quasi-norm-based distance, it is only as distant from the origin as the point (1, 1). Therefore, the point (0, 4) is deemed more similar to the origin when the \(\ell _{0.5}\) quasi-norm-based distance is adopted as a similarity measure than when the Euclidean distance is used. In other words, the \(\ell _p\) quasi-norm-based distance metric with \(p\in (0,1)\) emphasizes the similarity between the points (0, 0) and (0, 4) along the x-axis, whereas the Euclidean distance emphasizes their variation along the y-axis.
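These numbers are easy to verify numerically; the short sketch below (assuming NumPy) reproduces them.

```python
import numpy as np

p = 0.5
# Euclidean distances from the origin: both approximately 4.
print(np.hypot(0.0, 4.0), np.hypot(2.83, 2.83))                        # 4.0  ~4.0
# ell_0.5 quasi-norm-based distances from the origin: both exactly 2.
print(abs(0.0) ** p + abs(4.0) ** p, abs(1.0) ** p + abs(1.0) ** p)    # 2.0  2.0
```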

In the context of data classification tasks, leveraging the \(\ell _p\) quasi-norm-based distance as a similarity measure offers two advantages. First, assigning greater importance to similar data-items helps mitigate the impact of substantial variations in a few dimensions, which aids in the identification of data samples belonging to the same category. Second, it enhances the robustness of clustering algorithms against noise and outliers by avoiding excessive emphasis on dissimilar data-items.

Table 1 Example of six iris plants
Fig. 2 Class-stratified histogram of attributes in the Iris dataset

We demonstrate the effectiveness of the proposed \(\ell _p\) quasi-norm-based distance metric using an illustrative example involving six iris plants from the well-known Iris dataset [16], which is widely recognized in the pattern recognition literature. Table 1 presents the sepal and petal length and width measurements (in centimeters) for plants A, B, C, D, E, and F. In this context, plants A and B belong to the Versicolor type, while plants C, D, E and F belong to the Virginica type. From Table 1, we can make the following observations: (i) Regarding the sepal length variable, the measurement of plant A falls in the range of plants C, D, E, and F, while the measurement of plant B stands out. (ii) Regarding the sepal width, plant A has the same measurement as plant B, but differs from plants C, D, E, and F. (iii) Regarding the petal length and width, the measurement of plant A is similar to that of plant B, but deviates from plants C, D, E, and F. To gain a deeper understanding of the attribute value distributions for the Versicolor and Virginica iris species, we present histograms in Fig. 2 examining each attribute individually across all instances in the Iris dataset. The figure reveals that petal length has the widest distribution, whereas petal width exhibits the lowest mean among the four attributes. Although there is a notable overlap in the distributions of Versicolor and Virginica for sepal length and width, the measurements for petal length and width form fairly distinct groups. Thus, distinguishing between the two species requires recognizing the similar petal length and width measurements within the same iris type.

An ideal distance metric should produce a smaller distance between plants A and B compared to that between plant A and the other plants. To achieve this, the distance metric should not be excessively influenced by significant variations in sepal length but should recognize similar petal length and width measurements between iris plants of the same type. Table 2 presents distances between plant A and the other plants under the \(\ell _p\) quasi-norm-based and other commonly used metrics (Manhattan, Euclidean and Chebyshev) for comparison. Larger distances indicate greater dissimilarity from plant A. It is evident that, except for the \(\ell _{0.5}\) quasi-norm-based distance, the other metrics were not able to identify plant B as the most similar to plant A, consequently failing to classify them under the same iris type. These metrics tended to overemphasize dissimilarities in sepal length, a variable with a broad distribution, and failed to capture the shared sepal width as well as the close petal length and width measurements between plants A and B. Due to the concave nature of the contour plots of the \(\ell _p\) quasi-norm function, the \(\ell _p\) quasi-norm-based distance makes better use of identical data-items while avoiding giving excessive weight to dissimilar data-items. This property is particularly advantageous when handling data with slight variations in a single variable, especially one possessing significant discriminatory power for classification, as demonstrated in our experimental results in Sect. 5.
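For illustration, the sketch below evaluates the four metrics compared in Table 2 on a pair of feature vectors. The measurement values shown are placeholders of our own, not the actual entries of Table 1.

```python
import numpy as np

def distances(a, b, p=0.5):
    """All four metrics under comparison for two feature vectors a and b."""
    diff = np.abs(np.asarray(a, float) - np.asarray(b, float))
    return {
        "l0.5 quasi-norm-based": np.sum(diff ** p),
        "Manhattan (l1)": np.sum(diff),
        "Euclidean (l2)": np.sqrt(np.sum(diff ** 2)),
        "Chebyshev (l_inf)": np.max(diff),
    }

# Placeholder sepal/petal measurements (cm); substitute the values from Table 1.
plant_A = [6.0, 2.7, 4.5, 1.5]
plant_B = [5.0, 2.7, 4.6, 1.4]
print(distances(plant_A, plant_B))
```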

Table 2 Distance to plant A under different distance metrics

4 Sub-One Quasi-Norm-Based k-Means Algorithm

Consider a set of vectors \(\mathcal {X}=\{\varvec{x}_i\}^n_{i=1}\) with \(\varvec{x}_i\in \mathbb {R}^m\), \(i=1,\ldots , n\). We want to partition \(\mathcal {X}\) into k exhaustive and mutually exclusive classes (\(2\le k\le n\)). Adopting the \(\ell _p\) quasi-norm-based distance metric, we formalize the clustering problem in the following mathematical form:

$$\begin{aligned} \text {(P):}\quad \min&\quad f({\textbf {W}},{\textbf {Z}})=\sum _{i=1}^n\sum _{j=1}^kw_{ij}d_p(\varvec{x}_i, \varvec{z}_j)\\ \quad \text {s.t.}&\quad \sum _{j=1}^kw_{ij}=1, ~i=1,\ldots , n,\\&\quad w_{ij}\in \{0,1\}, ~i=1,\ldots , n,~j=1,\ldots , k, \end{aligned}$$

where \({\textbf {W}}= [w_{ij}] \in \mathbb {Z}_2^{n\times k}\), \({\textbf {Z}}=[\varvec{z}_1, \ldots , \varvec{z}_k]^\textsf{T} \in \mathbb {R}^{k\times m}\).
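As a concrete reference, the objective of (P) can be transcribed directly into code. The sketch below (assuming NumPy; the function and variable names are our own) evaluates \(f({\textbf {W}},{\textbf {Z}})\) for a binary assignment matrix.

```python
import numpy as np

def clustering_objective(X, W, Z, p=0.5):
    """f(W, Z) = sum_i sum_j w_ij * d_p(x_i, z_j), as in formulation (P)."""
    # X: (n, m) data matrix, W: (n, k) binary assignment matrix, Z: (k, m) centers.
    D = np.sum(np.abs(X[:, None, :] - Z[None, :, :]) ** p, axis=2)   # d_p(x_i, z_j), shape (n, k)
    return np.sum(W * D)
```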

Regardless of the choice of the distance metric, Problem (P) is an NP-hard non-convex programming problem [2]. By relaxing the binary requirements of \(w_{ij}\), this problem can be reformulated as a reduced problem defined as:

$$\begin{aligned} \text {(RP):}\quad \min&\quad F({\textbf {W}}):=\text {inf}_{{\textbf {Z}}}f({\textbf {W}},{\textbf {Z}})\\ \quad \text {s.t.}&\quad {\textbf {W}}\in \mathcal {C}:=\{{\textbf {W}} \in \mathbb {R}^{n\times k} \mid \sum _{j=1}^kw_{ij}=1, w_{ij}\ge 0, ~i=1,\ldots , n,~j=1,\ldots , k\}. \end{aligned}$$

Problems (P) and (RP) are equivalent because, due to the concavity of \(F(\cdot )\), the optimal solutions for Problem (RP) precisely correspond to the extreme points of \(\mathcal {C}\). Meanwhile, the extreme points of \(\mathcal {C}\) inherently satisfy the constraints of Problem (P). Problem (RP) is typically addressed through a greedy alternating minimization algorithm. Here, we introduce a sub-one quasi-norm-based k-means clustering algorithm for solving (RP), based on the well-established Lloyd's k-means method. The following iterative heuristic provides a detailed account of its implementation.

Algorithm 1 Sub-one quasi-norm-based k-means clustering (iterative relocation scheme)
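Since Algorithm 1 is presented as a figure, the following hedged sketch reconstructs the iterative relocation scheme from the surrounding description: an assignment step for Problem (P1) and a coordinate-wise center update for Problem (P2) based on the candidate set of Lemma 1 below. All names are ours, and the stopping rule (unchanged assignments) is one possible choice.

```python
import numpy as np

def dp_matrix(X, Z, p=0.5):
    """Pairwise d_p distances between the n data points and the k centers, shape (n, k)."""
    return np.sum(np.abs(X[:, None, :] - Z[None, :, :]) ** p, axis=2)

def sub_one_kmeans(X, k, p=0.5, max_iter=100, seed=0):
    """Alternate between the assignment step (P1) and the coordinate-wise center update (P2)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    Z = X[rng.choice(n, size=k, replace=False)].astype(float)   # random initial centers
    labels = np.full(n, -1)
    for it in range(max_iter):
        new_labels = dp_matrix(X, Z, p).argmin(axis=1)           # (P1): nearest center under d_p
        if np.array_equal(new_labels, labels):                   # stop when assignments stabilize
            break
        labels = new_labels
        for j in range(k):                                       # (P2): update each center
            members = X[labels == j]
            if len(members) == 0:
                continue
            for l in range(m):                                   # solve (P2-jl) over the candidates
                cands = np.sort(members[:, l])
                costs = np.sum(np.abs(cands[:, None] - members[None, :, l]) ** p, axis=1)
                Z[j, l] = cands[np.argmin(costs)]                # smallest candidate attaining the minimum
    return labels, Z
```

The two alternating steps of this scheme are discussed in detail in the remainder of this section.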

In Algorithm 1, Problem (P) is solved by alternating separate optimizations over the following two problems:

  • (P1): For fixed \({\textbf {Z}}\in \mathbb {R}^{k\times m}\), minimize \(f({\textbf {W}},{\textbf {Z}})\) subject to \({\textbf {W}}\in \mathcal {C}\);

  • (P2): For fixed \({\textbf {W}}\in \mathbb {Z}_2^{n\times k}\), minimize \(f({\textbf {W}},{\textbf {Z}})\) subject to \({\textbf {Z}}\in \mathbb {R}^{k\times m}\).

The solution to Problem (P1) is straightforward. Given any \({\textbf {Z}}\in \mathbb {R}^{k\times m}\), for each \(i=1,\ldots , n\), we set \(w_{ir}=1\) for the smallest r satisfying \(d_p(\varvec{x}_i,\varvec{z}_r)\le d_p(\varvec{x}_i,\varvec{z}_j)\) for all \(j=1,\ldots , k\). Subsequently, we set \(w_{ij}=0\) for all \(j\ne r\). However, the solution to Problem (P2) is not as straightforward, especially when the \(\ell _p\) quasi-norm-based distance function \(d_p\) with \(p\in (0,1)\) is adopted. Problem (P2) is an \(\ell _p\) quasi-norm minimization problem that has been proved to be strongly NP-hard [17]. Due to the non-convexity of \(\ell _p\) quasi-norm, a local optimum of Problem (P2) may not be a global optimal solution.

For a fixed \({\textbf {W}}\in \mathbb {Z}_2^{n\times k}\), as the distance function \(d_p\) is separable by variable, Problem (P2) can be decomposed into independent univariate minimization problems:

$$\begin{aligned} (\text {P2-}jl):\quad \min&\quad f_{jl}({z}_{jl}):=\sum _{i=1}^n w_{ij}{|{z}_{jl}-{x}_{il}}|^p \end{aligned}$$

for all \(j=1,\ldots , k\) and \(l=1,\ldots ,m\). \({\textbf {Z}}^*\) is an optimal solution to Problem (P2) if and only if \({z}^*_{jl}\) solves Problem (P2-jl) for all \(j=1,\ldots , k\) and \(l=1,\ldots ,m\). We note that the local minimizers of Problem (P2-jl) are easy to locate.

Lemma 1

For any \(j=1,\ldots , k\) and \(l=1,\ldots ,m\), Problem (P2-jl) attains its local minimum values at \({x}_{il}\), where \(i\in \{1,\ldots ,n\}\) satisfies \(w_{ij}\ne 0\).

Proof

Consider any j and l. The function \(f_{jl}(\cdot )\) is continuous and differentiable everywhere except at \({z}_{jl}={x}_{il}\) for any \(i=1,\ldots ,n\) with \(w_{ij}\ne 0\). Define \(n_j:=\sum _{i=1}^nw_{ij}\). Let the sequence \(i_1,\ldots ,i_{n_j}\) be an ordered list of the set \(\{i\mid w_{ij}\ne 0,~i=1,\ldots ,n\}\) with \(x_{i_1l}\le \ldots \le x_{i_{n_j}l}\). On any open interval among \((-\infty ,~x_{i_1l}),~(x_{i_1l},~x_{i_2l}),\ldots ,(x_{i_{n_j}l},~+\infty )\), the function \(f_{jl}(\cdot )\) is differentiable and can be expressed as the sum of multiple strictly concave functions. As a result, \(f_{jl}({\cdot })\) is strictly concave on these open intervals, with its graph bending sharply downward towards the two endpoints of each interval. By summing up the derivatives of the individual terms, we obtain the one-sided derivatives \(f'_{jl}({x_{il}}^{+})=+\infty \) and \(f'_{jl}({x_{il}}^{-})=-\infty \) for any \(i\in \{i_1,\ldots ,i_{n_j}\}\). Hence, for any \(i\in \{1,\ldots ,n\}\) with \(w_{ij}\ne 0\), \({z}_{jl}={x}_{il}\) serves as a local minimizer of Problem (P2-jl). Note that \(f_{jl}(\cdot )\) is not differentiable at its local minimum points. \(\square \)

To obtain an optimal solution \({\textbf {Z}}^*\) of Problem (P2), we address each individual subproblem (P2-jl) for all \(j=1,\ldots , k\), \(l=1,\ldots ,m\), and identify the component value \(z_{jl}\) that yields the overall minimum of Problem (P2-jl). Within each (P2-jl), we evaluate the objective over the candidate set \(\{x_{i_1l},\ldots ,x_{i_{n_j}l}\}\) and select the smallest candidate that attains the global minimum of the objective function \(f_{jl}\).
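Concretely, one subproblem (P2-jl) can be solved by the following enumeration (a sketch with our own naming; x_vals collects the values \(x_{il}\) of the points currently assigned to cluster j):

```python
import numpy as np

def solve_p2_jl(x_vals, p=0.5):
    """Global minimizer of f_jl(z) = sum_i |z - x_il|^p over the candidates of Lemma 1."""
    x_vals = np.asarray(x_vals, dtype=float)
    cands = np.sort(np.unique(x_vals))                                # candidate centers
    costs = np.array([np.sum(np.abs(c - x_vals) ** p) for c in cands])
    return cands[np.argmin(costs)]                                    # smallest candidate attaining the minimum
```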

As proved in Selim and Ismail [32], Algorithm 1 converges in a finite number of iterations to a partial optimal solution \(({\textbf {W}}^*,{\textbf {Z}}^*)\) defined as

$$\begin{aligned} f({\textbf {W}}^*,{\textbf {Z}}^*)\le f({\textbf {W}},{\textbf {Z}}^*) \quad&\text { for all } {\textbf {W}}\in \mathcal {C},\\ f({\textbf {W}}^*,{\textbf {Z}}^*)\le f({\textbf {W}}^*,{\textbf {Z}}) \quad&\text { for all } {\textbf {Z}}\in \mathbb {R}^{k\times m}. \end{aligned}$$

It is important to note that (\({\textbf {W}}^*\), \({\textbf {Z}}^*\)) is a partial optimal solution if and only if \({\textbf {W}}^*\) solves Problem (P1) given \({\textbf {Z}} = {\textbf {Z}}^*\) and \({\textbf {Z}}^*\) solves Problem (P2) given \({\textbf {W}} = {\textbf {W}}^*\). Selim and Ismail demonstrated that partial optimal solutions are Kuhn-Tucker points for Problem (P) provided that \(f({\textbf {W}},{\textbf {Z}})\) is differentiable at (\({\textbf {W}}^*\), \({\textbf {Z}}^*\)). However, when utilizing sub-one \(\ell _p\) quasi-norm based distance \(d_p\) in Problem (P), as discussed in Lemma 1, \(f({\textbf {W}},{\textbf {Z}})\) may not be differentiable at \({\textbf {Z}} = {\textbf {Z}}^*\). In this setup, we establish that partial optimal solutions still function as Kuhn-Tucker points for Problem (P).

Theorem 2

A point (\({\textbf {W}}^*\), \({\textbf {Z}}^*\)) is a Kuhn-Tucker point of Problem (P) if and only if it is a partial optimal solution.

Proof

For any \({\textbf {Z}} =[\varvec{z}_1, \ldots , \varvec{z}_k]^{\textsf{T}}\) and \(j=1,\ldots ,k\), let \(S_l(\varvec{z}_j)\) be the active set of \(\varvec{z}_j\) in the lth dimension, defined as

$$ S_l(\varvec{z}_j) = \{ i \mid z_{jl} \ne x_{il} \}.$$

The Kuhn-Tucker conditions of Problem (P) are given by

  (i) \(d_p(\varvec{x}_i,\varvec{z}_j)+\lambda _i\ge 0, ~i=1,\ldots , n,~j=1,\ldots , k.\)

  (ii) \(w_{ij}(d_p(\varvec{x}_i,\varvec{z}_j)+\lambda _i)=0, ~i=1,\ldots , n,~j=1,\ldots , k.\)

  (iii) \(\sum _{j=1}^kw_{ij}=1, ~i=1,\ldots , n.\)

  (iv) \(w_{ij}\ge 0, ~i=1,\ldots , n,~j=1,\ldots , k.\)

  (v) \(\sum _{i\in S_l(\varvec{z}_j)} pw_{ij}|z_{jl}-x_{il}|^{p-1}sgn(z_{jl}-x_{il})=0, ~l=1, \ldots , m, ~j=1,\ldots , k.\)

Assume (\({\textbf {W}}^*\), \({\textbf {Z}}^*\)) is a partial optimal solution. By the definition of partial optimal solutions, \({\textbf {W}}^*\) solves Problem (P1) given \({\textbf {Z}} = {\textbf {Z}}^*\), and thus \({\textbf {W}}^*\) satisfies the Kuhn-Tucker conditions of Problem (P1) characterized by (i)–(iv) when \({\textbf {Z}} = {\textbf {Z}}^*\). Likewise, \({\textbf {Z}}^*\) solves Problem (P2) given \({\textbf {W}} = {\textbf {W}}^*\), and thus \({\textbf {Z}}^*\) satisfies the Kuhn-Tucker condition of Problem (P2) characterized by (v) when \({\textbf {W}} = {\textbf {W}}^*\).

Now assume (\({\textbf {W}}^*\), \({\textbf {Z}}^*\)) is a Kuhn-Tucker point of Problem (P), and suppose, for contradiction, that it is not a partial optimal solution of Problem (P). Then either \({\textbf {W}}^*\) does not solve Problem (P1) with \({\textbf {Z}} = {\textbf {Z}}^*\), or \({\textbf {Z}}^*\) does not solve Problem (P2) with \({\textbf {W}} = {\textbf {W}}^*\). If the former holds, then \({\textbf {W}}^*\) cannot satisfy the Kuhn-Tucker conditions of Problem (P1) with \({\textbf {Z}} = {\textbf {Z}}^*\), which are given by (i)–(iv). If the latter holds, then \({\textbf {Z}}^*\) cannot satisfy the Kuhn-Tucker condition of Problem (P2) with \({\textbf {W}} = {\textbf {W}}^*\), which is given by (v). Either case contradicts the assumption that (\({\textbf {W}}^*\), \({\textbf {Z}}^*\)) is a Kuhn-Tucker point of Problem (P); hence, (\({\textbf {W}}^*\), \({\textbf {Z}}^*\)) is a partial optimal solution. This completes the proof. \(\square \)

It is worth noting that when Algorithm 1 terminates at a partial optimal solution (\({\textbf {W}}^*\), \({\textbf {Z}}^*\)), if \({\textbf {Z}}^*\) is the unique solution of Problem (P2) with \({\textbf {W}} = {\textbf {W}}^*\), the local optimality of \({\textbf {W}}^*\) for Problem (RP) can be guaranteed. This, in turn, makes (\({\textbf {W}}^*\), \({\textbf {Z}}^*)\) a local minimizer of Problem (P), according to Selim and Ismail [32].

5 Numerical Experiments

This section provides empirical evidence regarding the effectiveness of our proposed sub-one quasi-norm-based k-means algorithm on both synthetic and real-life datasets, which also serves as a comparative analysis of the proposed \(\ell _p\) quasi-norm-based distance metric against other commonly used distance metrics. Additionally, we further investigate the applicability of our k-means method for handling complex tasks by implementing it within a deep-learning-based clustering algorithm.

5.1 Experiment Setup

Simulations were executed on all datasets employing the proposed Algorithm 1 and the traditional Lloyd's k-means algorithm with basic distance metrics that are commonly used in most applications. Specifically, the proposed \(\ell _p\) quasi-norm-based distance metric with \(p=0.5\) was tested against conventional distance metrics, including the Manhattan distance (referred to as the \(l_1\) distance) and the squared Euclidean distance (referred to as the \(l_2\) distance), as detailed in Table 3. Moreover, when applicable, the S-distance, chosen among various divergence-based distance functions due to its demonstrated superiority over conventional distance metrics [7], was included in our comparisons.

The various k-means algorithms were executed on the same dataset to facilitate a direct comparison of their clustering results. Given that Lloyd's k-means clustering process produces a locally optimal solution dependent on the initial cluster centers, the implementation of each k-means algorithm was repeated 10 times for each dataset with different initial cluster centers in each repetition. This ensures that any observed performance differences are not due to chance or potential biases stemming from a single initial cluster configuration. In every repetition, a common set of randomly chosen initial cluster centers was employed for all instances of the k-means algorithm involved in the comparison. Every dataset employed in our experiments was equipped with known class labels for each instance. To prevent an inaccurately chosen value of k from affecting algorithm performance, the input parameter k was set to the true number of clusters when executing the k-means algorithms.
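A sketch of this protocol is given below, where run_kmeans stands for any of the compared implementations (a placeholder interface of our own, not from the paper): each repetition draws one common set of initial centers and passes it to every algorithm under comparison.

```python
import numpy as np

def compare_algorithms(X, k, algorithms, n_repeats=10, seed=0):
    """Run each algorithm from the same random initial centers in every repetition."""
    rng = np.random.default_rng(seed)
    results = {name: [] for name in algorithms}
    for _ in range(n_repeats):
        init = X[rng.choice(len(X), size=k, replace=False)]   # shared initial centers
        for name, run_kmeans in algorithms.items():
            labels = run_kmeans(X, k, init.copy())            # placeholder signature
            results[name].append(labels)
    return results
```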

Table 3 Different distance metrics under comparison

We employed the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) to evaluate the similarity between the cluster assignments generated by the k-means algorithms and the true labels. The ARI quantifies the agreement between two class assignments, where a score of 1 indicates a perfect match, and \(-1\) indicates a complete mismatch. Meanwhile, NMI scales the similarity between clustering outcomes and true labels on a range from 0 (indicating no mutual information) to 1 (reflecting perfect correlation). Higher ARI and NMI values signify that the clustering results closely align with the ground-truth.
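Both scores are available, for example, in scikit-learn; a minimal usage sketch with toy label vectors:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = np.array([0, 0, 1, 1, 2, 2])   # ground-truth classes (toy example)
pred_labels = np.array([1, 1, 0, 0, 2, 2])   # cluster assignments (label permutations do not matter)

print(adjusted_rand_score(true_labels, pred_labels))            # 1.0
print(normalized_mutual_info_score(true_labels, pred_labels))   # 1.0
```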

5.2 Experiments on Synthetic Dataset

To validate the clustering performance of the proposed sub-one quasi-norm-based k-means algorithm, we conducted experiments on synthetic datasets constructed with mixtures of multivariate normal distributions. Initially, we generated multiple clusters following a multivariate Gaussian distribution, each with a mean of 0 and a covariance/correlation matrix denoted as

$$\begin{aligned} \theta = \begin{bmatrix} ~1~ &{} b~ &{} \ldots ~ &{} b~ \\ ~b~ &{} 1~ &{} \ldots ~ &{} b~ \\ ~\vdots ~ &{} \vdots ~ &{} \ddots ~ &{} \vdots ~ \\ ~b~ &{} b~ &{} \ldots ~ &{} 1~ \end{bmatrix}, \end{aligned}$$

where \(0<b<1\) is a predefined parameter given by \(b=\frac{K^2-1}{n-1+K^2}\), with \(K>0\) representing the ratio of the maximum eigenvalue to the minimum eigenvalue of \(\theta \). This results in multiple normally distributed clusters spreading along the direction of \((1,\cdots ,1)^T\), with each variable having identical unit variance. To introduce diversity among clusters, we rotated them in different directions and varied their distances and densities by adding and multiplying them with distinct arbitrary vectors. The final dataset was then formed by combining multiple clusters generated in this manner to simulate a multi-cluster scenario.
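The following is a simplified sketch of this generation procedure (assuming NumPy). The rotation angle, shift vector, and correlation level b are arbitrary placeholders rather than the paper's exact settings, and the per-cluster scaling step is omitted.

```python
import numpy as np

def make_cluster(n_points, dim, b, rotation, shift, rng):
    """One Gaussian cluster with equicorrelation covariance (1 on the diagonal, b off it)."""
    cov = np.full((dim, dim), b) + (1.0 - b) * np.eye(dim)
    points = rng.multivariate_normal(np.zeros(dim), cov, size=n_points)
    return points @ rotation.T + shift            # rotate and translate the cluster

rng = np.random.default_rng(0)
dim, b = 2, 0.8                                   # placeholder dimension and correlation
theta = np.pi / 3                                 # arbitrary rotation angle
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
X = np.vstack([make_cluster(100, dim, b, np.eye(dim), np.zeros(dim), rng),
               make_cluster(100, dim, b, rot, np.array([4.0, -2.0]), rng)])
y = np.repeat([0, 1], 100)                        # ground-truth labels
```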

We conducted experiments on a total of eleven synthetic datasets that encompass a diverse range of scenarios. The first five datasets (c2, c3, c4, c5, c6) are standard multi-cluster datasets, each consisting of 2, 3, 4, 5, 6 clusters, respectively. Each cluster contains 100 records. The remaining six datasets were specifically designed to incorporate various clustering challenges, as summarized below.

  • Imbalanced datasets (im3, im4): Dataset im3 comprises 300 records distributed across three clusters containing 50, 100, and 150 records, respectively. Dataset im4 contains 300 records divided into four clusters consisting of 50, 100, 150, and 200 records, respectively.

  • Datasets with noise (n3, n4): In these datasets, each comprising 3 and 4 clusters respectively, 20% of the data points were augmented with normally distributed noise characterized by a mean of 0 and a standard deviation of 0.1.

  • Datasets with outliers (o3, o4): Featuring 3 and 4 clusters respectively, each cluster in these datasets incorporates 20% artificial outliers. These outliers are generated from a uniform distribution on a simplex, where the simplex’s vertices were randomly chosen from the intersection of an \(\ell _2\) norm ball with a radius of 2 and a hyperplane defined by a bias term of 1 and a normal vector forming an angle of \(45^{\circ }\) with the major direction of the corresponding cluster.

Table 4 ARI/NMI levels on the synthetic datasets

As the S-distance metric is exclusively defined on \(\mathbb {R}_{++}^m\), we compare the performance scores obtained by the k-means algorithms using \(\ell _{0.5}\), \(\ell _{1}\), \(\ell _{2}\) distances. Table 4 reports the average and standard deviation of ARI and NMI for the clustering results. The table highlights the best ARI and NMI mean values across different algorithms for each experiment instance in bold. Notably, the proposed k-means algorithm utilizing the \(\ell _{0.5}\) distance metric generally demonstrates superior performance in comparison to other algorithms on these synthetic datasets. This superiority can be attributed to the non-convex shape of the \(\ell _p\) quasi-norm balls with \(p\in (0,1)\), which enables the \(\ell _p\) quasi-norm-based distance to make better use of similar data-items rather than over-emphasizing dissimilar ones. The enhanced clustering performance of our proposed k-means algorithm on datasets c2, c3, c4, c5, c6 is visually evident from Fig. 3. As the cluster number increases, resulting in clusters with significant variations in every dimension, clustering accuracy naturally decreases. Nevertheless, our proposed algorithm maintains its effectiveness. For datasets with a high cluster number, we observed suboptimal and comparable performances across all k-means algorithms. Additionally, the superior clustering performance of our proposed k-means algorithm on datasets im3, im4, n3, n4, o3, o4 is visually evident from Fig. 4. This highlights the algorithm’s efficacy in handling datasets with imbalanced clusters, noise, and outliers. The distinctive features of our proposed algorithm position it as a robust and versatile solution to capture complex patterns and overcome challenges posed by noise and outliers.

5.3 Experiments on Real-Life Dataset

To verify the effectiveness of the proposed k-means algorithm for clustering real-life data, we conducted experiments on several datasets from the UCI Machine Learning Repository [1, 8, 19]. The details of datasets are provided below.

  • Wine: This dataset contains 178 instances, each including 13 integer-valued constituents found in wines grown in the same region in Italy but derived from three different cultivars.

  • Iris: This dataset consists of three different types of iris plants: Setosa, Versicolor, Virginica. Each type contains 50 instances and each instance has four real-valued attributes.

  • Seeds: This dataset contains 210 instances, and each instance includes seven real-valued continuous geometric measurements of wheat kernels belonging to three different varieties of wheat.

  • Ceramic: This dataset includes 88 instances that can be classified into body or glaze based on 17 chemical compositions obtained from energy dispersive X-ray fluorescence.

  • Algerian forest fires: This dataset comprises 244 instances of forest regions in Algeria, which includes ten attributes and can be classified into fire and not fire classes.

  • Facebook live sellers: This dataset contains Facebook pages of Thai fashion and cosmetics retail sellers. The 6622 instances with nine integer-valued attributes can be divided into two classes: video and photo.

Table 5 provides the average and standard deviation of ARI and NMI for clustering results across the six real-life datasets, which underscores the superior performance of the proposed distance metric in capturing the underlying structures in specific real-life datasets. The table highlights the best ARI and NMI mean values across different algorithms for each experiment instance in bold. On the Wine, Ceramic, Algerian forest fires, and Facebook live sellers datasets, the k-means algorithm using the \(\ell _{0.5}\) quasi-norm-based distance demonstrates superior clustering accuracy compared to the other algorithms. However, on the Seeds dataset, while the proposed k-means algorithm maintains comparable performance, it does not exhibit a discernible advantage over the other algorithms. In contrast, on the Iris dataset, the utilization of the \(\ell _{0.5}\) quasi-norm-based distance metric yields improved clustering results compared to both the Manhattan and Euclidean distances, although it falls short of surpassing the performance achieved by the S-distance.

Fig. 3 Results obtained on the standard multi-cluster datasets with varying cluster numbers. Each row corresponds to a different dataset (c2, c3, c4, c5, c6). Column 1 illustrates the original cluster structure. Columns 2 and 3 indicate the class assignments generated by the Lloyd's k-means algorithm with the squared Euclidean distance and the proposed k-means algorithm with the \(\ell _{0.5}\) quasi-norm-based distance metric, respectively

Fig. 4 Results obtained on the complex synthetic datasets. Each row corresponds to a different dataset (im3, im4, n3, n4, o3, o4). Column 1 illustrates the original cluster structure. Columns 2 and 3 indicate the class assignments generated by the Lloyd's k-means algorithm with the squared Euclidean distance and the proposed k-means algorithm with the \(\ell _{0.5}\) quasi-norm-based distance metric, respectively

To assess the robustness of the proposed sub-one quasi-norm-based k-means clustering algorithm, we conducted experiments on intentionally corrupted datasets with added noise. Specifically, we introduced a pre-specified amount of noise into two real-life datasets, namely Iris and Seeds, for which the \(\ell _p\) quasi-norm-based distance metric does not yield the best results in Table 5. We generated normally distributed noise with a mean of 0 and applied it to approximately 10%, 20%, and 30% of the data points in each dataset. To account for the different scales of variation across variables, the standard deviation of the added noise was set equal to the mean value of each variable. The clustering experiments involved 10 trials for each combination of distance metric and noise percentage. The presence of noise introduces challenges in recovering the underlying cluster structures of the dataset.
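A sketch of this noise-injection step under the stated convention (zero-mean Gaussian noise, per-variable standard deviation equal to that variable's mean, applied to a given fraction of the points) is shown below; the function name and interface are ours.

```python
import numpy as np

def add_noise(X, fraction, rng):
    """Perturb a random subset of rows with zero-mean Gaussian noise (std = per-variable mean)."""
    X_noisy = X.astype(float).copy()
    n_noisy = int(round(fraction * len(X)))
    idx = rng.choice(len(X), size=n_noisy, replace=False)    # points to corrupt
    sigma = X.mean(axis=0)                                   # per-variable noise standard deviation
    X_noisy[idx] += rng.normal(0.0, sigma, size=(n_noisy, X.shape[1]))
    return X_noisy

# e.g. corrupt roughly 10% of the Iris or Seeds feature matrix X:
# X10 = add_noise(X, 0.10, np.random.default_rng(0))
```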

Table 6 presents the average and standard deviation of ARI and NMI for k-means clustering results on the noisy data, categorized by dataset, noise percentage, and distance metric. The table highlights the best ARI and NMI mean values across different algorithms for each experiment instance in bold. A comparison between Tables 5 and 6 reveals a deterioration in the quality of clustering results when noise is present in the data. However, on both datasets the results consistently indicate that the proposed \(\ell _{0.5}\) quasi-norm-based distance produces the best clustering results among the compared distance metrics. This underscores the effectiveness of the proposed k-means method in clustering data while mitigating the adverse effects of noise, and suggests that the k-means clustering method with the \(\ell _p\) quasi-norm-based distance exhibits resistance to noise and outliers in the data, highlighting the efficacy and robustness of the proposed algorithm.

Table 5 ARI/NMI levels on the real-life datasets
Table 6 ARI/NMI levels on the Seeds and Iris datasets with noise

5.4 Experiments with Deep-Learning Models

To investigate the practical applicability of the sub-one quasi-norm-based metric, we have integrated the proposed k-means algorithm into advanced deep learning models. Evaluating the clustering algorithm within a deep learning context provides valuable insights into its applicability for more complex tasks and real-world scenarios, especially when dealing with high-dimensional data.

This evaluation was conducted on the cutting-edge generalized deep learning clustering (GDLC) algorithm [36]. The GDLC algorithm focuses on dimensionality reduction by finding the product of two low-dimensional matrices that approximates the original high-dimensional matrix. This deep learning approach enables the representation of high-dimensional data in a lower-dimensional space and facilitates the completion of clustering tasks in this reduced space. In the GDLC framework, a nonlinear constrained non-negative matrix factorization based on stochastic gradient descent is initially employed as an element update method. The resulting generalized weights, generalized biases, and activation function are then combined to form a generalized deep learning network for updating the elements of the low-dimensional matrix.

In the original GDLC algorithm, an \(\ell _2\)-norm-based k-means algorithm is employed for clustering. Notably, we replaced this segment of the existing GDLC clustering process with our proposed k-means algorithm. This modification allows us to compare the performance of two GDLC versions utilizing different metrics. Our experiments involve eight public datasets that pose a significant challenge due to their high-dimensional feature spaces. These datasets encompass six biological datasets, one image dataset, and one text dataset [3, 6, 27].

  • ALLAML: This dataset contains a total of 72 instances classified into two categories, ALL and AML. Each instance includes 7129 gene expression values.

  • LUNG_DISCRETE: This dataset features 73 instances distributed among seven classes. Each instance consists of 325 gene expression values.

  • ARCENE: This dataset contains 200 instances with 10000 features derived from mass-spectrometric data, categorized in two classes: cancer pattern and normal pattern.

  • leukemia: This dataset consists of 72 bone marrow instances based on probes from human genes, with two classes: acute lymphoblastic and acute myeloid.

  • colon: This dataset contains 62 instances with 2000 genes, divided into two classes: tumor tissues and normal tissues.

  • nci9: This biological dataset encompasses 60 instances, each characterized by 9712 features and classified into nine distinct classes.

  • ORL: This dataset comprises 400 instances representing ten distinct images of 40 different subjects. Each instance is characterized by 1024 features and falls into one of eleven classes.

  • RELATHE: This dataset comprises 1427 instances, each defined by 4322 features, and is organized into two distinct newsgroups.

Table 7 ARI/NMI levels for the GDLC algorithm using different distances

Table 7 presents the average and standard deviation of ARI and NMI for the clustering results obtained from 10 random implementations of the GDLC algorithm. The table highlights the best ARI and NMI mean values across different algorithms for each experiment instance in bold. On the ALLAML, LUNG_DISCRETE, ARCENE, leukemia, colon, and nci9 datasets, the incorporation of our proposed k-means algorithm into GDLC outperforms the original GDLC algorithm in terms of ARI or NMI. This underscores the effectiveness of our proposed metric in capturing distinctive patterns exhibited by certain data types, such as biological data involving gene expression patterns or molecular profiles. However, the updated GDLC algorithm demonstrates relatively lower performance on the ORL and RELATHE datasets, indicating challenges in handling the inherent complexities associated with image and text datasets. These datasets typically possess different feature spaces and structures that require clustering algorithms to adapt to diverse modalities. This discrepancy in performance across data types indicates the sensitivity of our proposed k-means algorithm to the intrinsic characteristics and structures present in different types of data, emphasizing its context-dependent applicability.

6 Summary

The k-means algorithm is a widely used clustering algorithm that partitions a given dataset into k clusters based on similarity. In this study, we propose an enhancement to the k-means algorithm by employing the \(\ell _p\) quasi-norm-based distance metric, specifically with \(p\in (0,1)\). Compared to commonly used distance metrics, the proposed sub-one quasi-norm-based distance metric excels at capturing similarities among data items while mitigating the overemphasis on dissimilarities. We substantiate the effectiveness of the proposed k-means algorithm through theoretical analysis, providing a proof of convergence to a Kuhn-Tucker point. Extensive experimental results on synthetic and real-life datasets further highlight the superiority of our algorithm over k-means algorithms using other distance metrics across diverse scenarios. Notably, for datasets with added noise, the sub-one \(\ell _p\) quasi-norm-based distance metric successfully addresses a known drawback of k-means, demonstrating its robustness against noise. We also integrated our proposed k-means method into the generalized deep learning clustering algorithm; the results indicate the method's sensitivity to the inherent characteristics and structures of the data and its context-dependent applicability. As a prospective avenue, extending the sub-one \(\ell _p\) quasi-norm-based distance to fuzzy c-means (FCM) type algorithms presents a promising direction for future research.