1 Introduction

The aim of clustering is to identify subsets of items in a given dataset that are more similar to one another than to the remaining items, using similarity measures. Various authors have applied different criteria or similarity measures to group items into clusters, but the sum of squared distances is the most widely accepted similarity measure for clustering problems. Cluster analysis has proven its significance in many areas such as pattern recognition [45, 47], image processing [38], process monitoring [40], machine learning [1], quantitative structure-activity relationships [9], document retrieval [18], bioinformatics [16], image segmentation [34] and many more. Owing to this wide applicability across domains, a large number of clustering algorithms have been developed and applied successfully. Generally, clustering algorithms can be classified into two groups: hierarchical clustering algorithms and partition-based clustering algorithms [4, 5, 26, 29]. In hierarchical algorithms, a tree structure over the data is formed by merging or splitting data according to some similarity criterion. In partition-based algorithms, clustering is performed by relocating data between clusters according to a clustering criterion, such as Euclidean distance. From the literature, it has been found that partition-based algorithms are more efficient and popular than hierarchical ones [19]. The most popular and widely used partition-based algorithm is the \(K\)-means algorithm. It is fast, simple to implement, and has linear time complexity [11, 20]. In the \(K\)-means algorithm, a dataset is divided into a predefined number \(K\) of clusters by minimizing the intra-cluster Euclidean distance [19]. However, this algorithm has some limitations: its results depend strongly on the initial cluster centers, and it can get stuck in local minima [21, 35]. To overcome these pitfalls, several heuristic algorithms have been developed. A \(K\)-harmonic means algorithm was proposed as an alternative to \(K\)-means in [46]. A simulated annealing (SA)-based approach was developed in [36]. Tabu search (TS)-based methods were introduced in [2, 39]. Genetic algorithm (GA)-based methods were presented in [6, 27, 30, 31]. Fathian et al. [10] developed a clustering algorithm based on honey-bee mating optimization (HBMO). Shelokar et al. [37] proposed an ant colony optimization (ACO)-based approach for clustering. Particle swarm optimization (PSO) was applied to clustering in [44]. Hatamlou et al. employed a big bang-big crunch algorithm for data clustering in [13]. Karaboga and Ozturk presented a novel clustering approach based on the artificial bee colony (ABC) algorithm in [23]. Data clustering based on the gravitational search algorithm was presented in [14, 15]. However, every algorithm has its own drawbacks: the \(K\)-means algorithm gets stuck in local optima; the convergence of the genetic algorithm is highly dependent on the initial population; and in ACO, the quality of the solution vector deteriorates as the number of iterations increases.

The aim of this research work is to explore the capability of the charged system search (CSS) algorithm for data clustering. The CSS algorithm is a recent meta-heuristic optimization technique developed by Kaveh and Talatahari [24]. It is based on three principles: Coulomb's law, Gauss's law and Newton's second law of motion. Every meta-heuristic algorithm balances two complementary processes, exploration and exploitation. Exploration refers to generating promising regions of the search space, while exploitation refers to identifying the most promising solutions within those regions. In CSS, exploration is carried out using Coulomb's and Gauss's laws, while Newton's second law of motion is applied to perform exploitation. The performance of the proposed algorithm has been evaluated on two artificial datasets and several real datasets from the UCI repository and compared with some existing algorithms; the quality of the solutions is improved using the CSS algorithm.

2 CSS algorithm for clustering

In this section, the CSS algorithm is explained as applied to the clustering problem. The aim is to find the optimal cluster centers for assigning \(N\) items to \(K\) cluster centers in \(R^{n}\). In the CSS algorithm, the sum of squared Euclidean distances is taken as the objective function for the clustering problem, and each item is assigned to the cluster center at the minimum Euclidean distance. The algorithm starts by defining the initial positions and velocities of \(K\) charged particles (CPs). The initial positions of the CPs are defined randomly; each CP is assumed to be a charged sphere of radius '\(a\)', and its initial velocity is set to zero. Thus, the algorithm starts with randomly defined center points and ends with optimal cluster centers. Consider Table 1, which illustrates a dataset used to explain the working of the CSS algorithm for clustering, with \(N=10\), \(n=4\) and the number of cluster centers \(K = 3\). To obtain the optimal cluster centers, the CPs use the resultant electric force (attracting force vector), the masses and the moving probabilities of the particles and cluster centers. After the first iteration, the velocities of the CPs are determined and their locations are updated. The objective function is recalculated at the new CP positions and compared with the old CP positions stored in a memory pool called the charged memory (CM). The CM is then updated with the new positions of the CPs, and the worst CPs are excluded from it. As the algorithm proceeds, the positions of the CPs are updated along with the contents of the CM. This process continues until the maximum number of iterations is reached or no better CP positions are generated.

Table 1 A dataset to explain the CSS algorithm for clustering with \(N=10\), \(n=4\) and \(K=3\)

2.1 Algorithm details

As described earlier, the algorithm starts by identifying the initial positions and velocities of the CPs in a random fashion. The initial positions of the randomly defined CPs (cluster centers) are obtained from the initialization equation of the original CSS, modified for the clustering problem as follows.

$$\begin{aligned} C_{k,i} = X_{i,\min } +r_i \times ( {X_{i,\max } -X_{i,\min } }), \quad i=1,2,\ldots , n,\ k=1,2,\ldots , K \end{aligned}$$
(1)

In the above equation, \(C_{k,i}\) denotes the \(i\)th coordinate of the \(k\)th cluster center, \(r_{i}\) is a random number whose value lies between 0 and 1, \(X_{i,\min}\) and \(X_{i,\max}\) denote the minimum and maximum values of the \(i\)th attribute of the dataset, and \(K\) is the total number of cluster centers. The initial positions of the CPs are given in Table 2.
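For illustration, the initialization of Eq. (1) can be sketched in Python/NumPy as follows, treating the dataset \(X\) as an \(N \times n\) matrix; the function name and array conventions are our own assumptions, not the authors' Matlab implementation.

```python
import numpy as np

def init_charged_particles(X, K, rng=None):
    """Draw K initial CP positions (cluster centers) per Eq. (1):
    each coordinate is sampled uniformly between the minimum and
    maximum value of the corresponding attribute."""
    rng = np.random.default_rng(rng)
    x_min, x_max = X.min(axis=0), X.max(axis=0)   # per-attribute bounds
    r = rng.random((K, X.shape[1]))               # r_i uniform in [0, 1]
    return x_min + r * (x_max - x_min)            # C_{k,i}
```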

Table 2 Initial positions of CPs obtained using Eq. (1)

It is assumed that the initial velocities of CPs are set to zero.

$$\begin{aligned} V_k =0, \quad k=1,2,3,\ldots , K \end{aligned}$$
(2)

In the CSS algorithm, the CPs are modeled as charged spheres. Hence, every CP carries a mass, which is calculated using the following equation.

$$\begin{aligned} m_k =\frac{\hbox {fit}(k)-\hbox {fit}({\hbox {worst}})}{\hbox {fit}({\hbox {best}})-\hbox {fit}({\hbox {worst}})} \end{aligned}$$
(3)

where fit(\(k\)) denotes the fitness value of the \(k\)th CP, fit(best) the best fitness value and fit(worst) the worst fitness value in the population.

The masses of the initially positioned CPs are 0.91315, 0.70944 and 1.3907. The sum of squared Euclidean distances is used as the objective function in the CSS algorithm to measure the closeness of items to the CPs, and each item is assigned to the CP with the minimum objective value. Table 3 provides the objective function values of the initially positioned CPs for our example dataset. The Euclidean distance between item \(X_j\) and cluster center \(C_k\) can be written as

$$\begin{aligned} d_{j,k} =\left\| {X_j -C_k } \right\| =\sqrt{\sum \limits _{i=1}^n {( {X_{j,i} -C_{k,i} })^2} }, \quad j=1,2,\ldots ,N,\ k=1,2,\ldots ,K \end{aligned}$$
(4)
Table 3 Normalized values of the objective function
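A minimal sketch of the distance computation of Eq. (4) and the resulting nearest-center assignment, under the same conventions as above (names are ours):

```python
import numpy as np

def assign_and_score(X, C):
    """Assign each item to its nearest center and return the cluster
    labels together with the sum of intra-cluster distances (the
    clustering objective used throughout the paper)."""
    # d[j, k] = Euclidean distance between item X_j and center C_k, Eq. (4)
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    labels = d.argmin(axis=1)                     # nearest-center assignment
    objective = d[np.arange(len(X)), labels].sum()
    return labels, objective
```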

The information contained in the assignment string below is used to arrange the items into the different clusters.

3 1 1 2 3 3 2 3 2 1

From the above string, it is observed that the first, fifth, sixth and eighth items belong to the third cluster; the second, third and tenth items belong to the first cluster; and the fourth, seventh and ninth items belong to the second cluster. Hence, at this step the dataset is divided into three clusters, and the positions of the CPs are stored in the charged memory (CM), which memorizes CP positions. Later, these stored positions are compared with the newly generated CP positions; the best positions are added to the CM and the worst positions are excluded from it. Here, the size of the CM is equal to \(N/4\). The main task of the CM is to keep track of the good CP positions obtained during the execution of the CSS algorithm; after the algorithm terminates, the optimal CP positions (i.e. the \(K\) cluster centers) are determined from the minimum objective function values. The above discussion covers the initialization of the CSS algorithm for the clustering problem.

Now we describe the main steps of the CSS algorithm, i.e., how the new positions and velocities of the CPs are generated. From the study of various meta-heuristic algorithms, it is found that every such algorithm combines two mechanisms, exploration and exploitation: exploration searches the random solution space broadly so that the most promising regions are found, while exploitation extracts good solution vectors from those promising regions. In the CSS algorithm, exploration is initiated using Coulomb's and Gauss's laws, while exploitation is performed by Newton's second law of motion. The search starts by measuring the electric force \(E_{i,k}\) generated by a CP, which may act at a point either inside or outside the CP. The direction of this force is captured by the moving probability (\(P_{i,k}\)) of the CPs, while Coulomb's and Gauss's laws are applied to measure the total electric force exerted on a CP, called the actual electric force \(F_{k}\). The moving probability \(P_{i,k}\) for each CP is determined using Eq. (5).

$$\begin{aligned} p_{ik} =\left\{ \begin{array}{ll} 1 &{}\quad \hbox {if } \dfrac{\hbox {fit}(i)-\hbox {fit}(\hbox {best})}{\hbox {fit}(k)-\hbox {fit}(i)}>\hbox {rand} \,\vee \, \hbox {fit}(k)>\hbox {fit}(i) \\ 0 &{}\quad \hbox {otherwise} \end{array} \right. \end{aligned}$$
(5)

The value of the moving probability \(P_{i,k}\) is either 0 or 1 and gives information about the movement of the CPs. Table 4 shows the moving probability values for each particle with respect to each cluster center.

Table 4 Moving probability \(P_{ik}\) of each CP with each item of dataset
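Eq. (5) can be sketched as follows, assuming a minimization objective so that fit(best) is the smallest fitness value; the vectorized form and the small guard against division by zero are our own choices:

```python
import numpy as np

def moving_probability(fit_x, fit_k, rng=None):
    """p_{ik} of Eq. (5) for one CP with fitness fit_k, evaluated
    against the fitness vector fit_x of all items."""
    rng = np.random.default_rng(rng)
    best = fit_x.min()                            # fit(best) under minimization
    ratio = (fit_x - best) / (fit_k - fit_x + 1e-12)
    return ((ratio > rng.random(len(fit_x))) | (fit_k > fit_x)).astype(float)
```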

Coulomb's and Gauss's laws are employed to determine the actual electric force \(F_{k}\) exerted on the CPs: Coulomb's law gives the force outside a CP and Gauss's law the force inside it. The generalized equation for the actual electric force \(F_{k}\) is:

$$\begin{aligned} F_k&= q_k \mathop \sum \limits _{i,i\ne k} \left( {\frac{q_i }{a^3}r_{ik} \times i_1 +\frac{q_i }{r_{ik}^2 }\times i_2 }\right) \times p_{ik} \times ( {X_i -C_k }),\nonumber \\&\left\{ \begin{array}{l} {k=1, 2, 3,\ldots , K} \\ {i_1 =1, i_2 =0 \leftrightarrow r_{ik} <a} \\ {i_1 =0, i_2 =1 \leftrightarrow r_{ik} \ge a} \\ \end{array} \right. \end{aligned}$$
(6)

Here, \(q_{i}\) and \(q_{k}\) represent the charges of the \(i\)th particle and the \(k\)th CP, \(r_{i,k}\) is the separation distance between them, \(i_{1}\) and \(i_{2}\) are indicator variables taking the value 0 or 1, and '\(a\)' is the radius of the CPs; each CP is assumed to have a uniform volume charge density, which changes in every iteration. The values of \(q_{i}\), \(r_{i,k}\) and '\(a\)' are evaluated as follows:

$$\begin{aligned}&q_i=\frac{\hbox {fit}(i)-\hbox {fit}(\hbox {worst})}{\hbox {fit}(\hbox {best})-\hbox {fit}(\hbox {worst})}\ i=1,\,2,\,3,\ldots ,\,N \end{aligned}$$
(7)
$$\begin{aligned}&r_{ik} =\frac{\Vert X_i -C_k\Vert }{\Vert (X_i +C_k )/2-X_\mathrm{best}\Vert +\varepsilon }\end{aligned}$$
(8)
$$\begin{aligned}&a=0.10\times \max (\{x_{i,\mathrm{max}} -x_{i,\mathrm{min}} \vert i=1,2,3,\ldots ,n\}) \end{aligned}$$
(9)

Tables 5, 6 and 7 provide the values of the variables \(q_{i}\), \(r_{i,k}\), \(i_{1}\), \(i_{2}\) and (\(X_{i}-C_{k}\)) used to calculate the actual electric force \(F_{k}\) applied to the CPs; the values of '\(a\)' are 0.4647, 0.91302 and 1.4191, calculated using Eq. (9). The values of the electric force \(F_{k}\) are used with Newton's second law of motion to determine the new positions and velocities of the CPs.

Table 5 Values of the magnitude of charge (\(q_{i}\)) of each CP and the separation distance (\(r_{i,k}\))
Table 6 Values of \(i_{1}\) and \(i_{2}\)
Table 7 Values of \(X_{i}-C_{k}\) for each cluster center \(k\)
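Putting Eqs. (6)-(9) together, the resultant force on each CP can be sketched as below; the charge normalizations, the choice of the best CP position as \(X_\mathrm{best}\) in Eq. (8) and all names are our own assumptions:

```python
import numpy as np

def resultant_force(X, C, fit_x, fit_c, p, eps=1e-12):
    """Actual electric force F_k on each CP, following Eqs. (6)-(9).
    X: (N, n) items; C: (K, n) CP positions; fit_x, fit_c: fitness of
    items and CPs; p: (N, K) moving probabilities from Eq. (5)."""
    a = 0.10 * (X.max(axis=0) - X.min(axis=0)).max()      # radius, Eq. (9)
    best, worst = fit_x.min(), fit_x.max()
    q_x = (fit_x - worst) / ((best - worst) if best != worst else 1.0)  # Eq. (7)
    c_best, c_worst = fit_c.min(), fit_c.max()
    q_c = (fit_c - c_worst) / ((c_best - c_worst) if c_best != c_worst else 1.0)
    x_best = C[np.argmin(fit_c)]                          # best CP (assumption)
    F = np.zeros_like(C)
    for k in range(len(C)):
        for i in range(len(X)):
            r = np.linalg.norm(X[i] - C[k]) / (
                np.linalg.norm((X[i] + C[k]) / 2 - x_best) + eps)   # Eq. (8)
            coef = q_x[i] * (r / a**3 if r < a else 1.0 / r**2)     # Eq. (6)
            F[k] += coef * p[i, k] * (X[i] - C[k])
        F[k] *= q_c[k]
    return F
```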

The actual electric force \(F_{k}\) exerted on the CPs is measured as discussed above; the values of \(F_{k}\) for the initial CPs (cluster centers) are 0.85871, 0.29861 and 0.33496.

Newton's second law of motion is employed to obtain the new positions and velocities of the CPs. This step constitutes the exploitation of solution vectors from the random search space. The new CP positions and velocities are obtained from Eqs. (10) and (11). \(Z_\mathrm{a}\) and \(Z_\mathrm{v}\) act as control parameters that govern the exploration and exploitation processes of the CSS algorithm by weighting the previous velocities and the actual resultant force exerted on a CP. \(Z_\mathrm{a}\) is the control parameter associated with the actual electric force \(F_{k}\) and controls the exploitation process; a large value of \(Z_\mathrm{a}\) increases the convergence speed of the algorithm, while a small value increases its computational time. \(Z_\mathrm{v}\) is the control parameter for the exploration process and acts on the velocities of the CPs. Note that \(Z_\mathrm{a}\) is an increasing function of the iteration count while \(Z_\mathrm{v}\) is a decreasing one. Table 8 provides the new positions of the CPs evaluated using the CSS algorithm

$$\begin{aligned} C_{k,\mathrm{new}}&= \hbox {rand}_1 \times Z_\mathrm{a} \times \frac{F_k }{m_k }\times \Delta t^2\nonumber \\&+\hbox {rand}_2 \times Z_\mathrm{v} \times V_{k,\mathrm{old}} \times \Delta t+C_{k,\mathrm{old}}\end{aligned}$$
(10)
$$\begin{aligned} V_{k,\mathrm{new}}&= \frac{C_{k,\mathrm{new}} - C_{k,\mathrm{old}} }{\Delta t}, \end{aligned}$$
(11)

where rand\(_{1}\) and rand\(_{2}\) are two random numbers whose values lie between 0 and 1, \(Z_\mathrm{a}\) and \(Z_\mathrm{v}\) are the control parameters that weight the actual electric force and the previous velocity, \(m_{k}\) is the mass of the \(k\)th CP, which is equal to \(q_{k}\), and \(\Delta t\) is the time step, set to 1.

The new positions of the CPs are listed in Table 8, and the values of the control parameters \(Z_\mathrm{a}\) and \(Z_\mathrm{v}\) are determined using Eq. (12). The new velocities (\(V_{k,\mathrm{new}}\)) of the CPs are 0.4273, 0.0498 and 0.1596.

$$\begin{aligned} Z_\mathrm{a}&= ({1+\hbox {iteration}/\hbox {iteration max}}),\nonumber \\ Z_\mathrm{v}&= (1-\hbox {iteration}/\hbox {iteration max}) \end{aligned}$$
(12)
Table 8 New positions of CPs
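The update of Eqs. (10)-(12) can be sketched as follows, writing \(Z_\mathrm{a}\) as increasing and \(Z_\mathrm{v}\) as decreasing with the iteration count, consistent with the description above (the function signature is ours):

```python
import numpy as np

def update_positions(C, V, F, m, it, it_max, dt=1.0, rng=None):
    """New CP positions (Eq. 10) and velocities (Eq. 11) from Newton's
    second law; m holds the CP masses of Eq. (3), dt is the time step."""
    rng = np.random.default_rng(rng)
    z_a = 1.0 + it / it_max                       # force coefficient, Eq. (12)
    z_v = 1.0 - it / it_max                       # velocity coefficient, Eq. (12)
    r1, r2 = rng.random(2)
    C_new = r1 * z_a * (F / m[:, None]) * dt**2 + r2 * z_v * V * dt + C
    V_new = (C_new - C) / dt
    return C_new, V_new
```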

Hence, from the above discussion, the process of the algorithm can be divided into three phases: initialization, search and termination. The initialization phase sets the CSS algorithm parameters and the positions and velocities of the initial CPs, evaluates and ranks the objective function values, and stores the CP positions in the CM. In the search phase, the new positions and velocities of the CPs are determined using the moving probability \(P_{i,k}\) and the actual electric force \(F_{k}\); the objective function is evaluated at the newly generated CPs and compared with the previous CPs, and the best CPs are ranked and stored in the CM. The termination condition is either reaching the maximum number of iterations or repeated CP positions. The flowchart of the CSS algorithm for data clustering is depicted in Fig. 1.
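These three phases can be tied together in a compact driver loop. The sketch below reuses the functions defined earlier; the fitness definitions (an item's fitness is its distance to the nearest center, a CP's fitness is the summed distance of its cluster) and the mass floor are plausible choices of our own, since the paper leaves them implicit:

```python
import numpy as np

def css_clustering(X, K, it_max=100, seed=0):
    """High-level CSS clustering loop: initialization, search, termination.
    For brevity, the charged memory is reduced to tracking the single
    best solution found so far."""
    rng = np.random.default_rng(seed)
    C = init_charged_particles(X, K, rng)                 # Eq. (1)
    V = np.zeros_like(C)                                  # Eq. (2)
    best = None
    for it in range(1, it_max + 1):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        obj = d[np.arange(len(X)), labels].sum()          # objective, Eq. (4)
        if best is None or obj < best[0]:
            best = (obj, C.copy(), labels)
        fit_x = d.min(axis=1)                             # item fitness (assumption)
        fit_c = np.array([d[labels == k, k].sum() for k in range(K)])
        spread = fit_c.min() - fit_c.max()
        m = (fit_c - fit_c.max()) / (spread if spread != 0 else 1.0)  # Eq. (3)
        m = np.clip(m, 1e-3, None)                        # floor to avoid F / 0
        p = np.stack([moving_probability(fit_x, fit_c[k], rng)
                      for k in range(K)], axis=1)         # Eq. (5)
        F = resultant_force(X, C, fit_x, fit_c, p)        # Eqs. (6)-(9)
        C, V = update_positions(C, V, F, m, it, it_max, rng=rng)  # Eqs. (10)-(12)
    return best
```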

Fig. 1 Flowchart of CSS algorithm for data clustering

3 Cluster analysis

The objective of a clustering algorithm is to group similar data objects together. In the literature, many techniques have been employed for cluster analysis, such as partition-based, hierarchical, density-based and artificial intelligence-based clustering. In this section, the artificial intelligence-based clustering methods used for comparison with our proposed algorithm are described: \(K\)-means, the genetic algorithm (GA), PSO and ACO.

3.1 Application of \(K\)-means in clustering

The \(K\)-means algorithm is one of the oldest and most popular clustering methods, developed by MacQueen [28], and has been widely applied in data clustering. It is simple, fast and robust. In the \(K\)-means algorithm, the sum of Euclidean distances is used as the similarity criterion to find a predefined number of cluster centers in a given dataset. The algorithm starts with randomly initialized cluster centers, and the data vectors are then assigned to the predefined number of cluster centers according to the minimum Euclidean distance. The cluster centers are updated as the means of the data vectors within each cluster, and this process is repeated until there is no further improvement in the cluster centers.
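For reference, the K-means procedure just described can be sketched as follows; initialization by sampling \(K\) distinct data points is one common choice and not necessarily the one used in the compared implementation:

```python
import numpy as np

def kmeans(X, K, it_max=100, seed=0):
    """Plain K-means: random initial centers, nearest-center assignment,
    mean update, repeated until the centers stop changing."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(it_max):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        C_new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else C[k] for k in range(K)])
        if np.allclose(C_new, C):                 # no further improvement
            break
        C = C_new
    return C, labels
```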

3.2 Applications of Genetic Algorithm (GA) in clustering

John Holland proposed the genetic algorithm based on Darwin's theory of evolution [17], and it has been applied to many function optimization problems. In a GA, a string of chromosomes describes the parameters of the random search space, and an objective function associated with each string defines its quality. Three operators are applied to find optimal solutions: selection, crossover and mutation. To solve the clustering problem, Murthy and Chowdhury [31] applied a genetic algorithm-based method and evaluated its performance in three experiments on synthetic and real-life datasets; the results show that the GA improves on the final outcome of \(K\)-means. Al-Sultan and Maroof Khan [3] studied several algorithms, namely the \(K\)-means algorithm, simulated annealing, tabu search and the genetic algorithm, and compared their performance on clustering problems. Krishna and Narasimha Murty [27] proposed the GKA algorithm for clustering and proved that this hybrid algorithm converges to an optimal solution. Maulik and Bandyopadhyay [30] applied the GA to clustering and evaluated its performance on four artificial and three real-life datasets; the results show that the genetic algorithm is superior to the \(K\)-means algorithm. Tseng and Yang [42, 43] proposed a genetic algorithm-based approach that automatically finds the proper number of clusters, and the outcome of their algorithm is superior to \(K\)-means.

3.3 Applications of particle swarm optimization in clustering

Particle swarm optimization is a population-based stochastic search algorithm introduced in [25] that has been widely used to solve a broad range of optimization problems. The algorithm is based on the social behavior of birds, insects, etc., and models the dispersion of individual knowledge among all members of a group: if one member finds a desirable path, the rest of the swarm follows it. In PSO, this behavior is described by particles, each associated with a position and a velocity in a random search space. The algorithm starts with a randomly initialized population. Each particle moves through the search space and remembers the best position it has observed; the particles share their good positions with one another and update their own positions and velocities based on them. The velocity update is based on the historical behavior of the particle itself as well as of its neighbors, so the particles move towards better areas of the search space over the course of the search. To investigate the performance of the PSO algorithm, Van der Merwe and Engelbrecht [44] applied PSO to clustering in two ways: first, PSO is used to obtain optimal cluster centers for a predefined number of clusters; second, PSO is used to refine the initial cluster centers for the \(K\)-means algorithm. The results show that the PSO approaches have a better convergence rate. A hybrid PSO-SA algorithm was developed to obtain good cluster partitions in [32]; SA is applied to obtain a global solution within PSO, and the proposed algorithm provides optimal solutions. Niknam and Amiri [33] proposed a hybrid approach based on PSO, SA and \(K\)-means for cluster analysis, and the experimental results show that it obtains better results than PSO, SA, PSO-SA and \(K\)-means.
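The velocity and position update at the heart of PSO can be sketched as follows; the inertia weight \(w\) and acceleration coefficients \(c_1\), \(c_2\) are typical defaults of standard PSO, not parameters reported in the paper:

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One PSO step: each particle is pulled towards its own best
    position (pbest) and the swarm's best position (gbest)."""
    rng = np.random.default_rng(rng)
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    return pos + vel, vel
```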

3.4 Applications of ant colony optimization in clustering

Ant colony optimization is a meta-heuristic algorithm proposed by Dorigo et al. [8] for combinatorial optimization problems. The algorithm simulates the behavior of real ants, i.e. how ants find the shortest path between a food source and the nest. In ACO, artificial ants construct solutions by traversing a fully connected graph \(G(V, E)\), where \(V\) is a set of vertices and \(E\) is a set of edges. Each artificial ant moves from vertex to vertex along the edges of the graph, constructing a partial solution and depositing a certain amount of pheromone on each vertex it traverses; the amount of pheromone deposited depends on the quality of the solution obtained. Ants use the pheromone information to find promising regions of the search space. To inspect the efficiency of the ACO algorithm in the clustering domain, an ant-based clustering approach was proposed in [37]; the simulation results indicate that it provides better results in terms of solution quality. Kao et al. [22] proposed an ACO-based algorithm for clustering, named ACOC; its performance was compared with \(K\)-means and Shelokar's ACO algorithm, and ACOC gave better results. Tsai et al. [41] applied an ant system with a differently favorable strategy to data clustering, named ACO with different favor (ACODF). ACODF first uses differently favorable ants to solve the clustering problem; the proposed ant colony system then adopts the simulated annealing concept so that the ants visit fewer cities and local optimal solutions are obtained.

4 Experimental results

This section describes the results of the CSS algorithm on the data clustering problem. To assess its performance, the CSS algorithm is applied to ten datasets: ART1, ART2, iris, wine, CMC, glass, breast cancer Wisconsin, liver disorder (LD), thyroid and vowel. The iris, wine, CMC, glass, LD, thyroid, vowel and breast cancer Wisconsin datasets are real datasets downloaded from the UCI repository, while ART1 and ART2 are artificial. The characteristics of these datasets are given in Table 9. The proposed algorithm is implemented in the Matlab 2010a environment on a Core i5 processor with 4 GB of memory running the Windows operating system. For every dataset, the algorithm is run 20 times with randomly generated initial cluster centers to check its effectiveness. The parameter settings for the CSS algorithm are given in Table 10. The sum of intra-cluster distances and the \(f\)-measure are used to evaluate the quality of the solutions; the sum of intra-cluster distances is defined as the sum of distances between the instances placed in a cluster and the corresponding cluster center. The results are reported as the best, average and worst solutions together with the standard deviation. The quality of clustering is directly related to the minimum sum of distances, and the accuracy of clustering is measured using the \(f\)-measure. To demonstrate the effectiveness and adaptability of the CSS algorithm in the clustering domain, its experimental results are compared with those of the \(K\)-means, GA, PSO and ACO algorithms in Table 11.

Table 9 Characteristics of datasets
Table 10 Parameter settings for CSS algorithm
Table 11 Comparison of different clustering algorithms with the CSS algorithm

4.1 Datasets

4.1.1 ART1

This is a two-dimensional artificial dataset generated in Matlab to validate the proposed algorithm. It includes 300 instances with two attributes and three classes. The classes are distributed using \(\mu \) and \(\lambda \), where \(\mu \) is the mean vector and \(\lambda \) specifies the variances. The data were generated using \(\mu 1 = [3, 1]\), \(\mu 2 = [0, 3]\), \(\mu 3 = [1.5, 2.5]\) and \(\lambda 1 = [0.3, 0.5]\), \(\lambda 2 = [0.7, 0.4]\), \(\lambda 3 = [0.4, 0.6]\). Figure 2a depicts the distribution of the ART1 data and Fig. 2b shows the clustering of the same data using the CSS method.
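A sketch of how such a dataset can be generated, reading each \(\lambda \) as the per-dimension variances of a diagonal covariance matrix (the paper does not spell out the matrix form):

```python
import numpy as np

def make_art1(per_class=100, seed=0):
    """Three bivariate Gaussian classes with the ART1 means/variances."""
    rng = np.random.default_rng(seed)
    mus = [(3.0, 1.0), (0.0, 3.0), (1.5, 2.5)]
    variances = [(0.3, 0.5), (0.7, 0.4), (0.4, 0.6)]
    X = np.vstack([rng.normal(mu, np.sqrt(var), size=(per_class, 2))
                   for mu, var in zip(mus, variances)])
    y = np.repeat([0, 1, 2], per_class)
    return X, y                                   # 300 instances, 3 classes
```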

Fig. 2 a Distribution of data in ART1. b Clustering of the ART1 data using CSS

4.1.2 ART2

This is a three-dimensional artificial dataset that includes 300 instances with three attributes and three classes. The data were generated using \(\mu 1 = [10, 25, 12]\), \(\mu 2 = [11, 20, 15]\), \(\mu 3 = [14, 15, 18]\) and \(\lambda 1 = [3.4, -0.5, -1.5]\), \(\lambda 2 = [-0.5, 3.2, 0.8]\), \(\lambda 3 = [-1.5, 0.1, 1.8]\). Figure 3a shows the distribution of the ART2 data, and Fig. 3b, c shows the clustering of the same data using the CSS method.

Fig. 3 a Distribution of data in ART2. b Clustering of the ART2 data using CSS (horizontal view). c Clustering of the ART2 data using CSS (vertical view: the \(X, Y\) coordinates lie in the horizontal plane and the \(Z\) coordinate in the vertical plane)

4.1.3 Iris dataset

The iris dataset contains three varieties of iris flowers: setosa, versicolour and virginica. It comprises 150 instances with three classes and four attributes, where each class contains 50 instances. The attributes are sepal length, sepal width, petal length and petal width.

4.1.4 Wine dataset

This dataset contains the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. It contains 178 instances with thirteen attributes and three classes. The attributes are alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines and proline.

4.1.5 Glass

This dataset consists of information on six different types of glass. It contains 214 instances and 7 classes, with nine attributes: refractive index, sodium, magnesium, aluminium, silicon, potassium, calcium, barium and iron.

4.1.6 Breast cancer Wisconsin

This dataset characterizes the behavior of cell nuclei present in images of breast masses. It contains 683 instances with 2 classes, i.e. malignant and benign, and 9 attributes. The attributes are clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. The benign class consists of 444 instances, while the malignant class consists of 239 instances.

4.1.7 Contraceptive method choice

This is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey data. It contains information about married women who were either pregnant (but did not yet know it) or not pregnant. It comprises 1,473 instances and three classes, i.e., no use, long-term method and short-term method, containing 629, 334 and 510 instances, respectively. It has nine attributes: wife's age, wife's education, husband's education, number of children ever born, wife's religion, wife's current employment, husband's occupation, standard-of-living index and media exposure.

4.1.8 Thyroid

This dataset contains information about thyroid diseases and classifies each patient into one of three classes: normal, hypothyroidism and hyperthyroidism. The dataset consists of 215 instances with five features, which are the medical tests used to categorize the patients: T3-resin uptake, total serum thyroxin, total serum triiodothyronine, basal thyroid-stimulating hormone (TSH) and the maximal absolute difference of the TSH value.

4.1.9 Liver disorder

This dataset was collected by the BUPA medical research company. It consists of 345 instances with six features and two classes. The features of the LD dataset are mcv, alkphos, sgpt, sgot, gammagt and drinks.

4.1.10 Vowel

This dataset consists of 871 instances of Indian Telugu vowel sounds, with three features corresponding to the first, second and third formant frequencies, and six classes.

4.2 Performance measures

4.2.1 Sum of intra-cluster distances

This is the sum of distances between the data instances in a cluster and the corresponding cluster center. A smaller sum of intra-cluster distances indicates a better solution quality. The results are reported as the best, average and worst solutions.

4.2.2 Standard deviation

The standard deviation provides information about the dispersion of the data instances in a cluster around the cluster center. A small value indicates that the data instances are close to the center, while a large value indicates that they are far from it.

4.2.3 \(F\)-measure

The \(F\)-measure is computed from the recall and precision of an information retrieval system [7, 12]; it is the weighted harmonic mean of recall and precision. To determine the \(f\)-measure, every cluster is treated as the result of a query and every class as the set of documents desired for that query. Thus, if each cluster \(j\) consists of \(n_{j}\) data instances returned for a query, each class \(i\) consists of \(n_{i}\) data instances desired for a query, and \(n_{ij}\) is the number of instances of class \(i\) within cluster \(j\), then the recall and precision for each cluster \(j\) and class \(i\) are defined as:

$$\begin{aligned} \hbox {Recall } ({r( {i,j})})=\frac{n_{i,j} }{n_i}\quad \hbox {and}\quad \hbox {Precision }({p( {i,j})})=\frac{n_{i,j} }{n_j } \end{aligned}$$
(13)

The value of \(F\)-measure \((F(i, j))\) is computed as

$$\begin{aligned} F( {i,j})=\frac{2 \times (\hbox {Recall}\times \hbox {Precision})}{(\text{ Recall }+\text{ Precision })} \end{aligned}$$
(14)

Finally, the overall \(F\)-measure for a clustering of a dataset containing \(n\) data instances is given as

$$\begin{aligned} F=\mathop \sum \limits _{i} \frac{n_i }{n}\, {\max }_{j} (F(i,j)) \end{aligned}$$
(15)
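Eqs. (13)-(15) can be sketched as follows, taking the maximum in Eq. (15) over clusters \(j\) for each class \(i\); function and variable names are ours:

```python
import numpy as np

def f_measure(labels_true, labels_pred):
    """Overall F-measure: for every class take the best F(i, j) over
    all clusters, weighted by the class size n_i / n."""
    n = len(labels_true)
    total = 0.0
    for i in np.unique(labels_true):
        n_i = np.sum(labels_true == i)
        best = 0.0
        for j in np.unique(labels_pred):
            n_ij = np.sum((labels_true == i) & (labels_pred == j))
            if n_ij == 0:
                continue
            r = n_ij / n_i                         # recall, Eq. (13)
            p = n_ij / np.sum(labels_pred == j)    # precision, Eq. (13)
            best = max(best, 2 * r * p / (r + p))  # F(i, j), Eq. (14)
        total += (n_i / n) * best                  # Eq. (15)
    return total
```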

From Table 11, it can be seen that the results obtained by the CSS algorithm are better than those of the other algorithms. The best values achieved by the algorithm for the iris, wine, cancer, CMC, glass, LD, thyroid and vowel datasets are 96.47, 16,282.12, 2,946.48, 5,679.46, 223.58, 207.09, 9,997.25 and 149,535.61, respectively. The CSS algorithm gives better results on the iris, wine, cancer, CMC, glass and thyroid datasets, while for the LD and vowel datasets the PSO algorithm performs better than CSS. However, from the simulation results it is observed that the CSS algorithm obtains the minimum value of the best-distance measure for the LD dataset and of the worst-distance measure for the vowel dataset among all compared methods. The standard deviation indicates how far the data lie from the cluster centers; the standard deviation of the CSS algorithm is also smaller than that of the other methods. Moreover, the CSS algorithm provides better \(f\)-measure values than the others, which shows the higher accuracy of the algorithm. To support the results given in Table 11, the best centers obtained by the CSS algorithm are given in Tables 12, 13, 14, 15, 16, 17 and 18.

Table 12 Cluster centers generated using the CSS method for the ART1 and ART2 datasets
Table 13 Cluster centers of the Iris, Wine and CMC datasets using the CSS algorithm
Table 14 Cluster centers of the Glass dataset using the CSS algorithm
Table 15 Cluster centers of the Cancer dataset using the CSS algorithm
Table 16 Cluster centers of the Thyroid dataset using the CSS algorithm
Table 17 Cluster centers of the Vowel dataset using the CSS algorithm
Table 18 Cluster centers of the LD dataset using the CSS algorithm

5 Conclusion

In this paper, the CSS algorithm is applied to solve the clustering problem. In the proposed algorithm, Newton's second law of motion is used to obtain the optimal cluster centers, but it is the actual electric force \(F_k\) that plays the vital role in reaching them. The working of the proposed algorithm is therefore divided into two steps: the first step calculates the value of the actual electric force using Coulomb's and Gauss's laws; in the second step, the optimal cluster centers are obtained using Newton's second law of motion. The CSS algorithm can be applied to data clustering when the number of cluster centers (\(K\)) is known in advance. The performance of the CSS algorithm is tested on several datasets and compared with \(K\)-means, GA, PSO and ACO; the proposed algorithm provides better results, and the quality of the solutions it obtains is found to be superior to that of the other algorithms.