1 Introduction

The aim of clustering is to identify subsets of items in a given dataset that are more similar to one another than to the remaining items, using similarity measures. Various authors have applied different criteria or similarity measures to group items into clusters, but the sum of squared distances is the most widely accepted similarity measure for clustering problems. Cluster analysis has proven its significance in many areas such as pattern recognition [45, 47], image processing [38], process monitoring [40], machine learning [1], quantitative structure-activity relationships [9], document retrieval [18], bioinformatics [16], image segmentation [34] and many more. Owing to this wide applicability across domains, a large number of clustering algorithms have been developed and applied successfully. Generally, clustering algorithms can be classified into two groups: hierarchical clustering algorithms and partition-based clustering algorithms [4, 5, 26, 29]. In hierarchical algorithms, a tree structure over the data is formed by merging or splitting data according to some similarity criterion. In partition-based algorithms, clustering is performed by relocating data between clusters according to a clustering criterion, such as Euclidean distance. From the literature, it has been found that partition-based algorithms are more efficient and popular than hierarchical ones [19]. The most popular and widely used partition-based algorithm is the \(K\)-means algorithm. It is fast, simple to implement, and has linear time complexity [11, 20]. In the \(K\)-means algorithm, a dataset is divided into a predefined number \(K\) of clusters by minimizing the intra-cluster Euclidean distance [19]. However, this algorithm has some limitations: its results depend strongly on the initial cluster centers, and it can get stuck in local minima [21, 35]. To overcome these pitfalls, several heuristic algorithms have been developed. A \(K\)-harmonic means algorithm was proposed as an alternative to \(K\)-means in [46]. A simulated annealing (SA)-based approach was developed in [36]. Tabu search (TS)-based methods were introduced in [2, 39]. Genetic algorithm (GA)-based methods were presented in [6, 27, 30, 31]. Fathian et al. [10] developed a clustering algorithm based on honey-bee mating optimization (HBMO). Shelokar et al. [37] proposed an ant colony optimization (ACO)-based approach for clustering. Particle swarm optimization (PSO) was applied to clustering in [44]. Hatamlou et al. employed a big bang-big crunch algorithm for data clustering in [13]. Karaboga and Ozturk presented a novel clustering approach based on the artificial bee colony (ABC) algorithm in [23]. Data clustering based on the gravitational search algorithm was presented in [14, 15]. However, every algorithm has its own drawbacks: the \(K\)-means algorithm gets stuck in local optima; the convergence of the genetic algorithm is highly dependent on the initial population; and in ACO, the quality of the solution vector deteriorates as the number of iterations increases.

The aim of this research work is to explore the capability of the charged system search (CSS) algorithm for data clustering. The CSS algorithm is a recent meta-heuristic optimization technique developed by Kaveh and Talatahari [24]. It is based on three principles: Coulomb's law, Gauss's law and Newton's second law of motion. Every meta-heuristic algorithm balances two complementary processes, exploration and exploitation. Exploration refers to generating promising regions of the search space, while exploitation refers to identifying the most promising solutions within those regions. In CSS, exploration is carried out using Coulomb's and Gauss's laws, while Newton's second law of motion is applied to perform exploitation. The performance of the proposed algorithm has been evaluated on two artificial datasets and several real datasets from the UCI repository and compared with some existing algorithms; the quality of the solutions is improved using the CSS algorithm.

2 CSS algorithm for clustering

In this section, the CSS algorithm is explained as applied to the clustering problem. The aim is to find the optimal cluster centers for assigning \(N\) items to \(K\) cluster centers in \(R^{n}\). In the CSS algorithm, the sum of squared Euclidean distances is taken as the objective function for the clustering problem, and each item is assigned to the cluster center at the minimum Euclidean distance. The algorithm starts by defining the initial positions and velocities of \(K\) charged particles (CPs). The initial positions of the CPs are defined randomly; each CP is assumed to be a charged sphere of radius '\(a\)', and its initial velocity is set to zero. Thus, the algorithm starts with randomly defined center points and ends with optimal cluster centers. Consider Table 1, which illustrates a dataset used to explain the working of the CSS algorithm for clustering, with \(N=10\), \(n=4\) and the number of cluster centers \(K = 3\). To obtain the optimal cluster centers, the CPs use the resultant electric force (attracting force vector), the masses and the moving probabilities of the particles and cluster centers. After the first iteration, the velocities of the CPs are determined and their locations are updated. The objective function is recalculated at the new CP positions and compared with the old CP positions stored in a memory pool called the charged memory (CM). The CM is then updated with the new positions of the CPs, and the worst CPs are excluded from it. As the algorithm proceeds, the positions of the CPs are updated along with the contents of the CM. This process continues until the maximum number of iterations is reached or no better CP positions are generated.

Table 1 A dataset to explain the CSS algorithm for clustering with \(N=10\), \(n=4\) and \(K=3\)

2.1 Algorithm details

As described earlier, the algorithm starts by identifying the initial positions and velocities of the CPs in a random fashion. The initial positions of the randomly defined CPs (cluster centers) are obtained from the initialization equation of the original CSS, modified for the clustering problem as follows.

$$\begin{aligned} C_{k,i} = X_{i,\min } +r_i \times ( {X_{i,\max } -X_{i,\min } }), \quad i=1,2,\ldots , n,\ k=1,2,\ldots , K \end{aligned}$$
(1)

In the above equation, \(C_{k,i}\) denotes the \(i\)th coordinate of the \(k\)th cluster center, \(r_{i}\) is a random number whose value lies between 0 and 1, \(X_{i,\min}\) and \(X_{i,\max}\) denote the minimum and maximum values of the \(i\)th attribute of the dataset, and \(K\) is the total number of cluster centers. The initial positions of the CPs are given in Table 2.
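For illustration, the initialization of Eq. (1) can be sketched in Python/NumPy as follows, treating the dataset \(X\) as an \(N \times n\) matrix; the function name and array conventions are our own assumptions, not the authors' Matlab implementation.

```python
import numpy as np

def init_charged_particles(X, K, rng=None):
    """Draw K initial CP positions (cluster centers) per Eq. (1):
    each coordinate is sampled uniformly between the minimum and
    maximum value of the corresponding attribute."""
    rng = np.random.default_rng(rng)
    x_min, x_max = X.min(axis=0), X.max(axis=0)   # per-attribute bounds
    r = rng.random((K, X.shape[1]))               # r_i uniform in [0, 1]
    return x_min + r * (x_max - x_min)            # C_{k,i}
```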

Table 2 Initial positions of CPs obtained using Eq. (1)

It is assumed that the initial velocities of CPs are set to zero.

$$\begin{aligned} V_k =0, \quad k=1,2,3,\ldots , K \end{aligned}$$
(2)

In the CSS algorithm, the CPs are modeled as charged spheres. Hence, every CP carries a mass, which is calculated using the following equation.

$$\begin{aligned} m_k =\frac{\hbox {fit}(k)-\hbox {fit}({\hbox {worst}})}{\hbox {fit}({\hbox {best}})-\hbox {fit}({\hbox {worst}})} \end{aligned}$$
(3)

where fit(\(k\)) denotes the fitness value of the \(k\)th CP, fit(best) the best fitness value and fit(worst) the worst fitness value in the population.

The masses of the initially positioned CPs are 0.91315, 0.70944 and 1.3907. The sum of squared Euclidean distances is used as the objective function in the CSS algorithm to measure the closeness of items to the CPs, and each item is assigned to the CP with the minimum objective value. Table 3 provides the objective function values of the initially positioned CPs for our example dataset. The Euclidean distance between item \(X_j\) and cluster center \(C_k\) can be written as

$$\begin{aligned} d_{j,k} =\left\| {X_j -C_k } \right\| =\sqrt{\sum \limits _{i=1}^n {( {X_{j,i} -C_{k,i} })^2} }, \quad j=1,2,\ldots ,N,\ k=1,2,\ldots ,K \end{aligned}$$
(4)
Table 3 Normalized values of the objective function
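A minimal sketch of the distance computation of Eq. (4) and the resulting nearest-center assignment, under the same conventions as above (names are ours):

```python
import numpy as np

def assign_and_score(X, C):
    """Assign each item to its nearest center and return the cluster
    labels together with the sum of intra-cluster distances (the
    clustering objective used throughout the paper)."""
    # d[j, k] = Euclidean distance between item X_j and center C_k, Eq. (4)
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    labels = d.argmin(axis=1)                     # nearest-center assignment
    objective = d[np.arange(len(X)), labels].sum()
    return labels, objective
```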

The information contained in the assignment string below is used to arrange the items into the different clusters.

3 1 1 2 3 3 2 3 2 1

From the above string, it is observed that the first, fifth, sixth and eighth items belong to the third cluster; the second, third and tenth items belong to the first cluster; and the fourth, seventh and ninth items belong to the second cluster. Hence, at this step the dataset is divided into three clusters, and the positions of the CPs are stored in the charged memory (CM), which memorizes CP positions. Later, these stored positions are compared with the newly generated CP positions; the best positions are added to the CM and the worst positions are excluded from it. Here, the size of the CM is equal to \(N/4\). The main task of the CM is to keep track of the good CP positions obtained during the execution of the CSS algorithm; after the algorithm terminates, the optimal CP positions (i.e. the \(K\) cluster centers) are determined from the minimum objective function values. The above discussion covers the initialization of the CSS algorithm for the clustering problem.

Now we describe the main steps of the CSS algorithm, i.e., how the new positions and velocities of the CPs are generated. From the study of various meta-heuristic algorithms, it is found that every such algorithm combines two mechanisms, exploration and exploitation: exploration searches the random solution space broadly so that the most promising regions are found, while exploitation extracts good solution vectors from those promising regions. In the CSS algorithm, exploration is initiated using Coulomb's and Gauss's laws, while exploitation is performed by Newton's second law of motion. The search starts by measuring the electric force \(E_{i,k}\) generated by a CP, which may act at a point either inside or outside the CP. The direction of this force is captured by the moving probability (\(P_{i,k}\)) of the CPs, while Coulomb's and Gauss's laws are applied to measure the total electric force exerted on a CP, called the actual electric force \(F_{k}\). The moving probability \(P_{i,k}\) for each CP is determined using Eq. (5).

$$\begin{aligned} p_{ik} =\left\{ \begin{array}{ll} 1 &{}\quad \hbox {if } \dfrac{\hbox {fit}(i)-\hbox {fit}(\hbox {best})}{\hbox {fit}(k)-\hbox {fit}(i)}>\hbox {rand} \,\vee \, \hbox {fit}(k)>\hbox {fit}(i) \\ 0 &{}\quad \hbox {otherwise} \end{array} \right. \end{aligned}$$
(5)

The value of the moving probability \(P_{i,k}\) is either 0 or 1 and gives information about the movement of the CPs. Table 4 shows the moving probability values for each particle with respect to each cluster center.

Table 4 Moving probability \(P_{ik}\) of each CP with each item of dataset
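Eq. (5) can be sketched as follows, assuming a minimization objective so that fit(best) is the smallest fitness value; the vectorized form and the small guard against division by zero are our own choices:

```python
import numpy as np

def moving_probability(fit_x, fit_k, rng=None):
    """p_{ik} of Eq. (5) for one CP with fitness fit_k, evaluated
    against the fitness vector fit_x of all items."""
    rng = np.random.default_rng(rng)
    best = fit_x.min()                            # fit(best) under minimization
    ratio = (fit_x - best) / (fit_k - fit_x + 1e-12)
    return ((ratio > rng.random(len(fit_x))) | (fit_k > fit_x)).astype(float)
```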

Coulomb's and Gauss's laws are employed to determine the actual electric force \(F_{k}\) exerted on the CPs: Coulomb's law gives the force outside a CP and Gauss's law the force inside it. The generalized equation for the actual electric force \(F_{k}\) is:

$$\begin{aligned} F_k&= q_k \mathop \sum \limits _{i,i\ne k} \left( {\frac{q_i }{a^3}r_{ik} \times i_1 +\frac{q_i }{r_{ik}^2 }\times i_2 }\right) \times p_{ik} \times ( {X_i -C_k }),\nonumber \\&\left\{ \begin{array}{l} {k=1, 2, 3,\ldots , K} \\ {i_1 =1, i_2 =0 \leftrightarrow r_{ik} <a} \\ {i_1 =0, i_2 =1 \leftrightarrow r_{ik} \ge a} \\ \end{array} \right. \end{aligned}$$
(6)

Here, \(q_{i}\) and \(q_{k}\) represent the charges of the \(i\)th particle and the \(k\)th CP, \(r_{i,k}\) is the separation distance between them, \(i_{1}\) and \(i_{2}\) are indicator variables taking the value 0 or 1, and '\(a\)' is the radius of the CPs; each CP is assumed to have a uniform volume charge density, which changes in every iteration. The values of \(q_{i}\), \(r_{i,k}\) and '\(a\)' are evaluated as follows:

$$\begin{aligned}&q_i=\frac{\hbox {fit}(i)-\hbox {fit}(\hbox {worst})}{\hbox {fit}(\hbox {best})-\hbox {fit}(\hbox {worst})}\ i=1,\,2,\,3,\ldots ,\,N \end{aligned}$$
(7)
$$\begin{aligned}&r_{ik} =\frac{\Vert X_i -C_k\Vert }{\Vert (X_i +C_k )/2-X_\mathrm{best}\Vert +\varepsilon }\end{aligned}$$
(8)
$$\begin{aligned}&a=0.10\times \max (\{x_{i,\mathrm{max}} -x_{i,\mathrm{min}} \vert i=1,2,3,\ldots ,n\}) \end{aligned}$$
(9)

Tables 5, 6 and 7 provide the values of the variables \(q_{i}\), \(r_{i,k}\), \(i_{1}\), \(i_{2}\) and (\(X_{i}-C_{k}\)) used to calculate the actual electric force \(F_{k}\) applied to the CPs; the values of '\(a\)' are 0.4647, 0.91302 and 1.4191, calculated using Eq. (9). The values of the electric force \(F_{k}\) are used with Newton's second law of motion to determine the new positions and velocities of the CPs.

Table 5 Values of the magnitude of charge (\(q_{i}\)) of each CP and the separation distance (\(r_{i,k}\))
Table 6 Values of \(i_{1}\) and \(i_{2}\)
Table 7 Values of \(X_{i}-C_{k}\) for each cluster center \(k\)
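Putting Eqs. (6)-(9) together, the resultant force on each CP can be sketched as below; the charge normalizations, the choice of the best CP position as \(X_\mathrm{best}\) in Eq. (8) and all names are our own assumptions:

```python
import numpy as np

def resultant_force(X, C, fit_x, fit_c, p, eps=1e-12):
    """Actual electric force F_k on each CP, following Eqs. (6)-(9).
    X: (N, n) items; C: (K, n) CP positions; fit_x, fit_c: fitness of
    items and CPs; p: (N, K) moving probabilities from Eq. (5)."""
    a = 0.10 * (X.max(axis=0) - X.min(axis=0)).max()      # radius, Eq. (9)
    best, worst = fit_x.min(), fit_x.max()
    q_x = (fit_x - worst) / ((best - worst) if best != worst else 1.0)  # Eq. (7)
    c_best, c_worst = fit_c.min(), fit_c.max()
    q_c = (fit_c - c_worst) / ((c_best - c_worst) if c_best != c_worst else 1.0)
    x_best = C[np.argmin(fit_c)]                          # best CP (assumption)
    F = np.zeros_like(C)
    for k in range(len(C)):
        for i in range(len(X)):
            r = np.linalg.norm(X[i] - C[k]) / (
                np.linalg.norm((X[i] + C[k]) / 2 - x_best) + eps)   # Eq. (8)
            coef = q_x[i] * (r / a**3 if r < a else 1.0 / r**2)     # Eq. (6)
            F[k] += coef * p[i, k] * (X[i] - C[k])
        F[k] *= q_c[k]
    return F
```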

The actual electric force \(F_{k}\) exerted on the CPs is measured as discussed above; the values of \(F_{k}\) for the initial CPs (cluster centers) are 0.85871, 0.29861 and 0.33496.

Newton's second law of motion is employed to obtain the new positions and velocities of the CPs. This step constitutes the exploitation of solution vectors from the random search space. The new CP positions and velocities are obtained from Eqs. (10) and (11). \(Z_\mathrm{a}\) and \(Z_\mathrm{v}\) act as control parameters that govern the exploration and exploitation processes of the CSS algorithm by weighting the previous velocities and the actual resultant force exerted on a CP. \(Z_\mathrm{a}\) is the control parameter associated with the actual electric force \(F_{k}\) and controls the exploitation process; a large value of \(Z_\mathrm{a}\) increases the convergence speed of the algorithm, while a small value increases its computational time. \(Z_\mathrm{v}\) is the control parameter for the exploration process and acts on the velocities of the CPs. Note that \(Z_\mathrm{a}\) is an increasing function of the iteration count while \(Z_\mathrm{v}\) is a decreasing one. Table 8 provides the new positions of the CPs evaluated using the CSS algorithm

$$\begin{aligned} C_{k,\mathrm{new}}&= \hbox {rand}_1 \times Z_\mathrm{a} \times \frac{F_k }{m_k }\times \Delta t^2\nonumber \\&+\hbox {rand}_2 \times Z_\mathrm{v} \times V_{k,\mathrm{old}} \times \Delta t+C_{k,\mathrm{old}}\end{aligned}$$
(10)
$$\begin{aligned} V_{k,\mathrm{new}}&= \frac{C_{k,\mathrm{new}} - C_{k,\mathrm{old}} }{\Delta t}, \end{aligned}$$
(11)

where rand\(_{1}\) and rand\(_{2}\) are two random numbers whose values lie between 0 and 1, \(Z_\mathrm{a}\) and \(Z_\mathrm{v}\) are the control parameters that weight the actual electric force and the previous velocity, \(m_{k}\) is the mass of the \(k\)th CP, which is equal to \(q_{k}\), and \(\Delta t\) is the time step, set to 1.

The new positions of the CPs are listed in Table 8, and the values of the control parameters \(Z_\mathrm{a}\) and \(Z_\mathrm{v}\) are determined using Eq. (12). The new velocities (\(V_{k,\mathrm{new}}\)) of the CPs are 0.4273, 0.0498 and 0.1596.

$$\begin{aligned} Z_\mathrm{a}&= ({1+\hbox {iteration}/\hbox {iteration max}}),\nonumber \\ Z_\mathrm{v}&= (1-\hbox {iteration}/\hbox {iteration max}) \end{aligned}$$
(12)
Table 8 New positions of CPs
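The update of Eqs. (10)-(12) can be sketched as follows, writing \(Z_\mathrm{a}\) as increasing and \(Z_\mathrm{v}\) as decreasing with the iteration count, consistent with the description above (the function signature is ours):

```python
import numpy as np

def update_positions(C, V, F, m, it, it_max, dt=1.0, rng=None):
    """New CP positions (Eq. 10) and velocities (Eq. 11) from Newton's
    second law; m holds the CP masses of Eq. (3), dt is the time step."""
    rng = np.random.default_rng(rng)
    z_a = 1.0 + it / it_max                       # force coefficient, Eq. (12)
    z_v = 1.0 - it / it_max                       # velocity coefficient, Eq. (12)
    r1, r2 = rng.random(2)
    C_new = r1 * z_a * (F / m[:, None]) * dt**2 + r2 * z_v * V * dt + C
    V_new = (C_new - C) / dt
    return C_new, V_new
```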

Hence, from the above discussion, the process of the algorithm can be divided into three phases: initialization, search and termination. The initialization phase sets the CSS algorithm parameters and the positions and velocities of the initial CPs, evaluates and ranks the objective function values, and stores the CP positions in the CM. In the search phase, the new positions and velocities of the CPs are determined using the moving probability \(P_{i,k}\) and the actual electric force \(F_{k}\); the objective function is evaluated at the newly generated CPs and compared with the previous CPs, and the best CPs are ranked and stored in the CM. The termination condition is either reaching the maximum number of iterations or repeated CP positions. The flowchart of the CSS algorithm for data clustering is depicted in Fig. 1.
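These three phases can be tied together in a compact driver loop. The sketch below reuses the functions defined earlier; the fitness definitions (an item's fitness is its distance to the nearest center, a CP's fitness is the summed distance of its cluster) and the mass floor are plausible choices of our own, since the paper leaves them implicit:

```python
import numpy as np

def css_clustering(X, K, it_max=100, seed=0):
    """High-level CSS clustering loop: initialization, search, termination.
    For brevity, the charged memory is reduced to tracking the single
    best solution found so far."""
    rng = np.random.default_rng(seed)
    C = init_charged_particles(X, K, rng)                 # Eq. (1)
    V = np.zeros_like(C)                                  # Eq. (2)
    best = None
    for it in range(1, it_max + 1):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        obj = d[np.arange(len(X)), labels].sum()          # objective, Eq. (4)
        if best is None or obj < best[0]:
            best = (obj, C.copy(), labels)
        fit_x = d.min(axis=1)                             # item fitness (assumption)
        fit_c = np.array([d[labels == k, k].sum() for k in range(K)])
        spread = fit_c.min() - fit_c.max()
        m = (fit_c - fit_c.max()) / (spread if spread != 0 else 1.0)  # Eq. (3)
        m = np.clip(m, 1e-3, None)                        # floor to avoid F / 0
        p = np.stack([moving_probability(fit_x, fit_c[k], rng)
                      for k in range(K)], axis=1)         # Eq. (5)
        F = resultant_force(X, C, fit_x, fit_c, p)        # Eqs. (6)-(9)
        C, V = update_positions(C, V, F, m, it, it_max, rng=rng)  # Eqs. (10)-(12)
    return best
```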

Fig. 1 Flowchart of CSS algorithm for data clustering

3 Cluster analysis

The objective of a clustering algorithm is to group similar data objects together. In the literature, many techniques have been employed for cluster analysis, such as partition-based, hierarchical, density-based and artificial intelligence-based clustering. In this section, the artificial intelligence-based clustering methods used for comparison with our proposed algorithm are described: \(K\)-means, the genetic algorithm (GA), PSO and ACO.

3.1 Application of \(K\)-means in clustering

The \(K\)-means algorithm is one of the oldest and most popular clustering methods, developed by MacQueen [28], and has been widely applied in data clustering. It is simple, fast and robust. In the \(K\)-means algorithm, the sum of Euclidean distances is used as the similarity criterion to find a predefined number of cluster centers in a given dataset. The algorithm starts with randomly initialized cluster centers, and the data vectors are then assigned to the predefined number of cluster centers according to the minimum Euclidean distance. The cluster centers are updated as the means of the data vectors within each cluster, and this process is repeated until there is no further improvement in the cluster centers.
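For reference, the K-means procedure just described can be sketched as follows; initialization by sampling \(K\) distinct data points is one common choice and not necessarily the one used in the compared implementation:

```python
import numpy as np

def kmeans(X, K, it_max=100, seed=0):
    """Plain K-means: random initial centers, nearest-center assignment,
    mean update, repeated until the centers stop changing."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(it_max):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        C_new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else C[k] for k in range(K)])
        if np.allclose(C_new, C):                 # no further improvement
            break
        C = C_new
    return C, labels
```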

3.2 Applications of Genetic Algorithm (GA) in clustering

John Holland proposed the genetic algorithm based on Darwin's theory of evolution [17], and it has been applied to many function optimization problems. In a GA, a string of chromosomes describes the parameters of the random search space, and an objective function associated with each string defines its quality. Three operators are applied to find optimal solutions: selection, crossover and mutation. To solve the clustering problem, Murthy and Chowdhury [31] applied a genetic algorithm-based method and evaluated its performance in three experiments on synthetic and real-life datasets; the results show that the GA improves on the final outcome of \(K\)-means. Al-Sultan and Maroof Khan [3] studied several algorithms, namely the \(K\)-means algorithm, simulated annealing, tabu search and the genetic algorithm, and compared their performance on clustering problems. Krishna and Narasimha Murty [27] proposed the GKA algorithm for clustering and proved that this hybrid algorithm converges to an optimal solution. Maulik and Bandyopadhyay [30] applied the GA to clustering and evaluated its performance on four artificial and three real-life datasets; the results show that the genetic algorithm is superior to the \(K\)-means algorithm. Tseng and Yang [42, 43] proposed a genetic algorithm-based approach that automatically finds the proper number of clusters, and the outcome of their algorithm is superior to \(K\)-means.

3.3 Applications of particle swarm optimization in clustering

Particle swarm optimization is a population-based stochastic search algorithm introduced in [25] that has been widely used to solve a broad range of optimization problems. The algorithm is based on the social behavior of birds, insects, etc., and models the dispersion of individual knowledge among all members of a group: if one member finds a desirable path, the rest of the swarm follows it. In PSO, this behavior is described by particles, each associated with a position and a velocity in a random search space. The algorithm starts with a randomly initialized population. Each particle moves through the search space and remembers the best position it has observed; the particles share their good positions with one another and update their own positions and velocities based on them. The velocity update is based on the historical behavior of the particle itself as well as of its neighbors, so the particles move towards better areas of the search space over the course of the search. To investigate the performance of the PSO algorithm, Van der Merwe and Engelbrecht [44] applied PSO to clustering in two ways: first, PSO is used to obtain optimal cluster centers for a predefined number of clusters; second, PSO is used to refine the initial cluster centers for the \(K\)-means algorithm. The results show that the PSO approaches have a better convergence rate. A hybrid PSO-SA algorithm was developed to obtain good cluster partitions in [32]; SA is applied to obtain a global solution within PSO, and the proposed algorithm provides optimal solutions. Niknam and Amiri [33] proposed a hybrid approach based on PSO, SA and \(K\)-means for cluster analysis, and the experimental results show that it obtains better results than PSO, SA, PSO-SA and \(K\)-means.
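The velocity and position update at the heart of PSO can be sketched as follows; the inertia weight \(w\) and acceleration coefficients \(c_1\), \(c_2\) are typical defaults of standard PSO, not parameters reported in the paper:

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One PSO step: each particle is pulled towards its own best
    position (pbest) and the swarm's best position (gbest)."""
    rng = np.random.default_rng(rng)
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    return pos + vel, vel
```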

3.4 Applications of ant colony optimization in clustering

Ant colony optimization is a meta-heuristic algorithm proposed by Dorigo et al. [8] for combinatorial optimization problems. The algorithm simulates the behavior of real ants, i.e. how ants find the shortest path between a food source and the nest. In ACO, artificial ants construct solutions by traversing a fully connected graph \(G(V, E)\), where \(V\) is a set of vertices and \(E\) is a set of edges. Each artificial ant moves from vertex to vertex along the edges of the graph, constructing a partial solution and depositing a certain amount of pheromone on each vertex it traverses; the amount of pheromone deposited depends on the quality of the solution obtained. Ants use the pheromone information to find promising regions of the search space. To inspect the efficiency of the ACO algorithm in the clustering domain, an ant-based clustering approach was proposed in [37]; the simulation results indicate that it provides better results in terms of solution quality. Kao et al. [22] proposed an ACO-based algorithm for clustering, named ACOC; its performance was compared with \(K\)-means and Shelokar's ACO algorithm, and ACOC gave better results. Tsai et al. [41] applied an ant system with a differently favorable strategy to data clustering, named ACO with different favor (ACODF). ACODF first uses differently favorable ants to solve the clustering problem; the proposed ant colony system then adopts the simulated annealing concept so that the ants visit fewer cities and local optimal solutions are obtained.

4 Experimental results

This section describes the results of the CSS algorithm on the data clustering problem. To assess its performance, the CSS algorithm is applied to ten datasets: ART1, ART2, iris, wine, CMC, glass, breast cancer Wisconsin, liver disorder (LD), thyroid and vowel. The iris, wine, CMC, glass, LD, thyroid, vowel and breast cancer Wisconsin datasets are real datasets downloaded from the UCI repository, while ART1 and ART2 are artificial. The characteristics of these datasets are given in Table 9. The proposed algorithm is implemented in the Matlab 2010a environment on a Core i5 processor with 4 GB of memory running the Windows operating system. For every dataset, the algorithm is run 20 times with randomly generated initial cluster centers to check its effectiveness. The parameter settings for the CSS algorithm are given in Table 10. The sum of intra-cluster distances and the \(f\)-measure are used to evaluate the quality of the solutions; the sum of intra-cluster distances is defined as the sum of distances between the instances placed in a cluster and the corresponding cluster center. The results are reported as the best, average and worst solutions together with the standard deviation. The quality of clustering is directly related to the minimum sum of distances, and the accuracy of clustering is measured using the \(f\)-measure. To demonstrate the effectiveness and adaptability of the CSS algorithm in the clustering domain, its experimental results are compared with those of the \(K\)-means, GA, PSO and ACO algorithms in Table 11.

Table 9 Characteristics of datasets
Table 10 Parameter settings for CSS algorithm
Table 11 Comparison of different clustering algorithms with the CSS algorithm

4.1 Datasets

4.1.1 ART1

This is a two-dimensional artificial dataset generated in Matlab to validate the proposed algorithm. It includes 300 instances with two attributes and three classes. The classes are distributed using \(\mu \) and \(\lambda \), where \(\mu \) is the mean vector and \(\lambda \) specifies the variances. The data were generated using \(\mu 1 = [3, 1]\), \(\mu 2 = [0, 3]\), \(\mu 3 = [1.5, 2.5]\) and \(\lambda 1 = [0.3, 0.5]\), \(\lambda 2 = [0.7, 0.4]\), \(\lambda 3 = [0.4, 0.6]\). Figure 2a depicts the distribution of the ART1 data and Fig. 2b shows the clustering of the same data using the CSS method.
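A sketch of how such a dataset can be generated, reading each \(\lambda \) as the per-dimension variances of a diagonal covariance matrix (the paper does not spell out the matrix form):

```python
import numpy as np

def make_art1(per_class=100, seed=0):
    """Three bivariate Gaussian classes with the ART1 means/variances."""
    rng = np.random.default_rng(seed)
    mus = [(3.0, 1.0), (0.0, 3.0), (1.5, 2.5)]
    variances = [(0.3, 0.5), (0.7, 0.4), (0.4, 0.6)]
    X = np.vstack([rng.normal(mu, np.sqrt(var), size=(per_class, 2))
                   for mu, var in zip(mus, variances)])
    y = np.repeat([0, 1, 2], per_class)
    return X, y                                   # 300 instances, 3 classes
```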

Fig. 2 a Distribution of data in ART1. b Clustering of the ART1 data using CSS

4.1.2 ART2

This is a three-dimensional artificial dataset that includes 300 instances with three attributes and three classes. The data were generated using \(\mu 1 = [10, 25, 12]\), \(\mu 2 = [11, 20, 15]\), \(\mu 3 = [14, 15, 18]\) and \(\lambda 1 = [3.4, -0.5, -1.5]\), \(\lambda 2 = [-0.5, 3.2, 0.8]\), \(\lambda 3 = [-1.5, 0.1, 1.8]\). Figure 3a shows the distribution of the ART2 data, and Fig. 3b, c shows the clustering of the same data using the CSS method.

Fig. 3 a Distribution of data in ART2. b Clustering of the ART2 data using CSS (horizontal view). c Clustering of the ART2 data using CSS (vertical view: the \(X, Y\) coordinates lie in the horizontal plane and the \(Z\) coordinate in the vertical plane)

4.1.3 Iris dataset

The iris dataset contains three varieties of iris flowers: setosa, versicolour and virginica. It comprises 150 instances with three classes and four attributes, where each class contains 50 instances. The attributes are sepal length, sepal width, petal length and petal width.

4.1.4 Wine dataset

This dataset contains the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. It contains 178 instances with thirteen attributes and three classes. The attributes are alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines and proline.

4.1.5 Glass

This dataset consists of information on six different types of glass. It contains 214 instances and 7 classes, with nine attributes: refractive index, sodium, magnesium, aluminium, silicon, potassium, calcium, barium and iron.

4.1.6 Breast cancer Wisconsin

This dataset characterizes the behavior of cell nuclei present in images of breast masses. It contains 683 instances with 2 classes, i.e. malignant and benign, and 9 attributes. The attributes are clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. The benign class consists of 444 instances, while the malignant class consists of 239 instances.

4.1.7 Contraceptive method choice

This is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey data. It contains information about married women who were either pregnant (but did not yet know it) or not pregnant. It comprises 1,473 instances and three classes, i.e., no use, long-term method and short-term method, containing 629, 334 and 510 instances, respectively. It has nine attributes: wife's age, wife's education, husband's education, number of children ever born, wife's religion, wife's current employment, husband's occupation, standard-of-living index and media exposure.

4.1.8 Thyroid

This dataset contains information about thyroid diseases and classifies each patient into one of three classes: normal, hypothyroidism and hyperthyroidism. The dataset consists of 215 instances with five features, which are the medical tests used to categorize the patients: T3-resin uptake, total serum thyroxin, total serum triiodothyronine, basal thyroid-stimulating hormone (TSH) and the maximal absolute difference of the TSH value.

4.1.9 Liver disorder

This dataset was collected by the BUPA medical research company. It consists of 345 instances with six features and two classes. The features of the LD dataset are mcv, alkphos, sgpt, sgot, gammagt and drinks.

4.1.10 Vowel

This dataset consists of 871 instances of Indian Telugu vowel sounds, with three features corresponding to the first, second and third formant frequencies, and six classes.

4.2 Performance measures

4.2.1 Sum of intra-cluster distances

This is the sum of distances between the data instances in a cluster and the corresponding cluster center. A smaller sum of intra-cluster distances indicates a better solution quality. The results are reported as the best, average and worst solutions.

4.2.2 Standard deviation

The standard deviation provides information about the dispersion of the data instances in a cluster around the cluster center. A small value indicates that the data instances are close to the center, while a large value indicates that they are far from it.

4.2.3 \(F\)-measure

The \(F\)-measure is computed from the recall and precision of an information retrieval system [7, 12]; it is the weighted harmonic mean of recall and precision. To determine the \(f\)-measure, every cluster is treated as the result of a query and every class as the set of documents desired for that query. Thus, if each cluster \(j\) consists of \(n_{j}\) data instances returned for a query, each class \(i\) consists of \(n_{i}\) data instances desired for a query, and \(n_{ij}\) is the number of instances of class \(i\) within cluster \(j\), then the recall and precision for each cluster \(j\) and class \(i\) are defined as:

$$\begin{aligned} \hbox {Recall } ({r( {i,j})})=\frac{n_{i,j} }{n_i}\quad \hbox {and}\quad \hbox {Precision }({p( {i,j})})=\frac{n_{i,j} }{n_j } \end{aligned}$$
(13)

The value of \(F\)-measure \((F(i, j))\) is computed as

$$\begin{aligned} F( {i,j})=\frac{2 \times (\hbox {Recall}\times \hbox {Precision})}{(\text{ Recall }+\text{ Precision })} \end{aligned}$$
(14)

Finally, the overall \(F\)-measure for a clustering of a dataset containing \(n\) data instances is given as

$$\begin{aligned} F=\mathop \sum \limits _{i} \frac{n_i }{n}\, {\max }_{j} (F(i,j)) \end{aligned}$$
(15)
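Eqs. (13)-(15) can be sketched as follows, taking the maximum in Eq. (15) over clusters \(j\) for each class \(i\); function and variable names are ours:

```python
import numpy as np

def f_measure(labels_true, labels_pred):
    """Overall F-measure: for every class take the best F(i, j) over
    all clusters, weighted by the class size n_i / n."""
    n = len(labels_true)
    total = 0.0
    for i in np.unique(labels_true):
        n_i = np.sum(labels_true == i)
        best = 0.0
        for j in np.unique(labels_pred):
            n_ij = np.sum((labels_true == i) & (labels_pred == j))
            if n_ij == 0:
                continue
            r = n_ij / n_i                         # recall, Eq. (13)
            p = n_ij / np.sum(labels_pred == j)    # precision, Eq. (13)
            best = max(best, 2 * r * p / (r + p))  # F(i, j), Eq. (14)
        total += (n_i / n) * best                  # Eq. (15)
    return total
```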

From Table 11, it can be seen that the results obtained by the CSS algorithm are better than those of the other algorithms. The best values achieved by the algorithm for the iris, wine, cancer, CMC, glass, LD, thyroid and vowel datasets are 96.47, 16,282.12, 2,946.48, 5,679.46, 223.58, 207.09, 9,997.25 and 149,535.61, respectively. The CSS algorithm gives better results on the iris, wine, cancer, CMC, glass and thyroid datasets, while for the LD and vowel datasets the PSO algorithm performs better than CSS. However, from the simulation results it is observed that the CSS algorithm obtains the minimum value of the best-distance measure for the LD dataset and of the worst-distance measure for the vowel dataset among all compared methods. The standard deviation indicates how far the data lie from the cluster centers; the standard deviation of the CSS algorithm is also smaller than that of the other methods. Moreover, the CSS algorithm provides better \(f\)-measure values than the others, which shows the higher accuracy of the algorithm. To support the results given in Table 11, the best centers obtained by the CSS algorithm are given in Tables 12, 13, 14, 15, 16, 17 and 18.

Table 12 Cluster centers generated using the CSS method for the ART1 and ART2 datasets
Table 13 Cluster centers of the Iris, Wine and CMC datasets using the CSS algorithm
Table 14 Cluster centers of the Glass dataset using the CSS algorithm
Table 15 Cluster centers of the Cancer dataset using the CSS algorithm
Table 16 Cluster centers of the Thyroid dataset using the CSS algorithm
Table 17 Cluster centers of the Vowel dataset using the CSS algorithm
Table 18 Cluster centers of the LD dataset using the CSS algorithm

5 Conclusion

In this paper, the CSS algorithm is applied to solve the clustering problem. In the proposed algorithm, Newton's second law of motion is used to obtain the optimal cluster centers, but it is the actual electric force \(F_k\) that plays the vital role in reaching them. The working of the proposed algorithm is therefore divided into two steps: the first step calculates the value of the actual electric force using Coulomb's and Gauss's laws; in the second step, the optimal cluster centers are obtained using Newton's second law of motion. The CSS algorithm can be applied to data clustering when the number of cluster centers (\(K\)) is known in advance. The performance of the CSS algorithm is tested on several datasets and compared with \(K\)-means, GA, PSO and ACO; the proposed algorithm provides better results, and the quality of the solutions it obtains is found to be superior to that of the other algorithms.