Parallel K-Means Clustering Algorithm on DNA Dataset
Clustering is the division of data into groups of similar objects. K-means has been used in much clustering work because of the simplicity of the algorithm. Our main effort is to parallelize the k-means clustering algorithm. The parallel version exploits the inherent parallelism of the Distance Calculation and Centroid Update phases. The parallel K-means algorithm is designed so that each of the P participating nodes is responsible for handling n/P of the n data points. We run the program on a Linux cluster with a maximum of eight nodes using the message-passing programming model. We evaluate performance by the percentage of correct answers and by speed-up. The results show that our parallel K-means program performs relatively well on large datasets.
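The partitioning scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`split_evenly`, `local_step`, `kmeans_partitioned`) are hypothetical, Euclidean distance is assumed, and the P nodes are simulated by a plain loop rather than real message passing. On a cluster, each chunk would live on its own node and the partial sums would be combined with a reduction over the network.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def split_evenly(points: Sequence[Point], num_nodes: int) -> List[Sequence[Point]]:
    """Partition the points into num_nodes contiguous chunks of about n/P each."""
    base, extra = divmod(len(points), num_nodes)
    chunks, start = [], 0
    for rank in range(num_nodes):
        size = base + (1 if rank < extra else 0)  # first `extra` nodes take one more
        chunks.append(points[start:start + size])
        start += size
    return chunks


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def local_step(chunk: Sequence[Point], centroids: Sequence[Point]):
    """Distance Calculation on one node: per-cluster partial sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts


def kmeans_partitioned(points, centroids, num_nodes, max_iter=100):
    """K-means in which each simulated node handles its own chunk of points."""
    chunks = split_evenly(points, num_nodes)
    centroids = [tuple(float(v) for v in c) for c in centroids]
    k, dim = len(centroids), len(centroids[0])
    for _ in range(max_iter):
        # Simulated parallel phase: on a cluster each call runs on its own node.
        partials = [local_step(chunk, centroids) for chunk in chunks]
        # Centroid Update: combine the partial results (a reduction).
        total_sums = [[0.0] * dim for _ in range(k)]
        total_counts = [0] * k
        for sums, counts in partials:
            for j in range(k):
                total_counts[j] += counts[j]
                for d in range(dim):
                    total_sums[j][d] += sums[j][d]
        new_centroids = [
            tuple(s / total_counts[j] for s in total_sums[j])
            if total_counts[j] > 0 else centroids[j]  # keep an empty cluster's centroid
            for j in range(k)
        ]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids
```

Because only the k partial sums and counts cross node boundaries in each iteration, the communication volume in this sketch depends on k and the dimensionality rather than on n, which is one reason this decomposition is a common choice for large datasets.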
Keywords: Master Node, Artificial Dataset, Inherent Parallelism, Positional Weight Matrix, Distributed Memory Multiprocessor