Parallel K-Means Clustering Algorithm on DNA Dataset

  • Fazilah Othman
  • Rosni Abdullah
  • Nur’Aini Abdul Rashid
  • Rosalina Abdul Salam
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3320)

Abstract

Clustering is the division of data into groups of similar objects. K-means has been used in much clustering work because of its simplicity. Our main effort is to parallelize the k-means clustering algorithm. The parallel version exploits the inherent parallelism in the Distance Calculation and Centroid Update phases. The parallel K-means algorithm is designed so that each of the P participating nodes is responsible for handling n/P data points. We ran the program on a Linux cluster with a maximum of eight nodes, using the message-passing programming model. We evaluated performance in terms of the percentage of correct answers and the speed-up achieved. The results show that our parallel K-means program performs relatively well on large datasets.
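The data-parallel scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`partition`, `local_step`, `update_centroids`, `parallel_kmeans`) are hypothetical, squared Euclidean distance is an assumption (the abstract does not say which distance measure is used on the DNA data), and the P nodes are simulated by a sequential loop rather than by real message passing. Each simulated node handles roughly n/P points in the Distance Calculation phase and returns partial sums and counts, which are then combined in the Centroid Update phase.

```python
def partition(points, num_nodes):
    """Split the points into num_nodes contiguous blocks of about n/P points."""
    base, extra = divmod(len(points), num_nodes)
    blocks, start = [], 0
    for rank in range(num_nodes):
        size = base + (1 if rank < extra else 0)
        blocks.append(points[start:start + size])
        start += size
    return blocks


def squared_distance(a, b):
    # Assumption: squared Euclidean distance between two numeric vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def local_step(block, centroids):
    """Distance Calculation on one node: assign each local point to its
    nearest centroid and accumulate per-cluster partial sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in block:
        nearest = min(range(k), key=lambda c: squared_distance(point, centroids[c]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]
    return sums, counts


def update_centroids(partials, centroids):
    """Centroid Update: combine the partial results from all nodes.
    In a message-passing program this would be a reduction step."""
    k, dim = len(centroids), len(centroids[0])
    updated = []
    for j in range(k):
        total = sum(counts[j] for _, counts in partials)
        if total == 0:
            updated.append(list(centroids[j]))  # empty cluster: keep old centroid
        else:
            updated.append(
                [sum(sums[j][d] for sums, _ in partials) / total for d in range(dim)]
            )
    return updated


def parallel_kmeans(points, initial_centroids, num_nodes, max_iter=100):
    centroids = [list(c) for c in initial_centroids]
    blocks = partition(points, num_nodes)
    for _ in range(max_iter):
        # Each call below would run concurrently on its own node.
        partials = [local_step(block, centroids) for block in blocks]
        updated = update_centroids(partials, centroids)
        if updated == centroids:  # assignments stable, so centroids repeat
            break
        centroids = updated
    return centroids
```

Because only k partial sums and counts travel between nodes in each iteration, the communication cost in this sketch depends on k and the dimensionality rather than on n, which is consistent with the abstract's observation that the parallel program does relatively well on large datasets.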

Keywords

  • Master Node
  • Artificial Dataset
  • Inherent Parallelism
  • Positional Weight Matrice
  • Distribute Memory Multiprocessor
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Fazilah Othman
    • 1
  • Rosni Abdullah
    • 1
  • Nur’Aini Abdul Rashid
    • 1
  • Rosalina Abdul Salam
    • 1
  1. School of Computer Science, Universiti Sains Malaysia, Penang, Malaysia
