The Journal of Supercomputing

, Volume 43, Issue 1, pp 21–41 | Cite as

Exploitation of a parallel clustering algorithm on commodity hardware with P2P-MPI

  • Stéphane Genaud
  • Pierre Gançarski
  • Guillaume Latu
  • Alexandre Blansché
  • Choopan Rattanapoka
  • Damien Vouriot
Article

Abstract

The goal of clustering is to identify subsets called clusters which usually correspond to objects that are more similar to each other than they are to objects from other clusters. We have proposed the MACLAW method, a cooperative coevolution algorithm for data clustering, which has shown good results (Blansché and Gançarski, Pattern Recognit. Lett. 27(11), 1299–1306, 2006). However the complexity of the algorithm increases rapidly with the number of clusters to find. We propose in this article a parallelization of MACLAW, based on a message-passing paradigm, as well as the analysis of the application performances with experiment results. We show that we reach near optimal speedups when searching for 16 clusters, a typical problem instance for which the sequential execution duration is an obstacle to the MACLAW method. Further, our approach is original because we use the P2P-MP1 grid middleware (Genaud and Rattanapoka, Lecture Notes in Comput. Sci., vol. 3666, pp. 276–284, 2005) which both provides the message passing library and infrastructure services to discover computing resources. We also put forward that the application can be tightly coupled with the middleware to make the parallel execution nearly transparent for the user.

Keywords

Clustering Evolutionary algorithms Grid Parallel algorithms Java 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Berkhin P (2002) Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA Google Scholar
  2. 2.
    Blansché A, Gançarski P (2006) MACLAW: a modular approach for clustering with local attribute weighting. Pattern Recognit Lett 27(11):1299–1306 CrossRefGoogle Scholar
  3. 3.
    Cappello F et al (2005) Grid’5000: a large scale, reconfigurable, controlable and monitorable grid platform. In: Proceedings of the 6th IEEE/ACM international workshop on grid computing Grid’2005, November 2005. http://www.grid5000.org
  4. 4.
    Carpenter B, Getov V, Judd G, Skjellum T, Fox G (2000) MPJ: MPI-like message passing for Java. Concurr Pract Experience 12(11), September Google Scholar
  5. 5.
    Chan EY, Ching WK, Ng MK, Huang JZ (2004) An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recognit 37:943–952 MATHCrossRefGoogle Scholar
  6. 6.
    Dhillon IS, Modha DS (2000) A data-clustering algorithm on distributed memory multiprocessors. In: Revised papers from large-scale parallel data mining, workshop on large-scale parallel KDD systems, SIGKDD Springer, New York, pp 245–260 Google Scholar
  7. 7.
    Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan1 M, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Discov 14(1):63–97 CrossRefGoogle Scholar
  8. 8.
    Forman G, Zhang B (2000) Linear speedup for a parallel non-approximate recasting of centerbased clustering algorithms, including k-means, k-harmonic means, and em. In: ACM SIGKDD workshop on distributed and parallel knowledge discovery, KDD-2000 Google Scholar
  9. 9.
    Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes. J Roy Stat Soc 66(4):815–849 MATHCrossRefMathSciNetGoogle Scholar
  10. 10.
    Frigui H, Nasraoui O (2004) Unsupervised learning of prototypes and attribute weights. Pattern Recognit 34:567–581 CrossRefGoogle Scholar
  11. 11.
    Gabriel E, Resch M, Beisel T, Keller R (1998) Distributed computing in an heterogeneous computing environment. In: EuroPVM/MPI. Lecture notes in comput sci, vol 1497. Springer, New York, pp 180–187 Google Scholar
  12. 12.
    Genaud S, Rattanapoka C (2005) A peer-to-peer framework for robust execution of message passing parallel programs. In: Di Martino B et al (eds) EuroPVM/MPI 2005. Lecture notes in comput sci, vol 3666. Springer, New York, pp 276–284, September Google Scholar
  13. 13.
    Genaud S, Rattanapoka C (2007) Fault management in P2P-MPI. In: Proceedings of international conference on grid and pervasive computing, GPC’07. Lecture notes in comput sci. Springer, May Google Scholar
  14. 14.
    Genaud S, Rattanapoka C (2007) P2P-MPI: a peer-to-peer framework for robust execution of message passing parallel programs. J Grid Comput 5:27–42 CrossRefGoogle Scholar
  15. 15.
    Gnanadesikan R, Kettenring JR, Tsao SL (1995) Weighting and selection of variables for cluster analysis. J Classif 12(1):113–136 MATHCrossRefGoogle Scholar
  16. 16.
    Howe N, Cardie C (1997) Examining locally varying weights for nearest neighbor algorithms. In: ICCBR, pp 455–466 Google Scholar
  17. 17.
    Huang JZ, Ng MK, Rong H, Li Z (2005) Automated variable weighting in k-means type clustering. IEEE Trans Pattern Anal Mach Intell 27(2):657–668 CrossRefGoogle Scholar
  18. 18.
  19. 19.
    Karonis NT, Toonen BT, Foster I (2003) MPICH-G2: a grid-enabled implementation of the message passing interface. J Parallel Distributed Comput special issue on Comput Grids 63(5):551–563, May MATHCrossRefGoogle Scholar
  20. 20.
    Kielmann T, Hofman RFH, Bal HE, Plaat A, Bhoedjang RAF (1999) MagPIe: MPI’s collective communication operations for clustered wide area systems. ACM SIGPLAN Notices 34(8):131–140, August CrossRefGoogle Scholar
  21. 21.
    Kruengkrai C, Jaruskulchai C (2002) A parallel learning algorithm for text classification. In: Eighth ACM SIGKDD international conference on knowledge discovery and data mining, July Google Scholar
  22. 22.
    MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, Berkeley, CA, 1967. University of California Press, pp 281–297 Google Scholar
  23. 23.
    MPI (1995) A message passing interface standard, version 1.1. Technical report, University of Tennessee, Knoxville, TN, USA, Jun Google Scholar
  24. 24.
    Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. SIGKDD explorations, newsletter of the ACM special interest group on knowledge discovery and data mining 6(1):90–106 Google Scholar
  25. 25.
    Shudo K, Tanaka Y, Sekiguchi S (2005) P3: P2P-based middleware enabling transfer and aggregation of computational resource. In: 5th intl workshop on global and peer-to-peer computing, in conjunc with CCGrid05. IEEE, May Google Scholar
  26. 26.
    Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678 CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Stéphane Genaud
    • 1
  • Pierre Gançarski
    • 2
  • Guillaume Latu
    • 1
  • Alexandre Blansché
    • 2
  • Choopan Rattanapoka
    • 1
  • Damien Vouriot
    • 2
  1. 1.LSIIT-ICPSLouis Pasteur University, Strasbourg – UMR 7005 CNRS-ULPIllkirchFrance
  2. 2.LSIIT-AFDLouis Pasteur University, Strasbourg – UMR 7005 CNRS-ULPIllkirchFrance

Personalised recommendations