An Accelerated MapReduce-Based K-prototypes for Big Data

  • Conference paper
Software Technologies: Applications and Foundations (STAF 2016)

Abstract

Big data are often characterized by huge volume and by a variety of attribute types, namely numerical and categorical. To cluster such mixed data at scale, this paper proposes an accelerated MapReduce-based k-prototypes method. The method relies on a pruning strategy that accelerates clustering by reducing unnecessary distance computations between cluster centers and data points. Experiments on large synthetic and real data sets show that the proposed method is scalable and more efficient than the existing MapReduce-based k-prototypes method.
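The abstract does not spell out the pruning rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes Huang's standard k-prototypes dissimilarity (squared Euclidean distance on numerical attributes plus a weight `gamma` times the number of categorical mismatches) and swaps in a simple early-abandonment test: a distance computation is cut short as soon as its partial sum can no longer beat the best center found so far. The names `mixed_distance`, `assign`, `map_phase`, and `reduce_phase` are hypothetical, and the map and reduce steps are plain Python functions rather than Hadoop jobs.

```python
from collections import Counter


def mixed_distance(x_num, x_cat, c_num, c_cat, gamma, bound=float("inf")):
    """K-prototypes dissimilarity with early abandonment (illustrative).

    Returns the full distance, or None as soon as the partial sum
    reaches `bound`, since this center can then no longer win.
    """
    # Categorical part: gamma times the number of mismatched attributes.
    total = gamma * sum(1 for a, b in zip(x_cat, c_cat) if a != b)
    if total >= bound:
        return None
    # Numerical part: squared Euclidean distance, accumulated term by term.
    for a, b in zip(x_num, c_num):
        total += (a - b) ** 2
        if total >= bound:
            return None
    return total


def assign(x_num, x_cat, centers, gamma):
    """Return (index, distance) of the nearest center for one data point."""
    best_idx, best = -1, float("inf")
    for idx, (c_num, c_cat) in enumerate(centers):
        d = mixed_distance(x_num, x_cat, c_num, c_cat, gamma, bound=best)
        if d is not None:  # None means the computation was pruned
            best_idx, best = idx, d
    return best_idx, best


def map_phase(points, centers, gamma):
    """Map step: emit (cluster index, point) for each point in a data split."""
    for x_num, x_cat in points:
        idx, _ = assign(x_num, x_cat, centers, gamma)
        yield idx, (x_num, x_cat)


def reduce_phase(members):
    """Reduce step: new center = mean of numerical, mode of categorical values."""
    n = len(members)
    mean = tuple(sum(m[0][j] for m in members) / n
                 for j in range(len(members[0][0])))
    mode = tuple(Counter(m[1][j] for m in members).most_common(1)[0][0]
                 for j in range(len(members[0][1])))
    return mean, mode


# Toy data (made up for illustration): two numerical and one categorical attribute.
points = [((1.0, 2.0), ("a",)), ((1.2, 1.8), ("a",)), ((9.0, 9.0), ("b",))]
centers = [((1.0, 2.0), ("a",)), ((9.0, 8.0), ("b",))]

groups = {}
for idx, point in map_phase(points, centers, gamma=1.0):
    groups.setdefault(idx, []).append(point)
new_centers = {idx: reduce_phase(members) for idx, members in groups.items()}
```

Early abandonment leaves the final assignment unchanged because a pruned center is, by construction, no closer than the current best. It only skips arithmetic, which is the kind of saving the abstract attributes to pruning.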

Notes

  1. http://hadoop.apache.org/

  2. https://projets.pasteur.fr/projects/rap-r/wiki/SyntheticDataGeneration

  3. https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data

  4. http://archive.ics.uci.edu/ml/datasets/Covertype

Author information

Corresponding author

Correspondence to Mohamed Aymen Ben HajKacem.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

HajKacem, M.A.B., N’cir, C.E.B., Essoussi, N. (2016). An Accelerated MapReduce-Based K-prototypes for Big Data. In: Milazzo, P., Varró, D., Wimmer, M. (eds) Software Technologies: Applications and Foundations. STAF 2016. Lecture Notes in Computer Science, vol 9946. Springer, Cham. https://doi.org/10.1007/978-3-319-50230-4_2

  • DOI: https://doi.org/10.1007/978-3-319-50230-4_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-50229-8

  • Online ISBN: 978-3-319-50230-4

  • eBook Packages: Computer Science, Computer Science (R0)
