Analysis of Mahout Big Data Clustering Algorithms

  • Ishan Sharma
  • Rajeev Tiwari
  • Hukam Singh Rana
  • Abhineet Anand
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 624)

Abstract

Log data generated by source or communicating devices is huge; to analyze such data, it must first be grouped into clusters, on which data analytics can then be performed. Such analytics helps identify business patterns and customer behavior. Analyzing data at this scale is a major task, so distributed computing on the Hadoop platform is used together with the machine learning library Mahout. The TF-IDF weighting technique is used to vectorize the data, and clustering algorithms then form the clusters for analysis. The K-means, fuzzy K-means, LDA, and spectral clustering algorithms in Mahout are applied and compared on the basis of execution time, number of clusters, and static or dynamic cluster creation.
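
The vectorize-then-cluster pipeline described in the abstract can be illustrated with a minimal single-machine sketch. This is not Mahout's API and not the authors' code: Mahout runs these steps as distributed jobs on Hadoop, whereas the snippet below is plain Python. The sample log lines, the function names `tfidf_vectors` and `kmeans`, the particular TF-IDF variant (term frequency times log(N/df)), and seeding the centroids with the first k vectors are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumed weighting: (term count / document length) * log(N / df).
    Every document is assumed to contain at least one token.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[t] / len(tokens)) * math.log(n_docs / df[t])
            for t in vocab
        ])
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Hypothetical log lines, invented for this example.
log_lines = [
    "error disk full on node1",
    "user login success",
    "error disk failure on node2",
    "user logout success",
]
vocab, vectors = tfidf_vectors(log_lines)
labels = kmeans(vectors, k=2)
```

Here `labels` holds one cluster index per log line. Note that K-means needs the number of clusters fixed in advance, which corresponds to the static versus dynamic cluster creation criterion named in the abstract.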

Keywords

Big data, Clustering, Mahout, Vectorization

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Ishan Sharma (1)
  • Rajeev Tiwari (1)
  • Hukam Singh Rana (1)
  • Abhineet Anand (1)
  1. Department of Computer Science, UPES, Dehradun, India
