Enhancement of the K-Means Algorithm for Mixed Data in Big Data Platforms

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 868)

Abstract

Big data research has become an important discipline in information systems research and management. The torrent of data generated on the Internet is increasingly unstructured and non-numeric, taking the form of images and text, and research indicates a growing need for more efficient algorithms that handle mixed data in big data settings. In this paper, we apply the classical K-means algorithm to both numeric and categorical attributes in big data platforms. We first present an algorithm that handles the problem of mixed data. We then use big data platforms to implement the algorithm. This provides a solid basis for more targeted profiling for business and research purposes using big data, so that decision makers can treat mixed data, i.e., numerical and categorical data, to explain phenomena within the big data ecosystem.
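
The abstract does not spell out how the enhanced algorithm combines numeric and categorical attributes. One common way to adapt K-means to mixed data, shown here only as an illustrative sketch and not as the authors' method, is a k-prototypes-style scheme: squared Euclidean distance on the numeric attributes plus a weighted count of categorical mismatches, with centroids updated by the mean (numeric) and the mode (categorical). The function names, the weight `gamma`, and the index lists `num_idx` and `cat_idx` are assumptions introduced for this example. The sketch runs on a single machine and does not represent a big data platform implementation.

```python
from collections import Counter


def mixed_distance(point, centroid, num_idx, cat_idx, gamma=1.0):
    """Squared Euclidean distance on numeric attributes plus
    gamma times the number of categorical mismatches (assumed measure)."""
    numeric = sum((point[i] - centroid[i]) ** 2 for i in num_idx)
    mismatches = sum(point[i] != centroid[i] for i in cat_idx)
    return numeric + gamma * mismatches


def update_centroid(members, num_idx, cat_idx):
    """Mean for numeric attributes, most frequent value for categorical ones."""
    centroid = list(members[0])
    for i in num_idx:
        centroid[i] = sum(m[i] for m in members) / len(members)
    for i in cat_idx:
        centroid[i] = Counter(m[i] for m in members).most_common(1)[0][0]
    return tuple(centroid)


def cluster_mixed(points, initial_centroids, num_idx, cat_idx,
                  gamma=1.0, max_iter=20):
    """Lloyd-style iteration: assign each point to its nearest centroid,
    recompute centroids, and stop when they no longer change."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centroids)),
                key=lambda k: mixed_distance(p, centroids[k],
                                             num_idx, cat_idx, gamma))
            for p in points
        ]
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(
                update_centroid(members, num_idx, cat_idx)
                if members else centroids[k]
            )
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

For records such as `(1.0, "red")` and `(8.0, "blue")`, calling `cluster_mixed(points, [points[0], points[2]], num_idx=[0], cat_idx=[1])` groups records by both their numeric value and their category. The weight `gamma` controls how strongly a categorical mismatch counts relative to numeric distance and would need tuning for real data.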

Notes

  1. http://hadoop.apache.org/

  2. http://mahout.apache.org/

Author information

Corresponding author

Correspondence to Nir Perel.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Koren, O., Hallin, C.A., Perel, N., Bendet, D. (2019). Enhancement of the K-Means Algorithm for Mixed Data in Big Data Platforms. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol 868. Springer, Cham. https://doi.org/10.1007/978-3-030-01054-6_71
