Enhancement of the K-Means Algorithm for Mixed Data in Big Data Platforms

  • Oded Koren
  • Carina Antonia Hallin
  • Nir Perel
  • Dror Bendet
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 868)


Big data research has emerged as an important discipline in information systems research and management. The torrent of data generated on the Internet is increasingly unstructured and non-numeric, taking the form of images and text, and research indicates a growing need for more efficient algorithms that handle mixed data at big data scale. In this paper, we apply the classical K-means algorithm to both numeric and categorical attributes on big data platforms. We first present an algorithm that handles mixed data, and then implement it on big data platforms. This provides a solid basis for more targeted profiling for business and research purposes, so that decision makers can use mixed data, i.e. numerical and categorical data, to explain phenomena within the big data ecosystem.
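The abstract does not say how the proposed algorithm combines numeric and categorical attributes. As an illustration only, the sketch below uses one common approach, the k-prototypes idea: squared Euclidean distance on numeric fields plus a weighted count of categorical mismatches, with cluster centers updated by mean and mode respectively. The function names, the weight `gamma`, and the sample records are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def mixed_distance(a, b, num_idx, cat_idx, gamma=1.0):
    """Squared Euclidean distance on numeric fields plus
    gamma times the number of mismatching categorical fields."""
    numeric = sum((a[i] - b[i]) ** 2 for i in num_idx)
    categorical = sum(1 for i in cat_idx if a[i] != b[i])
    return numeric + gamma * categorical


def update_center(members, num_idx, cat_idx):
    """New center: mean of each numeric field, mode of each categorical field."""
    center = list(members[0])
    for i in num_idx:
        center[i] = sum(m[i] for m in members) / len(members)
    for i in cat_idx:
        center[i] = Counter(m[i] for m in members).most_common(1)[0][0]
    return tuple(center)


def k_prototypes(records, centers, num_idx, cat_idx, gamma=1.0, max_iter=20):
    """Alternate assignment and center updates until the centers stop changing."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each record goes to its nearest center.
        labels = [
            min(range(len(centers)),
                key=lambda k: mixed_distance(r, centers[k], num_idx, cat_idx, gamma))
            for r in records
        ]
        # Update step: recompute each center; keep the old one if a cluster is empty.
        new_centers = []
        for k, old in enumerate(centers):
            members = [r for r, lab in zip(records, labels) if lab == k]
            new_centers.append(update_center(members, num_idx, cat_idx) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical sample: one numeric field (index 0), one categorical field (index 1).
records = [(1.0, "red"), (1.2, "red"), (0.8, "blue"),
           (9.0, "blue"), (9.5, "blue"), (8.7, "red")]
labels, centers = k_prototypes(records, [records[0], records[3]],
                               num_idx=[0], cat_idx=[1])
print(labels)
print(centers)
```

The assignment step is independent per record and the update step aggregates per cluster, so on a MapReduce-style platform such as Hadoop the two would map naturally onto a map phase and a reduce phase. This is a general observation about K-means, not a description of the implementation in the paper.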


Keywords: Big data, Mixed data, Hadoop, K-means



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Oded Koren
    • 1
  • Carina Antonia Hallin
    • 2
  • Nir Perel
    • 1
  • Dror Bendet
    • 1
  1. School of Industrial Engineering and Management, Shenkar – Engineering, Design, Art, Ramat Gan, Israel
  2. Department of International Economics, Government and Business, Copenhagen Business School, Frederiksberg, Denmark
