Abstract
Big data research has emerged as an important discipline in information systems research and management. Yet, while the torrent of data being generated on the Internet is increasingly unstructured and non-numeric in the form of images and texts, research indicates there is an increasing need to develop more efficient algorithms for treating mixed data in big data. In this paper, we apply the classical K-means algorithm to both numeric and categorical attributes in big data platforms. We first present an algorithm which handles the problem of mixed data. We then utilize big data platforms to implement the algorithm. This provides us with a solid basis for performing more targeted profiling for business and research purposes using big data, so that decision makers will be able to treat mixed data, i.e. numerical and categorical data, to explain phenomena within the big data ecosystem.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Abbasi, A., Sarker, S., Chiang, R.H.: Big data research in information systems: toward an inclusive research agenda. J. Assoc. Inf. Syst. 17(2) (2016)
Agarwal, R., Dhar, V.: Editorial—big data, data science, and analytics: the opportunity and challenge for IS research. Inf. Syst. Res. 25(3), 443–448 (2014)
Ahmad, A., Dey, L.: A K-mean clustering algorithm for mixed numeric and categorical data. Data Knowl. Eng. 63(2), 503–527 (2007)
Berkhin, P.: A survey of clustering data mining techniques. In: Grouping Multidimensional Data, pp. 25–71. Springer, Berlin (2006)
Cai, X., Nie, F., Huang, H.: Multi-view K-means clustering on big data. IJCAI (2013)
Cisco: The Zettabyte era: trends and analysis. White paper (2016)
Cui, X., Zhu, P., Yang, X., Li, K., Ji, C.: Optimized big data K-means clustering using MapReduce. J. Supercomput. 70(3), 1249–1259 (2014)
Cukier, K., Mayer-Schoenberger, V.: The rise of big data: how it’s changing the way we think about the world. Foreign Aff. 92(3), 28–40 (2013)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Demchenko, Y., Ngo, C., Membrey, P.: Architecture framework and components for the big data ecosystem. J. Syst. Netw. Eng. 1–31 (2013)
Di Tullio, D., Staples, D.S.: The governance and control of open source software projects. J. Manag. Inf. Syst. 30(3), 49–80 (2013)
Engelberg, G., Koren, O., Perel, N.: Big data performance evaluation analysis using Apache Pig. Int. J. Softw. Eng. Appl. 10(11), 429–440 (2016)
Füller, J., Hutter, K., Hautz, J., Matzler, K.: User roles and contributions in innovation-contest communities. J. Manag. Inf. Syst. 31(1), 273–308 (2014)
Ghemawat, S., Gobioff, H., Leung, S.T.: The Google File System, ACM SIGOPS Operating Systems Review, vol. 37, pp. 29–43 (2003)
Guo, S., Guo, X., Fang, Y., Vogel, D.: How doctors gain social and economic returns in online health-care communities: a professional capital perspective. J. Manag. Inf. Syst. 34(2), 487–519 (2017)
Henschen, D.: Why Sears is going all-in on Hadoop. InformationWeek (2012)
Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 2, 283–304 (1998)
Kendal, D., Koren, O., Perel, N.: Pig vs. hive use case analysis. Int. J. Database Theory Appl. 9(12), 267–276 (2016)
Landset, S., Khoshgoftaar, T.M., Richter, A.N., Hasanin, T.: A survey of open source tools for machine learning with big data in the Apache™ Hadoop®! ecosystem. J. Big Data 2(1), 24 (2015)
MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Byers, A.H.: Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute (2011)
Preethi, R.A., Elavarasi, J.: Big data analytics using Hadoop tools—Apache Hive vs Apache Pig. Int. J. Emerg. Technol. Comput. Sci. Electron. 24(3) (2017)
Rai, A.: Synergies between big data and theory. Manag. Inf. Syst. Q. 40(2), iii–ix (2016)
Ralambondrain, H.: A conceptual version of the K-means algorithm. Pattern Recogn. Lett. 16(11), 1147–1157 (1995)
Saboo, A.R., Kumar, V., Park, I.: Using big data to model time-varying effects for marketing resource (re) allocation. MIS Q. 40(4) (2016)
San, O.M., Huynh, V.-N., Nakamori, Y.: An alternative extension of the k-means algorithm for clustering categorical data. Int. J. Appl. Math. Comput. Sci. 14(2), 241–248 (2004)
Tambe, P.: Big data investment, skills, and firm value. Alok Gupta, pp. 1452–1469 (2014)
White, T.: Hadoop: The Definitive Guide, 4th edn. OReilly Media, Sebastopol (2015)
Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(3), 645–678 (2005)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Koren, O., Hallin, C.A., Perel, N., Bendet, D. (2019). Enhancement of the K-Means Algorithm for Mixed Data in Big Data Platforms. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol 868. Springer, Cham. https://doi.org/10.1007/978-3-030-01054-6_71
Download citation
DOI: https://doi.org/10.1007/978-3-030-01054-6_71
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01053-9
Online ISBN: 978-3-030-01054-6
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)