Enhancement of the K-Means Algorithm for Mixed Data in Big Data Platforms

Koren, Oded; Hallin, Carina Antonia; Perel, Nir; Bendet, Dror

doi:10.1007/978-3-030-01054-6_71

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 868)

Included in the following conference series:

Proceedings of SAI Intelligent Systems Conference

Abstract

Big data research has emerged as an important discipline in information systems research and management. The torrent of data generated on the Internet is increasingly unstructured and non-numeric, taking the form of images and text, and research indicates a growing need for more efficient algorithms that handle mixed data at big data scale. In this paper, we apply the classical K-means algorithm to both numeric and categorical attributes on big data platforms. We first present an algorithm that handles the problem of mixed data. We then implement the algorithm on big data platforms. This provides a solid basis for more targeted profiling for business and research purposes using big data, so that decision makers can treat mixed data, i.e., numerical and categorical data, to explain phenomena within the big data ecosystem.
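The abstract does not spell out the algorithm itself, so the following is only a minimal single-machine sketch of one common way to extend K-means to mixed data, in the spirit of k-prototypes-style approaches. It is not the authors' method and involves no big data platform. The function names (`mixed_distance`, `update_center`, `mixed_kmeans`), the weight `gamma`, and the choice of simple mismatch counting for categorical attributes are all assumptions made for illustration.

```python
from collections import Counter


def mixed_distance(point, center, num_idx, cat_idx, gamma=1.0):
    """Squared Euclidean distance on numeric attributes plus gamma times
    the number of mismatching categorical attributes (an assumed measure)."""
    numeric = sum((point[i] - center[i]) ** 2 for i in num_idx)
    categorical = sum(point[i] != center[i] for i in cat_idx)
    return numeric + gamma * categorical


def update_center(members, num_idx, cat_idx, width):
    """Mean for numeric attributes, most frequent value for categorical ones."""
    center = [None] * width
    for i in num_idx:
        center[i] = sum(m[i] for m in members) / len(members)
    for i in cat_idx:
        center[i] = Counter(m[i] for m in members).most_common(1)[0][0]
    return tuple(center)


def mixed_kmeans(points, init_centers, num_idx, cat_idx, gamma=1.0, max_iter=20):
    """Alternate assignment and center updates until labels stop changing."""
    centers = list(init_centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)),
                key=lambda c: mixed_distance(p, centers[c], num_idx, cat_idx, gamma))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = update_center(members, num_idx, cat_idx, len(points[0]))
    return labels, centers


if __name__ == "__main__":
    # Hypothetical toy records: one numeric and one categorical attribute.
    data = [(1.0, "red"), (1.2, "red"), (0.8, "blue"),
            (9.0, "blue"), (9.5, "blue"), (10.0, "red")]
    labels, centers = mixed_kmeans(data, [data[0], data[3]], num_idx=[0], cat_idx=[1])
    print(labels)
    print(centers)
```

In this sketch `gamma` controls how much a categorical mismatch counts relative to numeric distance. On a platform such as Hadoop, the assignment step would typically be distributed across mappers and the center update across reducers, but that mapping is likewise an assumption here rather than something the abstract states.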

Notes

1. http://hadoop.apache.org/
2. http://mahout.apache.org/

Author information

Authors and Affiliations

School of Industrial Engineering and Management, Shenkar – Engineering, Design, Art, Ramat Gan, Israel
Oded Koren, Nir Perel & Dror Bendet
Department of International Economics, Government and Business, Copenhagen Business School, Frederiksberg, Denmark
Carina Antonia Hallin

Corresponding author

Correspondence to Nir Perel.

Editor information

Editors and Affiliations

Faculty of Science and Engineering, Saga University, Saga, Japan
Kohei Arai
The Science and Information (SAI) Organization, Bradford, UK
Supriya Kapoor
The Science and Information (SAI) Organization, Bradford, UK
Rahul Bhatia

About this paper

Cite this paper

Koren, O., Hallin, C.A., Perel, N., Bendet, D. (2019). Enhancement of the K-Means Algorithm for Mixed Data in Big Data Platforms. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol 868. Springer, Cham. https://doi.org/10.1007/978-3-030-01054-6_71

DOI: https://doi.org/10.1007/978-3-030-01054-6_71
Published: 09 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01053-9
Online ISBN: 978-3-030-01054-6
eBook Packages: Intelligent Technologies and Robotics (R0)
