Abstract
This paper describes the design and implementation of a distributed k-means clustering algorithm for the analysis of text documents. The motivation for this research is to propose a distributed approach based on current in-memory distributed computing technologies. We used our Jbowl Java text mining library together with GridGain as the framework for distributed computing. Using these technologies, we designed and implemented two modifications of the distributed k-means clustering algorithm and evaluated them on standard text data collections. The experiments were conducted in two testing environments: a distributed computing infrastructure and a multi-core server.
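To make the general idea of distributing k-means concrete, the following is a minimal sketch of one common scheme, not the implementation described in the paper. It assumes the documents are already dense numeric vectors split into partitions, uses squared Euclidean distance, and simulates worker nodes with local threads from the Python standard library instead of Jbowl or GridGain. The function names (`nearest`, `partial_sums`, `merge`, `distributed_kmeans`) are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )


def partial_sums(partition, centroids):
    """Worker step: assign local points, return per-cluster sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        c = nearest(point, centroids)
        counts[c] += 1
        for i, value in enumerate(point):
            sums[c][i] += value
    return sums, counts


def merge(results, centroids):
    """Master step: combine the partial results into updated centroids."""
    k, dim = len(centroids), len(centroids[0])
    totals = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for sums, local_counts in results:
        for c in range(k):
            counts[c] += local_counts[c]
            for i in range(dim):
                totals[c][i] += sums[c][i]
    # Assumption: a cluster that received no points keeps its old centroid.
    return [
        [v / counts[c] for v in totals[c]] if counts[c] else list(centroids[c])
        for c in range(k)
    ]


def distributed_kmeans(partitions, centroids, max_iter=20, tol=1e-9):
    """Iterate worker and master steps until the centroids stop moving."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        for _ in range(max_iter):
            # Each partition is processed independently, as a node would do.
            results = list(
                pool.map(partial_sums, partitions, [centroids] * len(partitions))
            )
            updated = merge(results, centroids)
            shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
            centroids = updated
            if shift <= tol:
                break
    return centroids
```

In this scheme only the per-cluster sums and counts travel between workers and the master, so the communication cost per iteration depends on the number of clusters and the vector dimension rather than on the number of documents. Real text collections are usually represented as sparse term vectors, which this dense sketch does not handle.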
Acknowledgements
The work presented in this paper was partially supported by the Slovak Grant Agency of Ministry of Education and Academy of Science of the Slovak Republic under grant No. 1/1147/12 (50 %) and as the result of the Project implementation: University Science Park TECHNICOM for Innovation Applications Supported by Knowledge Technology, ITMS: 26220220182, supported by the Research & Development Operational Programme funded by the ERDF (50 %).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Sarnovsky, M., Carnoka, N. (2016). Distributed Algorithm for Text Documents Clustering Based on k-Means Approach. In: Grzech, A., Borzemski, L., Świątek, J., Wilimowska, Z. (eds) Information Systems Architecture and Technology: Proceedings of 36th International Conference on Information Systems Architecture and Technology – ISAT 2015 – Part II. Advances in Intelligent Systems and Computing, vol 430. Springer, Cham. https://doi.org/10.1007/978-3-319-28561-0_13
DOI: https://doi.org/10.1007/978-3-319-28561-0_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28559-7
Online ISBN: 978-3-319-28561-0
eBook Packages: Engineering, Engineering (R0)