Distributed Algorithm for Text Documents Clustering Based on k-Means Approach

Sarnovsky, Martin; Carnoka, Noema

doi:10.1007/978-3-319-28561-0_13

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 430)

Abstract

This paper describes the design and implementation of a distributed k-means clustering algorithm for the analysis of text documents. The motivation for this research is to propose a distributed approach based on current in-memory distributed computing technologies. We used our Jbowl Java text mining library together with GridGain as the framework for distributed computing. Using these technologies, we designed and implemented the distributed k-means clustering algorithm in two modifications and ran experiments on standard text data collections. The experiments were conducted in two testing environments: a distributed computing infrastructure and a multi-core server.
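
To make the general idea of distributed k-means concrete, the following is a minimal sketch of one common partition-and-merge scheme, not the authors' implementation. It uses neither the Jbowl nor the GridGain API: Python threads stand in for computing nodes, documents are assumed to be already converted to small numeric term-weight vectors, and the names `partial_update` and `distributed_kmeans` are hypothetical. Each worker assigns its local vectors to the nearest centroid and returns per-cluster sums and counts; the coordinator merges these partial results into new centroids.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partial_update(partition, centroids):
    """Worker step (hypothetical name): assign each local vector to its
    nearest centroid and return per-cluster sums and counts for this
    partition only."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for vec in partition:
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_distance(vec, centroids[c]))
        counts[nearest] += 1
        for i, value in enumerate(vec):
            sums[nearest][i] += value
    return sums, counts


def distributed_kmeans(partitions, centroids, iterations=10):
    """Coordinator step (hypothetical name): fan the partitions out to
    workers, merge their partial sums, and recompute the centroids until
    they stop changing or the iteration limit is reached."""
    centroids = [list(c) for c in centroids]
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        for _ in range(iterations):
            current = centroids  # snapshot used by all workers this round
            results = list(pool.map(lambda p: partial_update(p, current),
                                    partitions))
            new_centroids = []
            for c in range(len(current)):
                total = sum(counts[c] for _, counts in results)
                if total == 0:
                    # No vectors assigned anywhere: keep the old centroid.
                    new_centroids.append(current[c])
                    continue
                merged = [sum(sums[c][i] for sums, _ in results) / total
                          for i in range(len(current[c]))]
                new_centroids.append(merged)
            if new_centroids == current:
                break
            centroids = new_centroids
    return centroids


if __name__ == "__main__":
    # Two partitions of toy two-term document vectors (made-up data).
    partitions = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
    print(distributed_kmeans(partitions, [[1.0, 0.0], [0.0, 1.0]]))
```

Only the per-cluster sums and counts travel back to the coordinator, so the amount of data exchanged per iteration depends on the number of clusters and the vector dimension rather than on the number of documents.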

Acknowledgements

The work presented in this paper was partially supported by the Slovak Grant Agency of the Ministry of Education and Academy of Science of the Slovak Republic under grant No. 1/1147/12 (50 %) and is a result of the implementation of the project University Science Park TECHNICOM for Innovation Applications Supported by Knowledge Technology, ITMS: 26220220182, supported by the Research & Development Operational Programme funded by the ERDF (50 %).

Author information

Authors and Affiliations

Department of Cybernetics and Artificial Intelligence, Faculty of Electrotechnics and Informatics, Technical University Kosice, Letna 9/a, 04001, Kosice, Slovakia
Martin Sarnovsky & Noema Carnoka

Corresponding author

Correspondence to Martin Sarnovsky.

Editor information

Editors and Affiliations

Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Adam Grzech, Leszek Borzemski, Jerzy Świątek & Zofia Wilimowska

About this paper

Cite this paper

Sarnovsky, M., Carnoka, N. (2016). Distributed Algorithm for Text Documents Clustering Based on k-Means Approach. In: Grzech, A., Borzemski, L., Świątek, J., Wilimowska, Z. (eds) Information Systems Architecture and Technology: Proceedings of 36th International Conference on Information Systems Architecture and Technology – ISAT 2015 – Part II. Advances in Intelligent Systems and Computing, vol 430. Springer, Cham. https://doi.org/10.1007/978-3-319-28561-0_13

DOI: https://doi.org/10.1007/978-3-319-28561-0_13
Published: 24 February 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28559-7
Online ISBN: 978-3-319-28561-0
eBook Packages: Engineering, Engineering (R0)