Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 430)

Abstract

This paper describes the design and implementation of a distributed k-means clustering algorithm for the analysis of text documents. The motivation for this research is to propose a distributed approach based on current in-memory distributed computing technologies. We used our Jbowl Java text mining library together with GridGain as the framework for distributed computing. Using these technologies, we designed and implemented a distributed k-means clustering algorithm in two modifications and ran experiments on standard text data collections. The experiments were conducted in two testing environments: a distributed computing infrastructure and a multi-core server.
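The abstract does not spell out the two modifications, so the following is only a minimal sketch of the generic partitioned k-means pattern that distributed variants commonly build on, not the authors' implementation. It is written in Python rather than with Jbowl or GridGain. All function names are hypothetical, a thread pool stands in for the compute nodes, and dense vectors with squared Euclidean distance are assumed in place of whatever document representation and similarity measure the paper uses. Each worker returns per-cluster sums and counts for its own partition, and the coordinator merges them into new centroids.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(partition, centroids):
    """Worker step (hypothetical): assign each vector of one partition to its
    nearest centroid and return per-cluster vector sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for vec in partition:
        # Squared Euclidean distance is an assumption of this sketch.
        nearest = min(
            range(len(centroids)),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(vec, centroids[c])),
        )
        counts[nearest] += 1
        for i, x in enumerate(vec):
            sums[nearest][i] += x
    return sums, counts


def kmeans_step(partitions, centroids):
    """Coordinator step: run the workers, then merge their partial results."""
    with ThreadPoolExecutor() as pool:  # threads stand in for grid nodes
        results = list(pool.map(lambda p: partial_sums(p, centroids), partitions))
    new_centroids = []
    for c, old in enumerate(centroids):
        total = sum(counts[c] for _, counts in results)
        if total == 0:
            # Empty cluster: keep the previous centroid unchanged.
            new_centroids.append(list(old))
            continue
        merged = [sum(sums[c][i] for sums, _ in results) for i in range(len(old))]
        new_centroids.append([v / total for v in merged])
    return new_centroids


def kmeans(partitions, centroids, max_iter=100, tol=1e-9):
    """Repeat the distributed step until no centroid moves by more than tol
    (measured as squared distance) or max_iter is reached."""
    for _ in range(max_iter):
        updated = kmeans_step(partitions, centroids)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(u, c))
            for u, c in zip(updated, centroids)
        )
        centroids = updated
        if shift <= tol:
            break
    return centroids
```

Only the per-cluster sums and counts cross the worker boundary, so the amount of data exchanged per iteration depends on the number of clusters and the vector dimension, not on the number of documents in a partition.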

Notes

  1. http://sourceforge.net/projects/jbowl/

  2. http://www.gridgain.com/

Acknowledgements

The work presented in this paper was partially supported by the Slovak Grant Agency of the Ministry of Education and Academy of Science of the Slovak Republic under grant No. 1/1147/12 (50%) and is the result of the project implementation: University Science Park TECHNICOM for Innovation Applications Supported by Knowledge Technology, ITMS: 26220220182, supported by the Research & Development Operational Programme funded by the ERDF (50%).

Author information

Corresponding author

Correspondence to Martin Sarnovsky.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Sarnovsky, M., Carnoka, N. (2016). Distributed Algorithm for Text Documents Clustering Based on k-Means Approach. In: Grzech, A., Borzemski, L., Świątek, J., Wilimowska, Z. (eds) Information Systems Architecture and Technology: Proceedings of 36th International Conference on Information Systems Architecture and Technology – ISAT 2015 – Part II. Advances in Intelligent Systems and Computing, vol 430. Springer, Cham. https://doi.org/10.1007/978-3-319-28561-0_13

  • DOI: https://doi.org/10.1007/978-3-319-28561-0_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28559-7

  • Online ISBN: 978-3-319-28561-0

  • eBook Packages: Engineering (R0)
