Part of the book series: Studies in Computational Intelligence ((SCI,volume 460))

Abstract

Computational methods have evolved over the years, giving developers and researchers faster and more sophisticated ways to solve hard data processing tasks. However, with new data collection and storage technologies, the amount of gathered data increases every day, making its analysis an increasingly complex task. One of the main forms of stored data is plain unstructured text, and one of the most common ways of analyzing this kind of data is Text Mining. Text Mining is similar to other types of data mining, but unlike properly structured data (such as XML), text is semi-structured in the best case. To derive valuable information, text mining systems have to execute many complex natural language processing algorithms. In this chapter we focus on text processing tools that deal with stemming algorithms. Stemming is the step that finds the stem (or root) of a word, and it is essential in every text processing procedure. Stemming algorithms are complex and require high computational effort. We present an Apache Mahout plugin for a stemming algorithm that makes it possible to execute the algorithm in a cloud computing environment. We investigate the performance of the algorithm in the cloud and show that the new approach significantly reduces the execution time of the original algorithm over a large dataset of text documents.
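To make the idea of stemming inside a map/reduce-style job concrete, here is a minimal Python sketch. It is not the chapter's algorithm and does not use the Apache Mahout or Hadoop APIs: the suffix list, the minimum stem length, and the function names `stem`, `map_document`, `reduce_counts` and `stem_corpus` are all hypothetical, chosen only for illustration. The map step stems the tokens of one document independently of every other document, which is the property that would let a framework spread the work across machines; the reduce step merges the per-document results.

```python
from collections import Counter

# Hypothetical toy suffix rules; a real stemmer uses far richer, language-specific rules.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def map_document(text):
    """Map step: emit a (stem, 1) pair for every whitespace-separated token."""
    return [(stem(token), 1) for token in text.lower().split()]


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each stem."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def stem_corpus(documents):
    """Run the map step per document, then reduce all pairs together.

    Here the documents are processed sequentially; in a cluster setting each
    map_document call could run on a different node.
    """
    pairs = []
    for document in documents:
        pairs.extend(map_document(document))
    return reduce_counts(pairs)


if __name__ == "__main__":
    print(stem_corpus(["Stemming texts", "stemmed text"]))
```

With these toy rules, "stemming" and "stemmed" collapse to the same stem, as do "texts" and "text", so the two sample documents yield two distinct stems.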

Corresponding author

Correspondence to Akil Rajdho.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Rajdho, A., Biba, M. (2013). Plugging Text Processing and Mining in a Cloud Computing Framework. In: Bessis, N., Xhafa, F., Varvarigou, D., Hill, R., Li, M. (eds) Internet of Things and Inter-cooperative Computational Technologies for Collective Intelligence. Studies in Computational Intelligence, vol 460. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34952-2_15

  • DOI: https://doi.org/10.1007/978-3-642-34952-2_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34951-5

  • Online ISBN: 978-3-642-34952-2

  • eBook Packages: Engineering, Engineering (R0)
