Plugging Text Processing and Mining in a Cloud Computing Framework

Rajdho, Akil; Biba, Marenglen

doi:10.1007/978-3-642-34952-2_15

Part of the book series: Studies in Computational Intelligence (SCI, volume 460)

Abstract

Computational methods have evolved over the years, giving developers and researchers faster and more sophisticated ways to solve hard data processing tasks. However, with new data collection and storage technologies, the amount of gathered data grows every day, which makes analyzing it increasingly complex. One of the main forms of stored data is plain unstructured text, and one of the most common ways of analyzing such data is text mining. Text mining is similar to other types of data mining, but unlike properly structured data (such as XML), text is semi-structured in the best case. To derive valuable information, text mining systems therefore have to execute many complex natural language processing algorithms. In this chapter we focus on text processing tools that deal with stemming algorithms. Stemming is the step that finds the stem (or root) of a word, and it is essential in every text processing procedure. Stemming algorithms are complex and require high computational effort. We present an Apache Mahout plugin for a stemming algorithm that makes it possible to execute the algorithm in a cloud computing environment. We investigate the performance of the algorithm in the cloud and show that the new approach significantly reduces the execution time of the original algorithm over a large dataset of text documents.
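
To make the idea of stemming in a map/reduce setting concrete, here is a minimal Python sketch. It is not the stemming algorithm or the Mahout plugin described in the chapter: the suffix list, the function names (`toy_stem`, `map_document`, `reduce_counts`, `stem_corpus`) and the whitespace tokenization are all assumptions chosen for illustration, and the map and reduce steps run in a single process rather than on a cluster.

```python
from collections import Counter
from functools import reduce

# Hypothetical suffix list, for illustration only; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def map_document(text: str) -> Counter:
    """Map step: stem every whitespace-separated token of one document and count the stems."""
    return Counter(toy_stem(token) for token in text.split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge the stem counts of two partial results."""
    return left + right


def stem_corpus(documents) -> Counter:
    """Apply the map step to each document, then fold the partial counts together."""
    return reduce(reduce_counts, map(map_document, documents), Counter())
```

Because each document is stemmed independently in the map step, the work could in principle be spread across many machines, with only the small per-document counts merged afterwards. That independence is the property a framework such as Hadoop exploits.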

Author information

Authors and Affiliations

Department of Computer Science, University of New York in Tirana, Tirana, Albania
Akil Rajdho & Marenglen Biba
School of Computing and Mathematical Sciences, University of Greenwich, London, UK
Akil Rajdho

Corresponding author

Correspondence to Akil Rajdho.

Editor information

Editors and Affiliations

School Of Computer Science, University of Derby, Derby, DE22 1GB, United Kingdom
Nik Bessis
Departament De Llenguatges I, Universitat Politècnica De Catalunya, C/Jordi Girona, Campus Nord, ed. Omega 1-3, Barcelona, 08034, Spain
Fatos Xhafa
School Of Electrical & Computer Engineer, Division Of Communication, National Technical University Of Athens, Iroon Polytechniou 9, Athens, 15780, Greece
Dora Varvarigou
School Of Computer Science, and Mathematics, University of Derby, Kedleston Road, Derby, DE22 1GB, United Kingdom
Richard Hill
School Of Engineering And Design, Department Of Electronic And, Brunel University, Middlesex, Uxbridge, UB8 3PH, United Kingdom
Maozhen Li

About this chapter

Cite this chapter

Rajdho, A., Biba, M. (2013). Plugging Text Processing and Mining in a Cloud Computing Framework. In: Bessis, N., Xhafa, F., Varvarigou, D., Hill, R., Li, M. (eds) Internet of Things and Inter-cooperative Computational Technologies for Collective Intelligence. Studies in Computational Intelligence, vol 460. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34952-2_15

DOI: https://doi.org/10.1007/978-3-642-34952-2_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34951-5
Online ISBN: 978-3-642-34952-2
eBook Packages: Engineering, Engineering (R0)
