A Scalable Text Classification Using Naive Bayes with Hadoop Framework

Temesgen, Mulualem Mheretu; Lemma, Dereje Teferi

doi:10.1007/978-3-030-26630-1_25

Mulualem Mheretu Temesgen & Dereje Teferi Lemma

Conference paper
First Online: 02 August 2019

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1026)

Abstract

Automated text classification is the assignment of documents to predefined class labels or categories using machine learning algorithms. It is an important area of machine learning in which an algorithm assigns each document to its appropriate category or genre. For example, the documents might be news items and the classes might be business, sport, health, financial, and social news. Because of the volume of such textual data and its presumed exponential growth, classical data mining techniques may not perform efficiently. To this end, the scalable machine learning library Apache Mahout can be used with Hadoop to improve the algorithm's performance and computation time. In this study, the Naïve Bayes classification algorithm is implemented on top of Hadoop to build an automatic document categorizer using the MapReduce programming model. Text documents from the Addis Ababa University institutional repository of electronic theses and dissertations are used as the training and evaluation dataset. The proposed model achieved an accuracy of 79.06%. The result shows that the system can categorize large thesis documents into their predefined classes with promising accuracy.
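
The general idea of training Naïve Bayes with a map step and a reduce step can be illustrated with a minimal single-machine sketch. This is not the authors' implementation and does not use the Apache Mahout or Hadoop APIs; the function names (`map_doc`, `reduce_counts`, `train`, `classify`), the toy documents, and the choice of a multinomial model with add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def map_doc(label, text):
    """Map step (hypothetical): emit one count per document and per token."""
    yield ("doc", label), 1
    for word in text.lower().split():
        yield ("word", label, word), 1


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum all values that share a key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs):
    """Build per-class document counts and word counts from (label, text) pairs."""
    pairs = (kv for label, text in docs for kv in map_doc(label, text))
    totals = reduce_counts(pairs)
    doc_counts = {}
    word_counts = defaultdict(Counter)
    for key, n in totals.items():
        if key[0] == "doc":
            doc_counts[key[1]] = n
        else:
            word_counts[key[1]][key[2]] = n
    vocab = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocab


def classify(text, model):
    """Pick the class with the highest log posterior (add-one smoothing assumed)."""
    doc_counts, word_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)
        for word in text.lower().split():
            if word in vocab:  # words unseen in training are ignored
                count = word_counts[label][word]
                score += math.log((count + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for this sketch; not drawn from the thesis repository.
toy_docs = [
    ("sport", "team wins match"),
    ("sport", "player scores goal"),
    ("business", "market shares rise"),
    ("business", "bank profit shares"),
]
toy_model = train(toy_docs)
print(classify("shares market profit", toy_model))
```

In a distributed setting the map step would run independently on each input split and the reduce step would aggregate counts by key across machines; here both run in one process so that only the counting logic is shown.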

Notes

1. 1 zettabyte = 10²¹ bytes = 1,000 exabytes = 1 million petabytes = 1 billion terabytes.
2. http://etd.aau.edu.et/.

Author information

Authors and Affiliations

College of Computing and Informatics, Assosa University, Assosa, Ethiopia
Mulualem Mheretu Temesgen
School of Information Science, Addis Ababa University, Addis Ababa, Ethiopia
Dereje Teferi Lemma

Corresponding author

Correspondence to Mulualem Mheretu Temesgen.

Editor information

Editors and Affiliations

Council for Scientific and Industrial Research, Meraka ICT Institute, Pretoria, South Africa
Fisseha Mekuria
Department of Future Technologies, University of Turku, Turku, Finland
Ethiopia Nigussie
ICT4D Research Center, Bahir Dar University, Bahir Dar, Ethiopia
Tesfa Tegegne

About this paper

Cite this paper

Temesgen, M.M., Lemma, D.T. (2019). A Scalable Text Classification Using Naive Bayes with Hadoop Framework. In: Mekuria, F., Nigussie, E., Tegegne, T. (eds) Information and Communication Technology for Development for Africa. ICT4DA 2019. Communications in Computer and Information Science, vol 1026. Springer, Cham. https://doi.org/10.1007/978-3-030-26630-1_25

DOI: https://doi.org/10.1007/978-3-030-26630-1_25
Published: 02 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26629-5
Online ISBN: 978-3-030-26630-1
eBook Packages: Computer Science, Computer Science (R0)
