A Scalable Text Classification Using Naive Bayes with Hadoop Framework

  • Conference paper

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1026))

Abstract

Automated text classification is the assignment of documents to predefined class labels or categories using machine learning algorithms. It is an important area of machine learning in which an algorithm assigns each document to its appropriate category or genre. For example, the documents might be news items and the categories might be business news, sports news, health news, financial news, and social news. Because of the volume of such textual data and its presumed exponential growth, classical data mining techniques may not perform efficiently. To address this, the scalable machine learning library Apache Mahout can be used with Hadoop to improve the algorithm's performance and reduce computation time. In this study, the Naïve Bayes classification algorithm is implemented on top of Hadoop to build an automatic document categorizer using the MapReduce programming model. Text documents from the Addis Ababa University institutional repository of electronic theses and dissertations are used as the training and evaluation dataset. The proposed model achieved an accuracy of 79.06%. The result shows that the system can categorize large thesis documents into their predefined classes with promising accuracy.
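The idea of training Naïve Bayes with a MapReduce-style computation can be illustrated with a minimal, single-machine sketch. This is not the authors' Mahout/Hadoop implementation: it assumes a multinomial model with Laplace smoothing and whitespace tokenization, and the names `map_document`, `reduce_counts`, `train`, `classify`, and `DOC_KEY`, as well as the toy corpus, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

DOC_KEY = "__doc_count__"  # hypothetical marker key for per-class document counts


def map_document(label, text):
    """Map step: emit (key, 1) pairs for one labeled document."""
    yield (DOC_KEY, label), 1
    for token in text.lower().split():
        yield (label, token), 1


def reduce_counts(pairs):
    """Reduce step: sum the values of all pairs that share a key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(corpus):
    """Build per-class document and token counts from (label, text) pairs."""
    pairs = (pair for label, text in corpus for pair in map_document(label, text))
    doc_counts, token_counts = {}, defaultdict(Counter)
    for (first, second), n in reduce_counts(pairs).items():
        if first == DOC_KEY:
            doc_counts[second] = n
        else:
            token_counts[first][second] = n
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return doc_counts, token_counts, vocab


def classify(text, model):
    """Return the class with the highest log-posterior under Laplace smoothing."""
    doc_counts, token_counts, vocab = model
    n_docs = sum(doc_counts.values())
    scores = {}
    for label, n in doc_counts.items():
        counts = token_counts[label]
        total = sum(counts.values())
        score = math.log(n / n_docs)  # log prior of the class
        for token in text.lower().split():
            if token in vocab:  # ignore tokens never seen in training
                score += math.log((counts[token] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy corpus invented for this sketch; it is not the thesis dataset from the paper.
corpus = [
    ("sport", "football match goal"),
    ("sport", "team wins match"),
    ("business", "stock market shares"),
    ("business", "market profit shares"),
]
model = train(corpus)
print(classify("match goal team", model))  # sport
```

In a real Hadoop job the map and reduce steps would run in parallel across the cluster and the counts would be written to distributed storage; here they are plain Python functions so that the counting logic is easy to follow.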

Notes

  1. 1 zettabyte = 10^21 bytes, or 1000 exabytes = 1 million petabytes = 1 billion terabytes.

  2. http://etd.aau.edu.et/.

Author information

Corresponding author

Correspondence to Mulualem Mheretu Temesgen.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Temesgen, M.M., Lemma, D.T. (2019). A Scalable Text Classification Using Naive Bayes with Hadoop Framework. In: Mekuria, F., Nigussie, E., Tegegne, T. (eds) Information and Communication Technology for Development for Africa. ICT4DA 2019. Communications in Computer and Information Science, vol 1026. Springer, Cham. https://doi.org/10.1007/978-3-030-26630-1_25

  • DOI: https://doi.org/10.1007/978-3-030-26630-1_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26629-5

  • Online ISBN: 978-3-030-26630-1

  • eBook Packages: Computer Science, Computer Science (R0)
