Automatic categorization of web text documents using fuzzy inference rule

Sādhanā

Abstract

The digital world is flooded with documents from many different categories. Most of them are uncategorized, which hinders efficient retrieval. News texts, one of the largest and most common sources of text information, often do not belong to a single category but draw content from multiple domains. A text categorization system is therefore needed to assign such a text to its respective domains for efficient information retrieval. The main challenge is the overlap of vocabulary among domains at categorization time, which we tackle with an approach based on fuzzy logic. We present a fuzzy rule inference system that works with newly proposed statistical features to segregate documents that belong to more than one category or to an undefined category. The generated model was defuzzified using five different techniques to determine the category of a document; the Centroid method gave the highest accuracy, 98.63%. Experiments were also carried out on standard English datasets (Reuters-21578 R8 and 20 Newsgroups), where we obtained better results than those of reported works, which points to the language independence of our system.
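As a rough illustration of the kind of pipeline the abstract describes, the following minimal sketch fires two fuzzy rules on a per-category score, aggregates the clipped output sets, and applies Centroid defuzzification. It is not the authors' system: the membership functions, the two rules, the `threshold` value, and the input scores are all hypothetical assumptions chosen for illustration, and the statistical features proposed in the paper are not reproduced here.

```python
# Minimal sketch of fuzzy rule inference with centroid defuzzification.
# All set shapes, rules, and thresholds below are illustrative assumptions.

def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with feet a, c and peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def centroid(xs, memberships):
    """Centroid (centre of gravity) of a sampled membership curve."""
    total = sum(memberships)
    if total == 0:
        raise ValueError("aggregated membership is zero everywhere")
    return sum(x * m for x, m in zip(xs, memberships)) / total

# Sampled output universe: degree of belonging to a category, in [0, 1].
GRID = [i / 100 for i in range(101)]

LOW = (0.0, 0.0, 0.6)   # hypothetical "low" fuzzy set
HIGH = (0.4, 1.0, 1.0)  # hypothetical "high" fuzzy set

def belonging(score):
    """Crisp degree of belonging for one category score in [0, 1].

    Rule 1: IF score is low  THEN belonging is low.
    Rule 2: IF score is high THEN belonging is high.
    """
    fire_low = triangular(score, *LOW)
    fire_high = triangular(score, *HIGH)
    # Clip each consequent set by its firing strength, then aggregate with max.
    aggregated = [
        max(min(fire_low, triangular(y, *LOW)),
            min(fire_high, triangular(y, *HIGH)))
        for y in GRID
    ]
    return centroid(GRID, aggregated)

def categorize(scores, threshold=0.5):
    """Return every category whose defuzzified belonging reaches the threshold.

    Several categories may be returned (multi-domain text); an empty list
    stands for an undefined category.
    """
    return [cat for cat, s in scores.items() if belonging(s) >= threshold]

if __name__ == "__main__":
    print(categorize({"sports": 0.8, "politics": 0.7, "science": 0.1}))
    print(categorize({"sports": 0.1, "politics": 0.2}))
```

Returning a list rather than a single label is a design choice that mirrors the abstract's two hard cases: a document whose vocabulary overlaps several domains yields more than one category, and a document that scores low everywhere yields none.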

Acknowledgements

One of the authors thanks DST for financial support as INSPIRE fellowship for carrying out this research.

Corresponding author

Correspondence to Kaushik Roy.

Cite this article

Dhar, A., Mukherjee, H., Dash, N.S. et al. Automatic categorization of web text documents using fuzzy inference rule. Sādhanā 45, 168 (2020). https://doi.org/10.1007/s12046-020-01401-6

  • DOI: https://doi.org/10.1007/s12046-020-01401-6