Abstract
The digital world is flooded with documents belonging to a wide variety of categories. Most of these documents are uncategorized, which hinders efficient retrieval. News texts, one of the largest and most common sources of text information, often do not belong to one particular category and instead contain content from multiple domains. This calls for a text categorization system that segregates such texts into their respective domains for efficient information retrieval. The main challenge lies in handling the overlap of vocabulary among different domains at the time of categorization, which we have tackled using an approach based on fuzzy logic. In the present work a fuzzy rule inference system is presented, which works with newly proposed statistical features for segregating documents that belong to more than one category or to an undefined category. The generated model was defuzzified using five different techniques to determine the category of a document, and the Centroid method gave the highest accuracy, 98.63%. Experiments were also carried out on standard English datasets (Reuters-21578 R8 and 20 Newsgroups). We obtained better results than those of reported works, thereby pointing to the language independence of our system.
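As an illustration of the Centroid defuzzification step mentioned above, the following is a minimal sketch and not the authors' implementation. The triangular membership functions, the output universe [0, 1], the two output sets and their firing strengths are all assumptions chosen for the example; the abstract does not specify the features, rules or membership shapes used in the paper.

```python
def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def centroid_defuzzify(xs, memberships):
    """Discrete centroid: sum(x * mu(x)) / sum(mu(x))."""
    total = sum(memberships)
    if total == 0:
        raise ValueError("aggregated membership is zero everywhere")
    return sum(x * m for x, m in zip(xs, memberships)) / total


# Hypothetical output universe: a score in [0, 1] sampled at 101 points.
xs = [i / 100 for i in range(101)]

# Two hypothetical rule outputs, each clipped at an assumed firing strength
# (Mamdani-style min), then aggregated with max.
low = [min(0.3, triangular(x, 0.0, 0.25, 0.5)) for x in xs]
high = [min(0.8, triangular(x, 0.5, 0.75, 1.0)) for x in xs]
aggregated = [max(l, h) for l, h in zip(low, high)]

# Crisp score; it is pulled toward the more strongly fired "high" set.
crisp = centroid_defuzzify(xs, aggregated)
print(round(crisp, 3))
```

In a categorization setting, a crisp score of this kind could be compared against per-category thresholds, but how the paper maps the defuzzified value to a category is not stated in the abstract.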
Acknowledgements
One of the authors thanks DST for financial support as INSPIRE fellowship for carrying out this research.
Cite this article
Dhar, A., Mukherjee, H., Dash, N.S. et al. Automatic categorization of web text documents using fuzzy inference rule. Sādhanā 45, 168 (2020). https://doi.org/10.1007/s12046-020-01401-6