Automatic categorization of web text documents using fuzzy inference rule

Dhar, Ankita; Mukherjee, Himadri; Dash, Niladri Sekhar; Roy, Kaushik

doi:10.1007/s12046-020-01401-6

Published: 27 June 2020

Sādhanā, Volume 45, article number 168 (2020)

Ankita Dhar¹, Himadri Mukherjee¹, Niladri Sekhar Dash² & Kaushik Roy¹ (ORCID: orcid.org/0000-0002-3360-7576)

Abstract

The digital world is flooded with a huge number of documents belonging to multifarious categories. Most of these documents are uncategorized, which hinders efficient retrieval. News texts, one of the largest and most common sources of text information, often do not belong to one particular category and instead contain content from multiple domains. This calls for a text categorization system that segregates such texts into their respective domains for efficient information retrieval. The main challenge lies in handling the overlap of vocabulary among different domains at the time of categorization, which we have tackled using an approach based on fuzzy logic. In the present work, a fuzzy rule inference system is presented, which works with newly proposed statistical features for segregating documents that belong to more than one category or to an undefined category. The generated model was defuzzified using five different techniques for determining the category of a document, and the Centroid method obtained the highest accuracy, 98.63%. Experimentation was also carried out on standard English datasets (Reuters-21578 R8 and 20 Newsgroups). The results obtained were better than those of reported works, thereby pointing to the language independence of our system.
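
To make the idea of rule-based inference followed by Centroid defuzzification concrete, the following is a minimal Python sketch of a generic Mamdani-style pipeline. It is not the authors' system: the preview does not give the statistical features, membership functions, or rule base, so the per-category input score in [0, 1], the LOW/MEDIUM/HIGH sets, and the one-to-one rules below are all hypothetical placeholders chosen only for illustration.

```python
# Illustrative sketch only: feature scores, fuzzy sets and rules are assumptions,
# not the features or rule base proposed in the paper.

def trap(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Hypothetical fuzzy sets, used for both the input score and the output strength.
SETS = {
    "LOW": (0.0, 0.0, 0.2, 0.5),
    "MEDIUM": (0.2, 0.5, 0.5, 0.8),
    "HIGH": (0.5, 0.8, 1.0, 1.0),
}

# Hypothetical rule base: IF score is <key> THEN membership strength is <value>.
RULES = {"LOW": "LOW", "MEDIUM": "MEDIUM", "HIGH": "HIGH"}

def fuzzify(score):
    """Degree to which a crisp score belongs to each input set."""
    return {label: trap(score, *params) for label, params in SETS.items()}

def infer(score):
    """Firing strength of each output set (max over the rules that target it)."""
    degrees = fuzzify(score)
    firing = {}
    for in_label, out_label in RULES.items():
        firing[out_label] = max(firing.get(out_label, 0.0), degrees[in_label])
    return firing

def centroid_defuzzify(firing, steps=1000):
    """Centroid of the aggregated (clipped, max-combined) output fuzzy set."""
    num = den = 0.0
    for i in range(steps + 1):
        y = i / steps
        mu = max(min(strength, trap(y, *SETS[label]))
                 for label, strength in firing.items())
        num += y * mu
        den += mu
    return num / den if den > 0 else 0.0

def categorize(scores, threshold=0.6):
    """Return the crisp strength per category and the categories above threshold."""
    crisp = {cat: centroid_defuzzify(infer(s)) for cat, s in scores.items()}
    selected = [cat for cat, value in crisp.items() if value >= threshold]
    return crisp, selected

if __name__ == "__main__":
    # Hypothetical per-category scores for one news document.
    crisp, selected = categorize({"sports": 0.9, "business": 0.7, "science": 0.1})
    print(crisp, selected)
```

Because each category is defuzzified independently, a document can exceed the (assumed) threshold for several categories at once, or for none, which mirrors the multi-category and undefined-category cases mentioned in the abstract.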

Acknowledgements

One of the authors thanks DST for financial support in the form of an INSPIRE fellowship for carrying out this research.

Author information

Authors and Affiliations

Department of Computer Science, West Bengal State University, Kolkata, India
Ankita Dhar, Himadri Mukherjee & Kaushik Roy
Linguistic Research Unit, Indian Statistical Institute, Kolkata, India
Niladri Sekhar Dash

Corresponding author

Correspondence to Kaushik Roy.

About this article

Cite this article

Dhar, A., Mukherjee, H., Dash, N.S. et al. Automatic categorization of web text documents using fuzzy inference rule. Sādhanā 45, 168 (2020). https://doi.org/10.1007/s12046-020-01401-6

Received: 11 November 2019
Revised: 06 March 2020
Accepted: 06 March 2020
Published: 27 June 2020
DOI: https://doi.org/10.1007/s12046-020-01401-6
