A dictionary based model for Bengali document classification

Das Dawn, Debapratim; Khan, Abhinandan; Shaikh, Soharab Hossain; Pal, Rajat Kumar

doi:10.1007/s10489-022-03955-w

Published: 20 October 2022

Volume 53, pages 14023–14042, (2023)
Applied Intelligence

Abstract

Computer-aided analysis of document content is a prominent research area in natural language processing. A realistic implementation of this task is tied to the subjectivity of the quantifiable data. One of the most interesting specialisations of this problem is automated document classification: a system that identifies the category of a document without human intervention. Document classification has to evaluate the heart of the matter of the textual material. Bengali is one of the most-spoken languages in the world, and the already huge number of Bengali documents in digital form is increasing rapidly in the age of the internet. A document classification method is required to organise and categorise this huge volume of documents rapidly and efficiently. In this paper, a decisive dictionary based model is presented for the classification of Bengali text documents. We introduce the concepts of lexiconid, lexiconaffinity, lexiconunicity, and lexiconassociation to acquire the features. The feature set is integrated with different levels of threshold. The proposed model is supervised: the entire dataset is split into training and testing sets, and the model is validated using the k-fold cross-validation strategy. A significant number of dictionary based parameter values are estimated for each token present in the text. The text is classified using a new rule based classification algorithm, the predictive lexicon inference (PLI) classifier. The proposed model has been evaluated on five datasets: Paradise Lost, Iliad, Odyssey, Ramayana, and Mahabharata. In addition to document classification, this algorithm enables named entity classification and chronology or description classification.
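
As a rough illustration of the general workflow the abstract outlines (per-class dictionaries built from training text, a threshold on the features, and k-fold cross-validation), the following minimal Python sketch may help. It is not the PLI classifier: the abstract does not define lexiconid, lexiconaffinity, lexiconunicity, or lexiconassociation, so they are not implemented here. The function names, the whitespace tokenisation, the `min_count` threshold, and the relative-frequency scoring rule are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def build_lexicons(docs, labels, min_count=1):
    """Build one token-count dictionary per class from the training documents.

    Tokens seen fewer than min_count times in a class are dropped
    (a stand-in for a feature threshold; hypothetical, not from the paper).
    """
    counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        counts[label].update(text.split())  # assumed whitespace tokenisation
    return {
        label: {tok: n for tok, n in ctr.items() if n >= min_count}
        for label, ctr in counts.items()
    }


def classify(text, lexicons):
    """Return the class whose dictionary gives the document the highest score.

    A token's score in a class is its relative frequency in that class's
    dictionary; ties are broken by alphabetical order of the class label.
    """
    scores = {}
    for label, lex in lexicons.items():
        total = sum(lex.values()) or 1  # avoid division by zero for empty dictionaries
        scores[label] = sum(lex.get(tok, 0) / total for tok in text.split())
    return max(sorted(scores), key=scores.get)


def k_fold_accuracy(docs, labels, k=5, min_count=1):
    """Mean accuracy over k folds; assumes 2 <= k <= len(docs)."""
    n = len(docs)
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th document is held out
        train_docs = [docs[i] for i in range(n) if i not in test_idx]
        train_labels = [labels[i] for i in range(n) if i not in test_idx]
        lexicons = build_lexicons(train_docs, train_labels, min_count)
        correct = sum(classify(docs[i], lexicons) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Calling `k_fold_accuracy(docs, labels, k=5)` on a list of documents and their category labels returns the mean held-out accuracy. Real Bengali text would need a proper tokeniser in place of `str.split`.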

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Calcutta, Acharya Prafulla Chandra Roy Shiksha Prangan, JD-2, Sector-III, Saltlake, Kolkata, 700106, India
Debapratim Das Dawn & Rajat Kumar Pal
Product Development and Diversification, ARP Engineering, 147 Nilgunj Road, Kolkata, 700056, India
Abhinandan Khan
Department of Computer Science and Engineering, BML Munjal University, National Highway 8, 67KM Milestone, Gurugram, Haryana, 122413, India
Soharab Hossain Shaikh

Corresponding author

Correspondence to Debapratim Das Dawn.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Das Dawn, D., Khan, A., Shaikh, S.H. et al. A dictionary based model for Bengali document classification. Appl Intell 53, 14023–14042 (2023). https://doi.org/10.1007/s10489-022-03955-w

Accepted: 01 July 2022
Published: 20 October 2022
Issue Date: June 2023
DOI: https://doi.org/10.1007/s10489-022-03955-w
