Abstract
Computer-aided analysis of document content is a prominent research area in natural language processing. A realistic implementation of this task is closely tied to the subjectivity of the quantifiable data. One of the most interesting specialisations of this problem is automated document classification, in which a system identifies the category of a document without human intervention. Document classification has to evaluate the heart of the matter of the textual material. Bengali is one of the most widely spoken languages in the world, and the large number of Bengali documents available in digital form is growing rapidly in the age of the internet. A document classification method is required to organise and categorise this large volume of documents rapidly and efficiently. In this paper, a decisive dictionary based model is presented for the classification of Bengali text documents. We introduce the concepts of lexiconid, lexiconaffinity, lexiconunicity, and lexiconassociation to acquire the features. The feature set is integrated with different levels of threshold. The proposed model is supervised: the entire dataset is split into training and testing sets, and the model is validated using the k-fold cross-validation strategy. A significant number of dictionary based parameter values are estimated for each token present in the text. The text is classified using a new rule based classification algorithm, the predictive lexicon inference (PLI) classifier. The proposed model has been evaluated on five datasets: Paradise Lost, Iliad, Odyssey, Ramayana, and Mahabharata. In addition to document classification, this algorithm enables named entity classification and chronology or description classification.
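To make the general workflow concrete, the following is a minimal sketch of a dictionary based classifier validated with k-fold cross-validation. It is not the PLI classifier: the abstract does not define lexiconid, lexiconaffinity, lexiconunicity, or lexiconassociation, so the sketch substitutes a simple per-class token share as the only feature. The function names, the toy English tokens, the class labels, and the threshold value of 0.5 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(docs):
    """Build a per-class token dictionary: label -> Counter of token counts."""
    lexicon = defaultdict(Counter)
    for text, label in docs:
        lexicon[label].update(text.split())
    return lexicon


def classify(lexicon, text, threshold=0.5):
    """Pick the class with the highest summed token share.

    A token contributes to a class only if that class holds more than
    `threshold` of the token's total occurrences (an assumed stand-in
    for the paper's thresholded dictionary features).
    """
    scores = Counter()
    for token in text.split():
        total = sum(counts[token] for counts in lexicon.values())
        if total == 0:
            continue  # token is absent from every class dictionary
        for label, counts in lexicon.items():
            share = counts[token] / total
            if share > threshold:
                scores[label] += share
    return scores.most_common(1)[0][0] if scores else None


def k_fold_accuracy(docs, k=3, threshold=0.5):
    """k-fold cross-validation: fold i holds out every k-th document."""
    correct = 0
    for i in range(k):
        test_docs = docs[i::k]
        train_docs = [d for j, d in enumerate(docs) if j % k != i]
        lexicon = train(train_docs)
        correct += sum(
            classify(lexicon, text, threshold) == label
            for text, label in test_docs
        )
    return correct / len(docs)


# Toy corpus with hypothetical labels; real input would be Bengali text.
docs = [
    ("war ship hero", "epic_a"),
    ("garden angel fall", "epic_b"),
    ("hero war shield", "epic_a"),
    ("angel garden serpent", "epic_b"),
    ("ship hero war", "epic_a"),
    ("fall angel garden", "epic_b"),
]

print(k_fold_accuracy(docs, k=3))
```

Every document is held out exactly once, so the returned value is the fraction of all documents classified correctly while unseen during training. A rule based classifier of the kind the abstract describes would replace the single share feature with its own dictionary parameters.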
Cite this article
Das Dawn, D., Khan, A., Shaikh, S.H. et al. A dictionary based model for bengali document classification. Appl Intell 53, 14023–14042 (2023). https://doi.org/10.1007/s10489-022-03955-w
DOI: https://doi.org/10.1007/s10489-022-03955-w