A dictionary based model for Bengali document classification

Applied Intelligence

Abstract

Computer-aided analysis of document content is a prominent research area in natural language processing. A realistic implementation of this task has to contend with the subjectivity of the quantifiable data. One of the most interesting specialisations of this problem is automated document classification, in which a system identifies the category of a document without human intervention. Document classification has to evaluate the heart of the matter of the textual material. Bengali is one of the most widely spoken languages in the world, and a huge number of Bengali documents exist in digital form, a number that is increasing rapidly in the age of the internet. A document classification method is required to organise and categorise this large volume of documents rapidly and efficiently. In this paper, a decisive dictionary based model is presented for the classification of Bengali text documents. We introduce the concepts of lexiconid, lexiconaffinity, lexiconunicity, and lexiconassociation to acquire the features. The feature set is integrated with different levels of threshold. The proposed model is supervised: the entire dataset is split into training and testing sets, and the model is validated using the k-fold cross-validation strategy. A significant number of dictionary based parameter values are estimated for each token present in the text. The text is classified using a new rule based classification algorithm, the predictive lexicon inference (PLI) classifier. The proposed model has been evaluated on five datasets: Paradise Lost, Iliad, Odyssey, Ramayana, and Mahabharata. In addition to document classification, the algorithm enables named entity classification and chronology or description classification.
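
To make the general workflow concrete, the following is a minimal Python sketch of a supervised, dictionary based classifier with a threshold, validated by k-fold cross-validation. It is not the PLI classifier: the abstract does not define lexiconid, lexiconaffinity, lexiconunicity, or lexiconassociation, so the per-token category share used here is a hypothetical stand-in feature, and the function names, the threshold value, and the toy transliterated tokens are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_dictionary(docs):
    """Map each token to its per-category counts (hypothetical stand-in feature)."""
    table = defaultdict(Counter)
    for tokens, label in docs:
        for tok in tokens:
            table[tok][label] += 1
    return table

def classify(tokens, table, threshold=0.6):
    """Let each known token vote for its dominant category,
    but only if that category's share of the token's count meets the threshold."""
    votes = Counter()
    for tok in tokens:
        counts = table.get(tok)
        if not counts:
            continue  # token never seen in training
        label, top = counts.most_common(1)[0]
        if top / sum(counts.values()) >= threshold:
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None

def k_fold_accuracy(docs, k=3, threshold=0.6):
    """Average accuracy over k folds; fold i holds out every k-th document."""
    scores = []
    for i in range(k):
        test = docs[i::k]
        train = [d for j, d in enumerate(docs) if j % k != i]
        table = build_dictionary(train)
        correct = sum(classify(toks, table, threshold) == y for toks, y in test)
        scores.append(correct / len(test))
    return sum(scores) / k

# Invented toy corpus: (tokens, category) pairs, not data from the paper.
docs = [
    (["yuddha", "raja", "sena"], "epic"),
    (["nadi", "brishti", "megh"], "nature"),
    (["raja", "sena", "rath"], "epic"),
    (["megh", "nadi", "phul"], "nature"),
    (["yuddha", "rath", "raja"], "epic"),
    (["phul", "brishti", "nadi"], "nature"),
]

accuracy = k_fold_accuracy(docs, k=3)
```

Raising the threshold makes the sketch ignore tokens that are shared across categories, which loosely mirrors the idea of integrating the feature set with different threshold levels; the actual features and rules are described only in the full paper.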

Author information

Corresponding author

Correspondence to Debapratim Das Dawn.

About this article

Cite this article

Das Dawn, D., Khan, A., Shaikh, S.H. et al. A dictionary based model for bengali document classification. Appl Intell 53, 14023–14042 (2023). https://doi.org/10.1007/s10489-022-03955-w

  • DOI: https://doi.org/10.1007/s10489-022-03955-w
