Abstract
Automatic text categorization is the task of sorting text documents into predefined categories using machine learning algorithms. It is one of the most important approaches to organizing and making use of the large volume of information that exists in unstructured form. Text categorization has become an extensively researched area of text mining and natural language processing. Word sense and the semantic relationships among terms, documents, and categories are essential to improving categorization performance. Several surveys on text categorization are already available; they cover techniques based on various text representation schemes to some extent, but omit several approaches that have been explored beyond the standard techniques. This paper presents an exhaustive analysis of text categorization approaches, including those that go beyond the conventional ones. It explores a wide variety of algorithms used for categorizing text documents and groups the existing work into three basic areas: conventional methods, fuzzy logic-based methods, and deep learning-based methods. Conventional methods are further divided into three groups: text categorization using handcrafted features, text categorization using nature-inspired algorithms, and text categorization using graph-based methods. The survey also gives a clear picture of the libraries available for different algorithms, the availability of datasets, and the categorization techniques explored for various Indian and non-Indian languages.
Acknowledgements
One of the authors thanks DST for support in the form of an INSPIRE fellowship.
About this article
Cite this article
Dhar, A., Mukherjee, H., Dash, N.S. et al. Text categorization: past and present. Artif Intell Rev 54, 3007–3054 (2021). https://doi.org/10.1007/s10462-020-09919-1
DOI: https://doi.org/10.1007/s10462-020-09919-1