
An Arabic text categorization approach using term weighting and multiple reducts

  • Methodologies and Application
  • Soft Computing

Abstract

Text categorization is the process of assigning a predefined category label to an unlabeled document based on its content. One challenge in automatic text categorization is the high dimensionality of the data, which can degrade the performance of the categorization model. This paper proposes an approach for categorizing Arabic text that combines term weighting with the reduct concept of rough set theory to reduce the number of terms used to generate the classification rules that form the classifier. It also proposes a multiple minimal reduct extraction algorithm that improves on the Quick reduct algorithm. The multiple reducts are used to generate the set of classification rules that constitute the rough set classifier. To evaluate the approach, an Arabic corpus of 2700 documents in nine categories is used. The experiments compare the results obtained with multiple minimal reducts against those obtained with a single minimal reduct. The approach achieved an accuracy of 94% with multiple reducts, outperforming the single-reduct method, which achieved 86%. The experiments also showed that the proposed approach outperforms both the K-NN and J48 algorithms in classification accuracy on this dataset.
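
To make the reduct idea concrete, the following is a minimal sketch of the standard greedy Quick reduct search that the abstract names as its starting point. It is not the paper's improved multiple-reduct algorithm, which is not described on this page. The toy decision table, the term names, and the function names `partition`, `dependency`, and `quick_reduct` are all illustrative assumptions, and the term weights are assumed to be already discretized to 0/1.

```python
def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = {}
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        blocks.setdefault(key, []).append(i)
    return list(blocks.values())


def dependency(rows, labels, attrs):
    """Rough-set dependency degree: share of rows in the positive region,
    i.e. rows whose equivalence class carries a single category label."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def quick_reduct(rows, labels, attrs):
    """Greedy Quick reduct: repeatedly add the attribute that raises the
    dependency degree most, until it matches that of the full attribute set."""
    target = dependency(rows, labels, attrs)
    reduct = []
    best = dependency(rows, labels, reduct)
    while best < target:
        candidates = [a for a in attrs if a not in reduct]
        scored = [(dependency(rows, labels, reduct + [a]), a) for a in candidates]
        best, chosen = max(scored, key=lambda pair: pair[0])
        reduct.append(chosen)
    return reduct


# Hypothetical toy table: 1 means the term's weight exceeds some threshold.
terms = ["goal", "match", "bank", "market"]
rows = [
    {"goal": 1, "match": 1, "bank": 0, "market": 0},
    {"goal": 1, "match": 0, "bank": 0, "market": 0},
    {"goal": 0, "match": 0, "bank": 1, "market": 1},
    {"goal": 0, "match": 1, "bank": 1, "market": 0},
]
labels = ["sports", "sports", "economy", "economy"]

print(quick_reduct(rows, labels, terms))  # ['goal']
```

In this toy table the term `bank` alone also separates the two categories, so `["goal"]` and `["bank"]` are both minimal reducts; the greedy search returns only the first one it finds. That is the kind of situation in which extracting several reducts, rather than one, gives the rule generator more to work with.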


Funding

This research received no specific grant from any funding agency in public, commercial, or not-for-profit sectors.

Author information

Corresponding author

Correspondence to Qasem A. Al-Radaideh.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Al-Radaideh, Q.A., Al-Abrat, M.A. An Arabic text categorization approach using term weighting and multiple reducts. Soft Comput 23, 5849–5863 (2019). https://doi.org/10.1007/s00500-018-3249-z

  • DOI: https://doi.org/10.1007/s00500-018-3249-z
