Abstract
Text categorization (TC) is a machine learning task that assigns a text to one of a set of predefined categories. In a nutshell, texts are converted into numerical feature vectors in which each feature is associated with a weight value. A classifier is then trained on the vectorized texts and used to classify previously unseen documents. Feature selection (FS) is optionally applied to achieve better classification accuracy with fewer features. Item response theory (IRT), on the other hand, is a set of statistical models designed to understand persons based on their responses to questions, assuming that the response to a given item is a function of both person and item properties. Even though many studies are devoted to understanding, exploring, and improving methods in each of these fields, no previous study aims to combine their strengths. In this study, an IRT-based approach is therefore proposed that uses the IRT score of a feature in both term weighting and FS, two important intermediate steps of TC. The efficiency of the proposed approach is measured on two well-known benchmark datasets by comparing it with two traditional peers. Experimental results show that the IRT-based approach can be used for text FS and that there is room for further improvement. To the best of our knowledge, this study is the first to adapt IRT to classical TC.
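As a rough illustration of the IRT idea described in the abstract, the sketch below evaluates the two-parameter logistic (2PL) item characteristic curve, a standard IRT model in which the probability of a positive response depends on a person's ability (theta) and on an item's discrimination (a) and difficulty (b). The abstract does not state which IRT model or scoring rule the proposed approach uses, so the choice of 2PL, the function name `icc_2pl`, and all parameter values here are assumptions made only for illustration.

```python
import math


def icc_2pl(theta: float, a: float, b: float) -> float:
    """2PL item characteristic curve (illustrative assumption, not the paper's method).

    theta: latent ability of the person
    a:     item discrimination (slope of the curve at theta == b)
    b:     item difficulty (ability at which the probability is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # Hypothetical item parameters, chosen only to show the shape of the curve.
    discrimination, difficulty = 1.5, 0.0
    for ability in (-2.0, -1.0, 0.0, 1.0, 2.0):
        p = icc_2pl(ability, discrimination, difficulty)
        print(f"theta={ability:+.1f}  P(positive response)={p:.3f}")
```

Under this model, a larger discrimination value makes the curve steeper around the difficulty point, which means the item separates low-ability from high-ability persons more sharply.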
Notes
This experiment is conducted only for \(k=900\) because the raiwidgets package (https://pypi.org/project/raiwidgets/) can evaluate at most 1000 features.
Acknowledgements
The author would like to thank referees and editors for their valuable suggestions contributing to the improvement of this paper.
Ethics declarations
Conflict of interest
The author declares that there is no conflict of interest.
Appendix
See Table 6.
Cite this article
Coban, O. IRText: An Item Response Theory-Based Approach for Text Categorization. Arab J Sci Eng 47, 9423–9439 (2022). https://doi.org/10.1007/s13369-021-06238-7