
IRText: An Item Response Theory-Based Approach for Text Categorization

  • Research Article - Computer Engineering and Computer Science
  • Journal: Arabian Journal for Science and Engineering

Abstract

Text categorization (TC) is a machine learning task that aims to assign a text to one of a set of predefined categories. In a nutshell, texts are converted into numerical feature vectors in which each feature is associated with a weight value. Afterward, a classifier is trained on the vectorized texts and used to classify previously unseen documents. Feature selection (FS) is optionally applied to achieve better classification accuracy with a smaller number of features. Item response theory (IRT), on the other hand, is a set of statistical models designed to understand persons based on their responses to questions, assuming that the responses to a given item are a function of both person and item properties. Even though many studies are devoted to understanding, exploring, and improving the methods of each field, no previous study has aimed at combining their strengths. As such, this study proposes an IRT-based approach that uses the IRT score of a feature in both term weighting and FS, two important intermediate steps of TC. The efficiency of the proposed approach is measured on two well-known benchmark datasets by comparing it with two traditional peers. Experimental results show that the IRT-based approach can be used for text FS and that there is room for further improvement. To the best of our knowledge, this study is the first of its kind to adapt IRT to classical TC.
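To make the proposed idea concrete, the following is a minimal Python sketch of how an IRT score could drive feature selection and term weighting; it is an illustration under stated assumptions, not the paper's actual implementation. Each term is treated as an IRT "item" and each document as a "respondent", a 2PL model is fitted with the girth package (see note 3), and the magnitude of the estimated discrimination is used as the feature score. The specific girth call (twopl_mml) and its return keys, as well as the choice of discrimination as the score, are assumptions made for this example.

    # Illustrative sketch only: IRT-driven feature scoring for text categorization.
    # Assumption: girth exposes twopl_mml(items_x_respondents) and returns a dict
    # with per-item 'Discrimination' (and 'Difficulty') estimates.
    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from girth import twopl_mml

    # Binary term-document matrix on a small slice of 20 Newsgroups (see note 5).
    texts = fetch_20newsgroups(subset="train",
                               categories=["sci.space", "rec.autos"]).data
    vectorizer = CountVectorizer(max_features=500, binary=True)
    X = vectorizer.fit_transform(texts).toarray()        # shape: (documents, terms)

    # IRT convention puts items in rows and respondents in columns, so each term
    # plays the role of an item and each document the role of a respondent.
    estimates = twopl_mml(X.T.astype(int))               # fit the 2PL model (assumed API)
    irt_scores = np.abs(estimates["Discrimination"])     # one score per term (assumed key)

    # Feature selection: keep the k terms with the highest IRT scores.
    k = 100
    top_k = np.argsort(irt_scores)[::-1][:k]
    X_selected = X[:, top_k]                             # reduced matrix for the classifier

    # Term weighting (also illustrative): scale raw counts by each term's IRT score,
    # reusing the same vocabulary so the columns stay aligned with irt_scores.
    counts = CountVectorizer(vocabulary=vectorizer.vocabulary_).fit_transform(texts)
    X_weighted = counts.toarray() * irt_scores

A classifier such as multinomial naive Bayes or a linear SVM could then be trained on X_selected or X_weighted in the usual way.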


Notes

  1. https://en.wikipedia.org/wiki/Item_response_theory.

  2. https://docs.python.org/3/library/multiprocessing.html.

  3. https://pypi.org/project/girth/.

  4. https://www.nltk.org/book/ch02.html.

  5. https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html.

  6. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html.

  7. https://www.scikit-yb.org/en/latest/api/model_selection/importances.html.

  8. https://erroranalysis.ai/.

  9. https://github.com/microsoft/responsible-ai-widgets.

  10. This experiment is conducted only for \(k=900\), since the maximum number of features that can be evaluated in the raiwidgets (https://pypi.org/project/raiwidgets/) package is 1000.


Acknowledgements

The author would like to thank the referees and editors for their valuable suggestions, which contributed to the improvement of this paper.

Author information

Correspondence to Onder Coban.

Ethics declarations

Conflict of interest

The author declares that there is no conflict of interest.

Appendix

See Table 6.

Table 6 The top-ranked 50 features and their scores, listed in ascending order of score for each FS method, on the R8 dataset with tf*idf-weighted BoW features
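As a rough illustration of how such a listing can be produced, the sketch below uses scikit-learn's SelectKBest (see note 6) with the chi-square scorer as a stand-in for the FS methods compared in the paper, and 20 Newsgroups (see note 5) in place of R8, which is not bundled with scikit-learn; the selected feature names and their scores are printed in ascending order of score.

    # Illustrative only: top-50 features and their FS scores, ascending, from a
    # tf*idf-weighted BoW representation. chi2 is a stand-in for the paper's scorers.
    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    data = fetch_20newsgroups(subset="train")
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(data.data)               # tf*idf-weighted BoW

    selector = SelectKBest(chi2, k=50).fit(X, data.target)
    mask = selector.get_support()
    names = np.array(vectorizer.get_feature_names_out())[mask]
    scores = selector.scores_[mask]

    for name, score in sorted(zip(names, scores), key=lambda pair: pair[1]):
        print(f"{name}\t{score:.4f}")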


Cite this article

Coban, O. IRText: An Item Response Theory-Based Approach for Text Categorization. Arab J Sci Eng 47, 9423–9439 (2022). https://doi.org/10.1007/s13369-021-06238-7
