IRText: An Item Response Theory-Based Approach for Text Categorization

Coban, Onder

doi:10.1007/s13369-021-06238-7

Research Article-Computer Engineering and Computer Science
Published: 05 October 2021

Volume 47, pages 9423–9439 (2022)

Arabian Journal for Science and Engineering

Onder Coban (ORCID: orcid.org/0000-0001-9404-2583)

Abstract

Text categorization (TC) is a machine learning task that assigns a text to one of a set of predefined categories. In a nutshell, texts are converted into numerical feature vectors in which each feature is associated with a weight. A classifier is then trained on the vectorized texts and used to classify previously unseen documents. Feature selection (FS) is optionally applied to achieve better classification accuracy with fewer features. Item response theory (IRT), on the other hand, is a set of statistical models designed to understand persons based on their responses to questions, under the assumption that the responses to a given item are a function of both person and item properties. Although many studies are devoted to understanding, exploring, and improving the methods of each field, no previous study has aimed to combine the strengths of the two. This study therefore proposes an IRT-based approach that uses the IRT score of a feature in both term weighting and FS, which are important intermediate steps of TC. The efficiency of the proposed approach is measured on two well-known benchmark datasets by comparing it with two traditional counterparts. Experimental results show that the IRT-based approach can be used for text FS and that there is room for further improvement. To the best of our knowledge, this study is the first to adapt IRT to classical TC.
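
As a rough illustration of the pipeline outlined in the abstract, the following Python sketch computes tf*idf weights for a toy corpus, keeps the top-k features by a score, and evaluates the two-parameter logistic (2PL) item response function, which is one common IRT model. Everything in it is an assumption made for illustration: the function names (tfidf_vectors, select_top_k, two_pl), the whitespace tokenizer, the toy documents, the stand-in feature score, and the choice of the 2PL model are not taken from the article. The sketch does not reproduce how IRText derives the IRT score of a feature.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, per-document {term: tf * log(N / df)} weights).

    Hypothetical helper: a plain whitespace tokenizer and the basic
    tf*idf formula, not necessarily the variant used in the article.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return sorted(df), vectors


def two_pl(theta, a, b):
    """2PL item response function: probability of a positive response.

    theta: person ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def select_top_k(scores, k):
    """Filter-style feature selection: keep the k highest-scoring features."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy usage. The score below (maximum tf*idf weight per term) is only a
# stand-in for whatever per-feature score a real FS method would supply.
docs = ["wheat prices rise", "oil prices fall", "wheat harvest grows"]
vocab, vectors = tfidf_vectors(docs)
scores = {t: max(v.get(t, 0.0) for v in vectors) for t in vocab}
top = select_top_k(scores, 3)
```

In a setup like this, an IRT-derived score per feature would take the place of the stand-in score passed to select_top_k, and could likewise replace the tf*idf weight inside the vectors. How such a score is estimated is not stated in this preview.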

Notes

https://en.wikipedia.org/wiki/Item_response_theory.
https://docs.python.org/3/library/multiprocessing.html.
https://pypi.org/project/girth/.
https://www.nltk.org/book/ch02.html.
https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html.
https://www.scikit-yb.org/en/latest/api/model_selection/importances.html.
https://erroranalysis.ai/.
https://github.com/microsoft/responsible-ai-widgets.
This experiment is conducted only for \(k=900\) because the raiwidgets package (https://pypi.org/project/raiwidgets/) can evaluate at most 1000 features.

Acknowledgements

The author would like to thank the referees and editors for their valuable suggestions, which contributed to the improvement of this paper.

Author information

Authors and Affiliations

Department of Computer Engineering, Adiyaman University, Adıyaman, Turkey
Onder Coban

Corresponding author

Correspondence to Onder Coban.

Ethics declarations

Conflict of interest

The author declares that there is no conflict of interest.

Appendix

See Table 6.

Table 6 Top-ranked 50 features with their scores in ascending order with respect to FS method on the R8 dataset processed with tf*idf weighted bow features

About this article

Cite this article

Coban, O. IRText: An Item Response Theory-Based Approach for Text Categorization. Arab J Sci Eng 47, 9423–9439 (2022). https://doi.org/10.1007/s13369-021-06238-7

Received: 10 June 2021
Accepted: 20 September 2021
Published: 05 October 2021
Issue Date: August 2022
DOI: https://doi.org/10.1007/s13369-021-06238-7
