Abstract
The efficiency of the chi-square metric for weighting terms in text documents is evaluated. The procedure includes the selection and advanced processing of class C and ~C texts, compilation of a reference dictionary and calculation of scores for all of its terms, calculation of χ² coefficients for the terms of a class C text, and calculation of a general efficiency factor as the sum of the coefficients found for the terms from the reference dictionary. Weighting by the χ² formula, by the odds-ratio (OR) formula, and on the basis of probabilistic variables is analyzed and compared. OR-based weighting was found to yield the best result.
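The abstract does not reproduce the formulas, so the following is only a minimal sketch of how χ² and OR weights for a term can be derived from a 2×2 contingency table of token counts in class C and ~C. It assumes the standard Pearson χ² statistic without continuity correction and the standard odds ratio; the additive smoothing constant, the toy token lists, and all function names are hypothetical and are not taken from the article.

```python
from collections import Counter


def contingency(term, c_counts, not_c_counts):
    """Return (a, b, c, d): counts of the term and of all other tokens in C and in ~C."""
    a = c_counts[term]
    b = sum(c_counts.values()) - a
    c = not_c_counts[term]
    d = sum(not_c_counts.values()) - c
    return a, b, c, d


def chi_square(a, b, c, d):
    """Standard Pearson chi-square for a 2x2 table (assumed form, no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def odds_ratio(a, b, c, d, smoothing=0.5):
    """Odds ratio; the additive smoothing is a hypothetical choice to avoid division by zero."""
    return ((a + smoothing) * (d + smoothing)) / ((b + smoothing) * (c + smoothing))


# Toy token lists standing in for class C and ~C texts (illustrative data only).
class_c = Counter("diet weight diet health food diet".split())
not_c = Counter("market stock weight price market trade".split())

# Weight every term of the class C text, then sum the weights into one overall score.
chi_weights = {t: chi_square(*contingency(t, class_c, not_c)) for t in class_c}
or_weights = {t: odds_ratio(*contingency(t, class_c, not_c)) for t in class_c}
total_chi = sum(chi_weights.values())

for term in class_c:
    print(f"{term}: chi2={chi_weights[term]:.3f}, OR={or_weights[term]:.3f}")
print(f"sum of chi2 weights: {total_chi:.3f}")
```

In this toy data, a term that occurs equally often in both classes ("weight") receives a χ² weight of zero, while a term found only in class C ("diet") receives the highest weight under both schemes.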
Additional information
Original Russian Text © V.A. Yatsko, 2016, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2: Informatsionnye Protsessy i Sistemy, 2016, No. 7, pp. 24–29.
About this article
Cite this article
Yatsko, V.A. Evaluation of the efficiency of the chi-square metric. Autom. Doc. Math. Linguist. 50, 173–178 (2016). https://doi.org/10.3103/S0005105516040051
DOI: https://doi.org/10.3103/S0005105516040051