Evaluation of the efficiency of the chi-square metric

Automatic Documentation and Mathematical Linguistics

Abstract

The efficiency of the chi-square metric for weighting terms in text documents is evaluated. The procedure comprises selecting and preprocessing texts of class C and ~C, compiling a reference dictionary and calculating scores for all of its terms, calculating χ² coefficients for the terms of a class C text, and calculating an overall efficiency factor as the sum of the coefficients obtained for the terms that appear in the reference dictionary. Weighting by the χ² formula, by the odds-ratio (OR) formula, and on the basis of probabilistic variables is analyzed and compared. OR-based weighting yielded the best result.
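
The abstract does not reproduce the formulas, so the following is only a minimal sketch of how χ² and OR weights for a term could be computed from a 2×2 contingency table, and how an efficiency factor could be summed over reference-dictionary terms. The cell definitions (a, b, c, d), the names `TermCounts`, `chi_square`, `odds_ratio`, and `efficiency`, and the `smoothing` parameter are all assumptions made for illustration; the article's exact procedure may differ.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class TermCounts:
    """Assumed 2x2 contingency table for one term (hypothetical layout)."""
    a: int  # occurrences of the term in the class C text
    b: int  # occurrences of the term in the ~C texts
    c: int  # occurrences of all other terms in the class C text
    d: int  # occurrences of all other terms in the ~C texts


def chi_square(t: TermCounts) -> float:
    """Standard chi-square statistic for a 2x2 table (no continuity correction)."""
    n = t.a + t.b + t.c + t.d
    denom = (t.a + t.b) * (t.c + t.d) * (t.a + t.c) * (t.b + t.d)
    if denom == 0:
        # A zero row or column total makes the statistic undefined; return 0.
        return 0.0
    return n * (t.a * t.d - t.b * t.c) ** 2 / denom


def odds_ratio(t: TermCounts, smoothing: float = 0.5) -> float:
    """Odds ratio (a*d)/(b*c); `smoothing` is an assumed guard against zero cells."""
    return ((t.a + smoothing) * (t.d + smoothing)) / (
        (t.b + smoothing) * (t.c + smoothing)
    )


def efficiency(weights: Dict[str, float], reference_terms: Set[str]) -> float:
    """Sum the weights of those terms that also occur in the reference dictionary."""
    return sum(w for term, w in weights.items() if term in reference_terms)
```

Under these assumptions, a weighting scheme that assigns larger coefficients to reference-dictionary terms produces a larger `efficiency` value, which is one way the three schemes could be compared.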



Author information

Corresponding author

Correspondence to V. A. Yatsko.

Additional information

Original Russian Text © V.A. Yatsko, 2016, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2: Informatsionnye Protsessy i Sistemy, 2016, No. 7, pp. 24–29.

About this article

Cite this article

Yatsko, V.A. Evaluation of the efficiency of the chi-square metric. Autom. Doc. Math. Linguist. 50, 173–178 (2016). https://doi.org/10.3103/S0005105516040051

  • DOI: https://doi.org/10.3103/S0005105516040051
