The hypergeometric test performs comparably to TF-IDF on standard text analysis tasks

Multimedia Tools and Applications

Abstract

Term frequency-inverse document frequency (TF-IDF) and its many variants form a class of term weighting functions that are widely used in text analysis applications. While TF-IDF was originally proposed as a heuristic, theoretical justifications grounded in information theory, probability, and the divergence from randomness paradigm have since been advanced. In this work, we present an empirical study showing that TF-IDF corresponds very closely to the hypergeometric test of statistical significance on selected real-data document retrieval, summarization, and classification tasks. These findings suggest that a fundamental mathematical connection between TF-IDF and the negative logarithm of the hypergeometric test P-value (i.e., a hypergeometric distribution tail probability) remains to be elucidated. We advance these empirical analyses as a first step toward explaining the long-standing effectiveness of TF-IDF through the lens of statistical significance testing. We hope that these results will open the door to the systematic evaluation of term weighting functions derived from significance testing in text analysis applications.
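To make the two quantities compared in the abstract concrete, the following is a minimal Python sketch, not the authors' code. It assumes one common TF-IDF variant (raw term count times the log inverse document frequency) and one plausible setup for the hypergeometric test: the population is every token in the collection, the "successes" are the collection-wide occurrences of the term, and the document is treated as a sample drawn without replacement. The function names `tf_idf`, `hypergeom_neg_log_p`, and `score_terms`, as well as the toy corpus, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Raw term frequency times log inverse document frequency (one common variant)."""
    return tf * math.log(n_docs / df)


def hypergeom_neg_log_p(k: int, n: int, K: int, N: int) -> float:
    """Return -log P(X >= k) for X ~ Hypergeometric(N, K, n).

    Assumed setup (an illustration, not taken from the paper):
    N: tokens in the collection, K: occurrences of the term in the collection,
    n: tokens in the document, k: occurrences of the term in the document.
    """
    upper = min(n, K)
    if not 0 <= k <= upper:
        raise ValueError("k must lie between 0 and min(n, K)")
    # Exact integer arithmetic avoids floating-point underflow in the tail sum.
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1))
    return math.log(math.comb(N, n)) - math.log(tail)


def score_terms(docs: list[list[str]], doc_index: int) -> dict[str, tuple[float, float]]:
    """Map each term of one document to (TF-IDF, -log hypergeometric P-value)."""
    collection_counts = Counter(tok for doc in docs for tok in doc)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    n_tokens = sum(len(doc) for doc in docs)
    doc = docs[doc_index]
    scores = {}
    for term, k in Counter(doc).items():
        scores[term] = (
            tf_idf(k, doc_freq[term], len(docs)),
            hypergeom_neg_log_p(k, len(doc), collection_counts[term], n_tokens),
        )
    return scores


# Hypothetical toy corpus of pre-tokenized documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
    "hypergeometric test for the cat".split(),
]

# Print both scores for every term of the last document, highest TF-IDF first.
for term, (tfidf, hgt) in sorted(score_terms(docs, 3).items(), key=lambda kv: -kv[1][0]):
    print(f"{term:15s} tf-idf={tfidf:.3f}  -log p={hgt:.3f}")
```

On a corpus this small the two scores only need to agree in which terms they favor (rare terms over ubiquitous ones such as "the"); whether they track each other closely on real collections is the empirical question the study addresses.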

Data Availability

The NYSK dataset is available for download at the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/NYSK. The TREC XML formatted version of the Cranfield collection was downloaded from the cranfield-trec-dataset GitHub repository (https://github.com/oussbenk/cranfield-trec-dataset, Commit Id: 1208e6edfb6cb2527b2c44398d3d8fefd3249144). The 20 Newsgroups dataset used in this study is included in the Python library sklearn 1.2.2 [70].

Code Availability

R and Python code used to analyze the data is available at https://github.com/paul-sheridan/hgt-tfidf.

References

  1. Salton G, Yang CS (1973) On the specification of term values in automatic indexing. J Doc 29(4):351–372. https://doi.org/10.1108/eb026562

  2. Rathi RN, Mustafi A (2023) The importance of term weighting in semantic understanding of text: A review of techniques. Multimedia Tools Appl 82:9761–9783. https://doi.org/10.1007/s11042-022-12538-3

  3. Robertson S (2004) Understanding inverse document frequency: On theoretical arguments for IDF. J Doc 60(5):503–520. https://doi.org/10.1108/00220410410560582

  4. Hiemstra D (2000) A probabilistic justification for using tf idf term weighting in information retrieval. Int J Digit Libr 3(2):131–139. https://doi.org/10.1007/s007999900025

  5. Amati G, Van Rijsbergen CJ (2002) Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans Inf Syst 20(4):357–389. https://doi.org/10.1145/582415.582416

  6. Aizawa A (2003) An information-theoretic perspective of tf-idf measures. Inf Process Manag 39(1):45–65. https://doi.org/10.1016/S0306-4573(02)00021-3

  7. de Vries AP, Roelleke T (2005) Relevance information: A loss of entropy but a gain for IDF? In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’05, Association for Computing Machinery, New York, NY, USA, pp 282–289. https://doi.org/10.1145/1076034.1076084

  8. Elkan C (2005) Deriving TF-IDF as a Fisher kernel. In: Proceedings of the 12th International conference on string processing and information retrieval. SPIRE’05, Springer, Berlin, Heidelberg, pp 295–300. https://doi.org/10.1007/11575832_33

  9. Roelleke T, Wang J (2006) A parallel derivation of probabilistic information retrieval models. In: Proceedings of the 29th Annual International ACM SIGIR conference on research and development in information retrieval. SIGIR ’06, Association for Computing Machinery, New York, NY, USA, pp 107–114. https://doi.org/10.1145/1148170.1148192

  10. Roelleke T, Wang J (2008) TF-IDF uncovered: A study of theories and probabilities. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’08, ACM, New York, NY, USA, pp 435–442. https://doi.org/10.1145/1390334.1390409

  11. Roelleke T (2013) Information Retrieval Models: Foundations and Relationships. Morgan & Claypool Publishers, San Rafael, CA, USA. https://doi.org/10.1007/978-3-031-02328-6

  12. Havrlant L, Kreinovich V (2017) A simple probabilistic explanation of term frequency-inverse document frequency (tf-idf) heuristic (and variations motivated by this explanation). Int J Gen Syst 46(1):27–36. https://doi.org/10.1080/03081079.2017.1291635

  13. Rivals I, Personnaz L, Taing L, Potier M-C (2007) Enrichment or depletion of a GO category within a class of genes: Which test? Bioinformatics 23(4):401–407. https://doi.org/10.1093/bioinformatics/btl633

  14. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G (2004) GO::TermFinder-open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20(18):3710. https://doi.org/10.1093/bioinformatics/bth456

  15. Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37(1):1. https://doi.org/10.1093/nar/gkn923

  16. Maere S, Heymans K, Kuiper M (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in biological networks. Bioinformatics 21(16):3448–3449. https://doi.org/10.1093/bioinformatics/bti551

  17. Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, Franz M, Grouios C, Kazi F, Lopes CT, Maitland A, Mostafavi S, Montojo J, Shao Q, Wright G, Bader GD, Morris Q (2010) The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res 38(suppl_2):214–220. https://doi.org/10.1093/nar/gkq537

  18. Zheng Q, Wang X-J (2008) GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis. Nucleic Acids Res 36(suppl_2):358–363. https://doi.org/10.1093/nar/gkn276

  19. Dermouche M, Velcin J, Khouas L, Loudcher S (2014) A joint model for topic-sentiment evolution over time. In: Proceedings of the 2014 IEEE international conference on data mining. ICDM ’14, IEEE Computer Society, Washington, DC, USA, pp 773–778. https://doi.org/10.1109/ICDM.2014.82

  20. Glasgow Information Retrieval Group: Cranfield collection. http://ir.dcs.gla.ac.uk/resources/test_collections/cran/. Accessed 23 May 2023

  21. Lang K (1995) NewsWeeder: learning to filter netnews. In: Proceedings of the 12th international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 331–339

  22. Robertson SE, Walker S, Jones S, Hancock-Beaulieu M, Gatford M (1994) Okapi at TREC-3. In: Harman DK (ed) TREC NIST Special Publication, vol. 500-225. National Institute of Standards and Technology (NIST), Gaithersburg, MD, pp 109–126

  23. Robertson S, Zaragoza H (2009) The probabilistic relevance framework: BM25 and beyond. Found Trends Inf Retr 3(4):333–389. https://doi.org/10.1561/1500000019

  24. Lv Y, Zhai C (2011) Lower-bounding term frequency normalization. In: Proceedings of the 20th ACM international conference on information and knowledge management. CIKM ’11, Association for Computing Machinery, New York, NY, USA, pp 7–16. https://doi.org/10.1145/2063576.2063584

  25. Jimenez S, Cucerzan S-P, Gonzalez FA, Gelbukh A, Dueñas G, Pinto D, Singh VK, Villavicencio A, Mayr-Schlegel P, Stamatatos E (2018) BM25-CTF: Improving TF and IDF factors in BM25 by using collection term frequencies. J Intell Fuzzy Syst 34(5):2887–2899. https://doi.org/10.3233/JIFS-169475

  26. Joachims T (1997) A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of the fourteenth international conference on machine learning. ICML ’97, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 143–151

  27. Sabbah T, Selamat A, Selamat MH, Al-Anzi FS, Herrera-Viedma EE, Krejcar O, Fujita H (2017) Modified frequency-based term weighting schemes for text classification. Appl Soft Comput 58:193–206. https://doi.org/10.1016/J.ASOC.2017.04.069

  28. Kim S-W, Gil J-M (2019) Research paper classification systems based on TF-IDF and LDA schemes. Hum Centric Comput Inf Sci 9(1):30. https://doi.org/10.1186/s13673-019-0192-7

  29. Jiang Z, Gao B, He Y, Han Y, Doyle P, Zhu Q (2021) Text classification using novel term weighting scheme-based improved TF-IDF for internet media reports. Math Probl Eng 2021:1–30. https://doi.org/10.1155/2021/6619088

  30. Chawla S, Kaur R, Aggarwal P (2023) Text classification framework for short text based on TFIDF-FastText. Multimedia Tools Appl. https://doi.org/10.1007/s11042-023-15211-5

  31. Bafna P, Pramod D, Vaidya A (2016) Document clustering: TF-IDF approach. In: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), pp 61–66. https://doi.org/10.1109/ICEEOT.2016.7754750

  32. Marcińczuk M, Gniewkowski M, Walkowiak T, Bedkowski M (2021) Text document clustering: Wordnet vs. TF-IDF vs. word embeddings. In: Proceedings of the 11th Global Wordnet Conference, Global Wordnet Association, University of South Africa (UNISA), pp 207–214

  33. Thielmann A, Weisser C, Kneib T, Säfken B (2023) Coherence based document clustering. In: 2023 IEEE 17th International Conference on Semantic Computing (ICSC), pp 9–16. https://doi.org/10.1109/ICSC56153.2023.00009

  34. Firoozeh N, Nazarenko A, Alizon F, Daille B (2020) Keyword extraction: Issues and methods. Nat Lang Eng 26(3):259–291. https://doi.org/10.1017/S1351324919000457

  35. Koloski B, Pollak S, Škrlj B, Martinc M (2021) Extending neural keyword extraction with TF–IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Association for Computational Linguistics, Online, pp 22–29

  36. Qian Y, Jia C, Liu Y (2021) Bert-based text keyword extraction. J Phys Conf Ser 1992(4):042077. https://doi.org/10.1088/1742-6596/1992/4/042077

  37. Magdy S, Abouelseoud Y, Mikhail M (2020) Privacy preserving search index for image databases based on SURF and order preserving encryption. IET Image Process 14(5):874–881. https://doi.org/10.1049/iet-ipr.2019.0575

  38. Yang J, Jiang Y-G, Hauptmann AG, Ngo C-W (2007) Evaluating bag-of-visual-words representations in scene classification. In: Proceedings of the international workshop on workshop on multimedia information retrieval. MIR ’07, Association for Computing Machinery, New York, NY, USA, pp 197–206. https://doi.org/10.1145/1290082.1290111

  39. Moulin C, Barat C, Ducottet C (2010) Fusion of tf.idf weighted bag of visual features for image classification. In: 2010 International Workshop on Content Based Multimedia Indexing (CBMI), pp 1–6. https://doi.org/10.1109/CBMI.2010.5529901

  40. Suzuki Y, Mitsukawa M, Kawagoe K (2008) A image retrieval method using TFIDF based weighting scheme. In: 2008 19th International Workshop on Database and Expert Systems Applications, pp 112–116. https://doi.org/10.1109/DEXA.2008.106

  41. Kondylidis N, Tzelepi M, Tefas A (2018) Exploiting Tf-Idf in deep convolutional neural networks for content based image retrieval. Multimedia Tools Appl 77(23):30729–30748

  42. Chum O, Philbin J, Zisserman A (2008) Near duplicate image detection: min-hash and tf-idf weighting. In: Everingham M, Needham CJ, Fraile R (eds) Proceedings of the British Machine Vision Conference 2008, British Machine Vision Association, Leeds, UK, pp 1–10. https://doi.org/10.5244/C.22.50

  43. Kaur G, Singh N, Kumar M (2022) Image forgery techniques: A review. Artif Intell Rev 56(2):1577–1625. https://doi.org/10.1007/s10462-022-10211-7

  44. Shan W, Yi Y, Huang R, Xie Y (2019) Robust contrast enhancement forensics based on convolutional neural networks. Signal Process Image Commun 71:138–146. https://doi.org/10.1016/j.image.2018.11.011

  45. Koul S, Kumar M, Khurana SS, Mushtaq F, Kumar K (2022) An efficient approach for copy-move image forgery detection using convolution neural network. Multimedia Tools Appl 81(8):11259–11277. https://doi.org/10.1007/s11042-022-11974-5

  46. Walia S, Kumar K, Kumar M (2022) Unveiling digital image forgeries using markov based quaternions in frequency domain and fusion of machine learning algorithms. Multimedia Tools Appl 82(3):4517–4532. https://doi.org/10.1007/s11042-022-13610-8

  47. Bansal M, Kumar M, Kumar M (2021) Performance comparison of various feature extraction methods for object recognition on Caltech-101 Image dataset. In: Choudhary A, Agrawal AP, Logeswaran R, Unhelkar B (eds) Applications of artificial intelligence and machine learning. Springer, Singapore, pp 289–303

  48. Bansal M, Kumar M, Sachdeva M, Mittal A (2021) Transfer learning for image classification using VGG19: Caltech-101 image data set. J Ambient Intell Humanized Comput 14:3609–3620

  49. Shaheed K, Mao A, Qureshi I, Kumar M, Hussain S, Ullah I, Zhang X (2022) DS-CNN: A pre-trained xception model based on depth-wise separable convolutional neural network for finger vein recognition. Expert Syst Appl 191(C). https://doi.org/10.1016/j.eswa.2021.116288

  50. Arnesia PD, Madenda S (2012) Matching images with textual document using TFIDF method. In: 2012 5th International Congress on Image and Signal Processing, pp 1283–1289. https://doi.org/10.1109/CISP.2012.6469720

  51. Schneider F, Biemann C (2022) Golden retriever: A real-time multi-modal text-image retrieval system with the ability to focus. In: Proceedings of the 45th International ACM SIGIR conference on research and development in information retrieval. SIGIR ’22, Association for Computing Machinery, New York, NY, USA, pp 3245–3250. https://doi.org/10.1145/3477495.3531666

  52. Xie Z, Liu L, Wu Y, Li L, Zhong L (2022) Learning TFIDF enhanced joint embedding for recipe-image cross-modal retrieval service. IEEE Trans Serv Comput 15(6):3304–3316. https://doi.org/10.1109/TSC.2021.3098834

  53. Pavlopoulos J, Kougia V, Androutsopoulos I (2019) A survey on biomedical image captioning. In: Proceedings of the second workshop on shortcomings in vision and language. Association for Computational Linguistics, Minneapolis, Minnesota, pp 26–36. https://doi.org/10.18653/v1/W19-1803

  54. Krishnan A, Rajesh S, SS S (2021) Text-based image retrieval using captioning. In: 2021 Fourth International Conference on Electrical, Computer and Communication Technologies (ICECCT), pp 1–5. https://doi.org/10.1109/ICECCT52121.2021.9616897

  55. Masciari E, Moscato V, Picariello A, Sperlí G (2020) Detecting fake news by image analysis. In: Proceedings of the 24th Symposium on International Database Engineering & Applications. IDEAS ’20. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3410566.3410599

  56. Choraś M, Demestichas K, Giełczyk A, Herrero Á, Ksieniewicz P, Remoundou K, Urda D, Woźniak M (2021) Advanced machine learning techniques for fake news (online disinformation) detection: A systematic mapping study. Appl Soft Comput 101:107050. https://doi.org/10.1016/j.asoc.2020.107050

  57. Mangolin RB, Pereira RM, Britto AS, Silla CN, Feltrim VD, Bertolini D, Costa YMG (2022) A multimodal approach for multi-label movie genre classification. Multimedia Tools Appl 81(14):19071–19096. https://doi.org/10.1007/s11042-020-10086-2

  58. Rajput NK, Grover BA (2022) A multi-label movie genre classification scheme based on the movie’s subtitles. Multimedia Tools Appl 81(22):32469–32490. https://doi.org/10.1007/s11042-022-12961-6

  59. Giveki D (2021) Scale-space multi-view bag of words for scene categorization. Multimedia Tools Appl 80(1):1223–1245. https://doi.org/10.1007/s11042-020-09759-9

  60. Kannao R, Guha P, Chaudhuri BB (2022) Only overlay text: novel features for TV news broadcast video segmentation. Multimedia Tools Appl 81(21):30493–30517. https://doi.org/10.1007/s11042-022-12917-w

  61. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186. https://doi.org/10.18653/v1/N19-1423

  62. von der Mosel J, Trautsch A, Herbold S (2023) On the validity of pre-trained transformers for natural language processing in the software engineering domain. IEEE Trans Softw Eng 49(4):1487–1507. https://doi.org/10.1109/TSE.2022.3178469

  63. Robertson S (1974) Specificity and weighted retrieval. J Doc 30(1):41–46

  64. Wong SKM, Yao Y (1992) An information-theoretic measure of term specificity. J Am Soc Inf Sci 43(1):54–61. https://doi.org/10.1002/(SICI)1097-4571(199201)43:1<54::AID-ASI5>3.0.CO;2-A

  65. Robertson S, Jones KS (1976) Relevance weighting of search terms. J Am Soc Inf Sci 27(3):129–146. https://doi.org/10.1002/asi.4630270302

  66. Manning CD, Raghavan P, Schütze H (2008) Introduction to Information Retrieval, pp 227–228. Cambridge University Press, New York, NY, USA. Chap. 11. https://doi.org/10.1017/CBO9780511809071

  67. Wu HC, Luk RWP, Wong KF, Kwok KL (2008) Interpreting TF-IDF term weights as making relevance decisions. ACM Trans Inf Syst 26(3):13:1–13:37. https://doi.org/10.1145/1361684.1361686

  68. Dua D, Graff C (2023) UCI Machine Learning Repository. https://archive-beta.ics.uci.edu

  69. Oussama BK (2022) Cranfield collection in TREC XML format. GitHub. https://github.com/oussbenk/cranfield-trec-dataset

  70. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830

  71. Irvine A, Callison-Burch C (2017) A comprehensive analysis of bilingual lexicon induction. Comput Linguist 43(2):273–310. https://doi.org/10.1162/COLI_a_00284

  72. Cao J, Zhang S (2014) A bayesian extension of the hypergeometric test for functional enrichment analysis. Biometrics 70(1):84–94. https://doi.org/10.1111/biom.12122

  73. Cao J (2017) Bayesian functional enrichment analysis for the Reactome database. Stat Theory Relat Fields 1(2):185–193. https://doi.org/10.1080/24754269.2017.1387444

  74. Fan R, Cui Q (2021) Toward comprehensive functional analysis of gene lists weighted by gene essentiality scores. Bioinformatics 37(23):4399–4404. https://doi.org/10.1093/bioinformatics/btab475

  75. Onsjö M, Sheridan P (2020) Theme enrichment analysis: A statistical test for identifying significantly enriched themes in a list of stories with an application to the Star Trek television franchise. Digit Studies/le champ numérique 10(1):1. https://doi.org/10.16995/dscn.316

Author information

Corresponding author

Correspondence to Paul Sheridan.

Ethics declarations

Conflict of interest/Competing interests

The authors declare no conflict of interest/competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sheridan, P., Onsjö, M. The hypergeometric test performs comparably to TF-IDF on standard text analysis tasks. Multimed Tools Appl 83, 28875–28890 (2024). https://doi.org/10.1007/s11042-023-16615-z

  • DOI: https://doi.org/10.1007/s11042-023-16615-z
