The hypergeometric test performs comparably to TF-IDF on standard text analysis tasks

Sheridan, Paul; Onsjö, Mikael

doi:10.1007/s11042-023-16615-z


Published: 08 September 2023

Volume 83, pages 28875–28890 (2024)

Multimedia Tools and Applications


Abstract

Term frequency-inverse document frequency, or TF-IDF for short, and its many variants form a class of term weighting functions the members of which are widely used in text analysis applications. While TF-IDF was originally proposed as a heuristic, theoretical justifications grounded in information theory, probability, and the divergence from randomness paradigm have been advanced. In this work, we present an empirical study showing that TF-IDF corresponds very nearly with the hypergeometric test of statistical significance on selected real-data document retrieval, summarization, and classification tasks. These findings suggest that a fundamental mathematical connection between TF-IDF and the negative logarithm of the hypergeometric test P-value (i.e., a hypergeometric distribution tail probability) remains to be elucidated. We advance the empirical analyses herein as a first step toward explaining the long-standing effectiveness of TF-IDF from a statistical significance testing lens. It is our aspiration that these results will open the door to the systematic evaluation of significance testing derived term weighting functions in text analysis applications.
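The comparison described in the abstract can be illustrated with a minimal sketch. It assumes a plain TF-IDF variant (raw term count times the natural log of inverse document frequency) and a token-level urn model for the hypergeometric test, in which a document of n tokens is treated as a draw without replacement from the N tokens of the collection. The abstract does not state which TF-IDF variant or test parametrization the authors use, so the function names, the toy corpus, and the parametrization below are all illustrative assumptions rather than the paper's method.

```python
import math
from collections import Counter


def tf_idf(k, df, num_docs):
    """Plain TF-IDF: raw term count times log inverse document frequency."""
    return k * math.log(num_docs / df)


def hgt_score(k, K, n, N):
    """Negative log of the hypergeometric upper-tail P-value P(X >= k).

    Assumed urn model: N tokens in the collection, K of them are the term,
    and the document is a draw of n tokens without replacement.
    """
    if k <= 0:
        return 0.0  # P(X >= 0) = 1, so the score is zero
    tail = sum(
        math.comb(K, i) * math.comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    # Exact integer arithmetic; math.log accepts arbitrarily large ints.
    return math.log(math.comb(N, n)) - math.log(tail)


def score_document(docs, index):
    """Return {term: (tf_idf, hgt_score)} for the document at `index`."""
    collection_counts = Counter(tok for doc in docs for tok in doc)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    N = sum(collection_counts.values())
    doc_counts = Counter(docs[index])
    n = len(docs[index])
    return {
        term: (
            tf_idf(k, doc_freq[term], len(docs)),
            hgt_score(k, collection_counts[term], n, N),
        )
        for term, k in doc_counts.items()
    }


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the cat".split(),
]
scores = score_document(docs, 2)
for term, (w_tfidf, w_hgt) in sorted(scores.items()):
    print(f"{term:>7}  tf-idf={w_tfidf:.3f}  hgt={w_hgt:.3f}")
```

On a corpus this small the two scores need not rank terms identically; the correspondence reported in the abstract concerns the real-data retrieval, summarization, and classification tasks studied in the paper.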

Data Availability

The NYSK dataset is available for download at the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/NYSK. The TREC XML formatted version of the Cranfield collection was downloaded from the cranfield-trec-dataset GitHub repository (https://github.com/oussbenk/cranfield-trec-dataset, Commit Id: 1208e6edfb6cb2527b2c44398d3d8fefd3249144). The 20 Newsgroups dataset used in this study is included in the Python library sklearn 1.2.2 [70].

Code Availability

R and Python code used to analyze the data is available at https://github.com/paul-sheridan/hgt-tfidf.

References

Salton G, Yang CS (1973) On the specification of term values in automatic indexing. J Doc 29(4):351–372. https://doi.org/10.1108/eb026562
Rathi RN, Mustafi A (2023) The importance of term weighting in semantic understanding of text: A review of techniques. Multimedia Tools Appl 82:9761–9783. https://doi.org/10.1007/s11042-022-12538-3
Robertson S (2004) Understanding inverse document frequency: On theoretical arguments for IDF. J Doc 60(5):503–520. https://doi.org/10.1108/00220410410560582
Hiemstra D (2000) A probabilistic justification for using tf idf term weighting in information retrieval. Int J Digit Libr 3(2):131–139. https://doi.org/10.1007/s007999900025
Amati G, Van Rijsbergen CJ (2002) Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans Inf Syst 20(4):357–389. https://doi.org/10.1145/582415.582416
Aizawa A (2003) An information-theoretic perspective of tf-idf measures. Inf Process Manag 39(1):45–65. https://doi.org/10.1016/S0306-4573(02)00021-3
de Vries AP, Roelleke T (2005) Relevance information: A loss of entropy but a gain for IDF? In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’05, Association for Computing Machinery, New York, NY, USA pp 282–289. https://doi.org/10.1145/1076034.1076084
Elkan C (2005) Deriving TF-IDF as a Fisher kernel. In: Proceedings of the 12th International conference on string processing and information retrieval. SPIRE’05, Springer, Berlin, Heidelberg, pp 295–300. https://doi.org/10.1007/11575832_33
Roelleke T, Wang J (2006) A parallel derivation of probabilistic information retrieval models. In: Proceedings of the 29th Annual International ACM SIGIR conference on research and development in information retrieval. SIGIR ’06, Association for Computing Machinery, New York, NY, USA, pp 107–114. https://doi.org/10.1145/1148170.1148192
Roelleke T, Wang J (2008) TF-IDF uncovered: A study of theories and probabilities. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’08, ACM, New York, NY, USA, pp 435–442. https://doi.org/10.1145/1390334.1390409
Roelleke T (2013) Information Retrieval Models: Foundations and Relationships. Morgan & Claypool Publishers, San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA). https://doi.org/10.1007/978-3-031-02328-6
Havrlant L, Kreinovich V (2017) A simple probabilistic explanation of term frequency-inverse document frequency (tf-idf) heuristic (and variations motivated by this explanation). Int J Gen Syst 46(1):27–36. https://doi.org/10.1080/03081079.2017.1291635
Rivals I, Personnaz L, Taing L, Potier M-C (2007) Enrichment or depletion of a GO category within a class of genes: Which test? Bioinformatics 23(4):401–407. https://doi.org/10.1093/bioinformatics/btl633
Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G (2004) GO::TermFinder-open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20(18):3710. https://doi.org/10.1093/bioinformatics/bth456
Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37(1):1. https://doi.org/10.1093/nar/gkn923
Maere S, Heymans K, Kuiper M (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in biological networks. Bioinformatics 21(16):3448–3449. https://doi.org/10.1093/bioinformatics/bti551
Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, Franz M, Grouios C, Kazi F, Lopes CT, Maitland A, Mostafavi S, Montojo J, Shao Q, Wright G, Bader GD, Morris Q (2010) The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research 38(suppl_2):214–220. https://doi.org/10.1093/nar/gkq537
Zheng Q, Wang X-J (2008) GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis. Nucleic Acids Res 36(suppl_2):358–363. https://doi.org/10.1093/nar/gkn276
Dermouche M, Velcin J, Khouas L, Loudcher S (2014) A joint model for topic-sentiment evolution over time. In: Proceedings of the 2014 IEEE international conference on data mining. ICDM ’14, IEEE Computer Society, Washington, DC, USA, pp 773–778. https://doi.org/10.1109/ICDM.2014.82
Glasgow Information Retrieval Group: Cranfield collection. http://ir.dcs.gla.ac.uk/resources/test_collections/cran/. Accessed 23 May 2023
Lang K (1995) NewsWeeder: learning to filter netnews. In: Proceedings of the 12th international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 331–339
Robertson SE, Walker S, Jones S, Hancock-Beaulieu M, Gatford M (1994) Okapi at TREC-3. In: Harman DK (ed) TREC NIST Special Publication, vol. 500-225. National Institute of Standards and Technology (NIST), Gaithersburg, MD, pp 109–126
Robertson S, Zaragoza H (2009) The probabilistic relevance framework: BM25 and beyond. Found Trends Inf Retr 3(4):333–389. https://doi.org/10.1561/1500000019
Lv Y, Zhai C (2011) Lower-bounding term frequency normalization. In: Proceedings of the 20th ACM international conference on information and knowledge management. CIKM ’11, Association for Computing Machinery, New York, NY, USA, pp 7–16. https://doi.org/10.1145/2063576.2063584
Jimenez S, Cucerzan S-P, Gonzalez FA, Gelbukh A, Dueñas G, Pinto D, Singh VK, Villavicencio A, Mayr-Schlegel P, Stamatatos E (2018) BM25-CTF: Improving TF and IDF factors in BM25 by using collection term frequencies. J Intell Fuzzy Syst 34(5):2887–2899. https://doi.org/10.3233/JIFS-169475
Joachims T (1997) A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of the fourteenth international conference on machine learning. ICML ’97, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 143–151
Sabbah T, Selamat A, Selamat MH, Al-Anzi FS, Herrera-Viedma EE, Krejcar O, Fujita H (2017) Modified frequency-based term weighting schemes for text classification. Appl Soft Comput 58:193–206. https://doi.org/10.1016/J.ASOC.2017.04.069
Kim S-W, Gil J-M (2019) Research paper classification systems based on TF-IDF and LDA schemes. Hum Centric Comput Inf Sci 9(1):30. https://doi.org/10.1186/s13673-019-0192-7
Jiang Z, Gao B, He Y, Han Y, Doyle P, Zhu Q (2021) Text classification using novel term weighting scheme-based improved TF-IDF for internet media reports. Math Probl Eng 2021:1–30. https://doi.org/10.1155/2021/6619088
Chawla S, Kaur R, Aggarwal P (2023) Text classification framework for short text based on TFIDF-FastText. Multimedia Tools Appl. https://doi.org/10.1007/s11042-023-15211-5
Bafna P, Pramod D, Vaidya A (2016) Document clustering: TF-IDF approach. In: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), pp 61–66. https://doi.org/10.1109/ICEEOT.2016.7754750
Marcińczuk M, Gniewkowski M, Walkowiak T, Bedkowski M (2021) Text document clustering: Wordnet vs. TF-IDF vs. word embeddings. In: Proceedings of the 11th Global Wordnet Conference, Global Wordnet Association, University of South Africa (UNISA), pp 207–214
Thielmann A, Weisser C, Kneib T, Säfken B (2023) Coherence based document clustering. In: 2023 IEEE 17th International Conference on Semantic Computing (ICSC), pp 9–16. https://doi.org/10.1109/ICSC56153.2023.00009
Firoozeh N, Nazarenko A, Alizon F, Daille B (2020) Keyword extraction: Issues and methods. Nat Lang Eng 26(3):259–291. https://doi.org/10.1017/S1351324919000457
Koloski B, Pollak S, Škrlj B, Martinc M (2021) Extending neural keyword extraction with TF–IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Association for Computational Linguistics, Online, pp 22–29
Qian Y, Jia C, Liu Y (2021) Bert-based text keyword extraction. J Phys Conf Ser 1992(4):042077. https://doi.org/10.1088/1742-6596/1992/4/042077
Magdy S, Abouelseoud Y, Mikhail M (2020) Privacy preserving search index for image databases based on SURF and order preserving encryption. IET Image Process 14(5):874–881. https://doi.org/10.1049/iet-ipr.2019.0575
Yang J, Jiang Y-G, Hauptmann AG, Ngo C-W (2007) Evaluating bag-of-visual-words representations in scene classification. In: Proceedings of the international workshop on workshop on multimedia information retrieval. MIR ’07, Association for Computing Machinery, New York, NY, USA, pp 197–206. https://doi.org/10.1145/1290082.1290111
Moulin C, Barat C, Ducottet C (2010) Fusion of tf.idf weighted bag of visual features for image classification. In: 2010 International Workshop on Content Based Multimedia Indexing (CBMI), pp 1–6. https://doi.org/10.1109/CBMI.2010.5529901
Suzuki Y, Mitsukawa M, Kawagoe K (2008) A image retrieval method using TFIDF based weighting scheme. In: 2008 19th International Workshop on Database and Expert Systems Applications, pp 112–116. https://doi.org/10.1109/DEXA.2008.106
Kondylidis N, Tzelepi M, Tefas A (2018) Exploiting Tf-Idf in deep convolutional neural networks for content based image retrieval. Multimedia Tools Appl 77(23):30729–30748
Chum O, Philbin J, Zisserman A (2008) Near duplicate image detection: min-hash and tf-idf weighting. In: Everingham M, Needham CJ, Fraile R (eds) Proceedings of the British Machine Vision Conference 2008, British Machine Vision Association, Leeds, UK, pp 1–10 https://doi.org/10.5244/C.22.50
Kaur G, Singh N, Kumar M (2022) Image forgery techniques: A review. Artif Intell Rev 56(2):1577–1625. https://doi.org/10.1007/s10462-022-10211-7
Shan W, Yi Y, Huang R, Xie Y (2019) Robust contrast enhancement forensics based on convolutional neural networks. Signal Process Image Commun 71:138–146. https://doi.org/10.1016/j.image.2018.11.011
Koul S, Kumar M, Khurana SS, Mushtaq F, Kumar K (2022) An efficient approach for copy-move image forgery detection using convolution neural network. Multimedia Tools Appl 81(8):11259–11277. https://doi.org/10.1007/s11042-022-11974-5
Walia S, Kumar K, Kumar M (2022) Unveiling digital image forgeries using markov based quaternions in frequency domain and fusion of machine learning algorithms. Multimedia Tools Appl 82(3):4517–4532. https://doi.org/10.1007/s11042-022-13610-8
Bansal M, Kumar M, Kumar M (2021) Performance comparison of various feature extraction methods for object recognition on Caltech-101 Image dataset. In: Choudhary A, Agrawal AP, Logeswaran R, Unhelkar B (eds) Applications of artificial intelligence and machine learning. Springer, Singapore, pp 289–303
Bansal M, Kumar M, Sachdeva M, Mittal A (2021) Transfer learning for image classification using VGG19: Caltech-101 image data set. J Ambient Intell Humanized Comput 14:3609–3620
Shaheed K, Mao A, Qureshi I, Kumar M, Hussain S, Ullah I, Zhang X (2022) DS-CNN: A pre-trained xception model based on depth-wise separable convolutional neural network for finger vein recognition. Expert Syst Appl 191(C). https://doi.org/10.1016/j.eswa.2021.116288
Arnesia PD, Madenda S (2012) Matching images with textual document using TFIDF method. In: 2012 5th International Congress on Image and Signal Processing, pp 1283–1289. https://doi.org/10.1109/CISP.2012.6469720
Schneider F, Biemann C (2022) Golden retriever: A real-time multi-modal text-image retrieval system with the ability to focus. In: Proceedings of the 45th International ACM SIGIR conference on research and development in information retrieval. SIGIR ’22, Association for Computing Machinery, New York, NY, USA, pp 3245–3250. https://doi.org/10.1145/3477495.3531666
Xie Z, Liu L, Wu Y, Li L, Zhong L (2022) Learning TFIDF enhanced joint embedding for recipe-image cross-modal retrieval service. IEEE Trans Serv Comput 15(6):3304–3316. https://doi.org/10.1109/TSC.2021.3098834
Pavlopoulos J, Kougia V, Androutsopoulos I (2019) A survey on biomedical image captioning. In: Proceedings of the second workshop on shortcomings in vision and language. Association for Computational Linguistics, Minneapolis, Minnesota, pp 26–36. https://doi.org/10.18653/v1/W19-1803
Krishnan A, Rajesh S, SS S (2021) Text-based image retrieval using captioning. In: 2021 Fourth International Conference on Electrical, Computer and Communication Technologies (ICECCT), pp 1–5. https://doi.org/10.1109/ICECCT52121.2021.9616897
Masciari E, Moscato V, Picariello A, Sperlí G (2020) Detecting fake news by image analysis. In: Proceedings of the 24th Symposium on International Database Engineering & Applications. IDEAS ’20. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3410566.3410599
Choraś M, Demestichas K, Giełczyk A, Herrero Á, Ksieniewicz P, Remoundou K, Urda D, Woźniak M (2021) Advanced machine learning techniques for fake news (online disinformation) detection: A systematic mapping study. Appl Soft Comput 101:107050. https://doi.org/10.1016/j.asoc.2020.107050
Mangolin RB, Pereira RM, Britto AS, Silla CN, Feltrim VD, Bertolini D, Costa YMG (2022) A multimodal approach for multi-label movie genre classification. Multimedia Tools Appl 81(14):19071–19096. https://doi.org/10.1007/s11042-020-10086-2
Rajput NK, Grover BA (2022) A multi-label movie genre classification scheme based on the movie’s subtitles. Multimedia Tools Appl 81(22):32469–32490. https://doi.org/10.1007/s11042-022-12961-6
Giveki D (2021) Scale-space multi-view bag of words for scene categorization. Multimedia Tools Appl 80(1):1223–1245. https://doi.org/10.1007/s11042-020-09759-9
Kannao R, Guha P, Chaudhuri BB (2022) Only overlay text: novel features for TV news broadcast video segmentation. Multimedia Tools Appl 81(21):30493–30517. https://doi.org/10.1007/s11042-022-12917-w
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186. https://doi.org/10.18653/v1/N19-1423
von der Mosel J, Trautsch A, Herbold S (2023) On the validity of pre-trained transformers for natural language processing in the software engineering domain. IEEE Trans Softw Eng 49(4):1487–1507. https://doi.org/10.1109/TSE.2022.3178469
Robertson S (1974) Specificity and weighted retrieval. J Doc 30(1):41–46
Wong SKM, Yao Y (1992) An information-theoretic measure of term specificity. J Am Soc Inf Sci 43(1):54–61. https://doi.org/10.1002/(SICI)1097-4571(199201)43:1<54::AID-ASI5>3.0.CO;2-A
Robertson S, Jones KS (1976) Relevance weighting of search terms. J Am Soc Inf Sci 27(3):129–146. https://doi.org/10.1002/asi.4630270302
Manning CD, Raghavan P, Schütze H (2008) Introduction to Information Retrieval, pp 227–228. Cambridge University Press, New York, NY, USA. Chap. 11. https://doi.org/10.1017/CBO9780511809071
Wu HC, Luk RWP, Wong KF, Kwok KL (2008) Interpreting TF-IDF term weights as making relevance decisions. ACM Trans Inf Syst 26(3):13:1–13:37. https://doi.org/10.1145/1361684.1361686
Dua D, Graff C (2023) UCI Machine Learning Repository. https://archive-beta.ics.uci.edu
Oussama BK (2022) Cranfield collection in TREC XML format. GitHub. https://github.com/oussbenk/cranfield-trec-dataset
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830
Irvine A, Callison-Burch C (2017) A comprehensive analysis of bilingual lexicon induction. Comput Linguist 43(2):273–310. https://doi.org/10.1162/COLI_a_00284
Cao J, Zhang S (2014) A bayesian extension of the hypergeometric test for functional enrichment analysis. Biometrics 70(1):84–94. https://doi.org/10.1111/biom.12122
Cao J (2017) Bayesian functional enrichment analysis for the Reactome database. Stat Theory Relat Fields 1(2):185–193. https://doi.org/10.1080/24754269.2017.1387444
Fan R, Cui Q (2021) Toward comprehensive functional analysis of gene lists weighted by gene essentiality scores. Bioinformatics 37(23):4399–4404. https://doi.org/10.1093/bioinformatics/btab475
Onsjö M, Sheridan P (2020) Theme enrichment analysis: A statistical test for identifying significantly enriched themes in a list of stories with an application to the Star Trek television franchise. Digit Studies/le champ numérique 10(1):1. https://doi.org/10.16995/dscn.316


Author information

Authors and Affiliations

School of Mathematical and Computational Sciences, University of Prince Edward Island, 550 University Ave, Charlottetown, C1A 4P3, Prince Edward Island, Canada
Paul Sheridan
Independent Researcher, London, UK
Mikael Onsjö


Corresponding author

Correspondence to Paul Sheridan.

Ethics declarations

Conflict of interest/Competing interests

The authors declare no conflict of interest/competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Sheridan, P., Onsjö, M. The hypergeometric test performs comparably to TF-IDF on standard text analysis tasks. Multimed Tools Appl 83, 28875–28890 (2024). https://doi.org/10.1007/s11042-023-16615-z


Received: 02 August 2022
Revised: 03 June 2023
Accepted: 21 August 2023
Published: 08 September 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s11042-023-16615-z
