Abstract
Term frequency-inverse document frequency, or TF-IDF for short, and its many variants form a class of term weighting functions that are widely used in text analysis applications. While TF-IDF was originally proposed as a heuristic, theoretical justifications grounded in information theory, probability, and the divergence from randomness paradigm have since been advanced. In this work, we present an empirical study showing that TF-IDF corresponds very nearly with the hypergeometric test of statistical significance on selected real-data document retrieval, summarization, and classification tasks. These findings suggest that a fundamental mathematical connection between TF-IDF and the negative logarithm of the hypergeometric test P-value (i.e., a hypergeometric distribution tail probability) remains to be elucidated. We advance the empirical analyses herein as a first step toward explaining the long-standing effectiveness of TF-IDF through the lens of statistical significance testing. It is our aspiration that these results will open the door to the systematic evaluation of term weighting functions derived from significance tests in text analysis applications.
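To make the comparison concrete, the following is a minimal Python sketch, not the authors' code (which is linked under Code Availability). It assumes one common TF-IDF variant (raw term frequency times the natural logarithm of inverse document frequency) and a token-level hypergeometric model in which a document of n tokens is treated as a draw without replacement from a collection of N tokens, K of which are occurrences of the term of interest. The function names and toy numbers are illustrative assumptions, and the exact variants studied in the paper may differ.

```python
import math


def tf_idf(tf, df, n_docs):
    """TF-IDF under an assumed variant: raw term frequency times ln(n_docs / df)."""
    return tf * math.log(n_docs / df)


def hypergeom_neg_log_pvalue(k, K, n, N):
    """Return -ln P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: total number of tokens in the collection
    K: occurrences of the term in the collection
    n: number of tokens in the document
    k: occurrences of the term in the document
    """
    upper = min(K, n)
    # Exact integer arithmetic for the upper tail; fine for small toy inputs,
    # but the ratio can underflow to 0.0 on realistic collection sizes.
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1))
    return -math.log(tail / math.comb(N, n))


if __name__ == "__main__":
    # Hypothetical toy numbers: a 10-token document in a 100-token, 20-document
    # collection; the term occurs 5 times overall, in 2 distinct documents.
    for k in range(1, 6):
        print(k, round(tf_idf(k, 2, 20), 3), round(hypergeom_neg_log_pvalue(k, 5, 10, 100), 3))
```

Both scores increase with the number of times the term occurs in the document, which is the qualitative behaviour the abstract describes. For realistic collection sizes, a log-space tail computation would be needed to avoid floating-point underflow.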
Data Availability
The NYSK dataset is available for download at the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/NYSK. The TREC XML formatted version of the Cranfield collection was downloaded from the cranfield-trec-dataset GitHub repository (https://github.com/oussbenk/cranfield-trec-dataset, Commit Id: 1208e6edfb6cb2527b2c44398d3d8fefd3249144). The 20 Newsgroups dataset used in this study is included in the Python library sklearn 1.2.2 [70].
Code Availability
R and Python code used to analyze the data is available at https://github.com/paul-sheridan/hgt-tfidf.
References
Salton G, Yang CS (1973) On the specification of term values in automatic indexing. J Doc 29(4):351–372. https://doi.org/10.1108/eb026562
Rathi RN, Mustafi A (2023) The importance of term weighting in semantic understanding of text: A review of techniques. Multimedia Tools Appl 82:9761–9783. https://doi.org/10.1007/s11042-022-12538-3
Robertson S (2004) Understanding inverse document frequency: On theoretical arguments for IDF. J Doc 60(5):503–520. https://doi.org/10.1108/00220410410560582
Hiemstra D (2000) A probabilistic justification for using tf idf term weighting in information retrieval. Int J Digit Libr 3(2):131–139. https://doi.org/10.1007/s007999900025
Amati G, Van Rijsbergen CJ (2002) Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans Inf Syst 20(4):357–389. https://doi.org/10.1145/582415.582416
Aizawa A (2003) An information-theoretic perspective of tf-idf measures. Inf Process Manag 39(1):45–65. https://doi.org/10.1016/S0306-4573(02)00021-3
de Vries AP, Roelleke T (2005) Relevance information: A loss of entropy but a gain for IDF? In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’05, Association for Computing Machinery, New York, NY, USA pp 282–289. https://doi.org/10.1145/1076034.1076084
Elkan C (2005) Deriving TF-IDF as a Fisher kernel. In: Proceedings of the 12th International conference on string processing and information retrieval. SPIRE’05, Springer, Berlin, Heidelberg, pp 295–300. https://doi.org/10.1007/11575832_33
Roelleke T, Wang J (2006) A parallel derivation of probabilistic information retrieval models. In: Proceedings of the 29th Annual International ACM SIGIR conference on research and development in information retrieval. SIGIR ’06, Association for Computing Machinery, New York, NY, USA, pp 107–114. https://doi.org/10.1145/1148170.1148192
Roelleke T, Wang J (2008) TF-IDF uncovered: A study of theories and probabilities. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’08, ACM, New York, NY, USA, pp 435–442. https://doi.org/10.1145/1390334.1390409
Roelleke T (2013) Information Retrieval Models: Foundations and Relationships. Morgan & Claypool Publishers, San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA). https://doi.org/10.1007/978-3-031-02328-6
Havrlant L, Kreinovich V (2017) A simple probabilistic explanation of term frequency-inverse document frequency (tf-idf) heuristic (and variations motivated by this explanation). Int J Gen Syst 46(1):27–36. https://doi.org/10.1080/03081079.2017.1291635
Rivals I, Personnaz L, Taing L, Potier M-C (2007) Enrichment or depletion of a GO category within a class of genes: Which test? Bioinformatics 23(4):401–407. https://doi.org/10.1093/bioinformatics/btl633
Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G (2004) GO::TermFinder-open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20(18):3710. https://doi.org/10.1093/bioinformatics/bth456
Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37(1):1. https://doi.org/10.1093/nar/gkn923
Maere S, Heymans K, Kuiper M (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in biological networks. Bioinformatics 21(16):3448–3449. https://doi.org/10.1093/bioinformatics/bti551
Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, Franz M, Grouios C, Kazi F, Lopes CT, Maitland A, Mostafavi S, Montojo J, Shao Q, Wright G, Bader GD, Morris Q (2010) The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research 38(suppl_2):214–220. https://doi.org/10.1093/nar/gkq537
Zheng Q, Wang X-J (2008) GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis. Nucleic Acids Res 36(suppl_2):358–363. https://doi.org/10.1093/nar/gkn276
Dermouche M, Velcin J, Khouas L, Loudcher S (2014) A joint model for topic-sentiment evolution over time. In: Proceedings of the 2014 IEEE international conference on data mining. ICDM ’14, IEEE Computer Society, Washington, DC, USA, pp 773–778. https://doi.org/10.1109/ICDM.2014.82
Glasgow Information Retrieval Group: Cranfield collection. http://ir.dcs.gla.ac.uk/resources/test_collections/cran/. Accessed 23 May 2023
Lang K (1995) NewsWeeder: learning to filter netnews. In: Proceedings of the 12th international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 331–339
Robertson SE, Walker S, Jones S, Hancock-Beaulieu M, Gatford M (1994) Okapi at TREC-3. In: Harman DK (ed) TREC NIST Special Publication, vol. 500-225. National Institute of Standards and Technology (NIST), Gaithersburg, MD, pp 109–126
Robertson S, Zaragoza H (2009) The probabilistic relevance framework: BM25 and beyond. Found Trends Inf Retr 3(4):333–389. https://doi.org/10.1561/1500000019
Lv Y, Zhai C (2011) Lower-bounding term frequency normalization. In: Proceedings of the 20th ACM international conference on information and knowledge management. CIKM ’11, Association for Computing Machinery, New York, NY, USA, pp 7–16. https://doi.org/10.1145/2063576.2063584
Jimenez S, Cucerzan S-P, Gonzalez FA, Gelbukh A, Dueñas G, Pinto D, Singh VK, Villavicencio A, Mayr-Schlegel P, Stamatatos E (2018) BM25-CTF: Improving TF and IDF factors in BM25 by using collection term frequencies. J Intell Fuzzy Syst 34(5):2887–2899. https://doi.org/10.3233/JIFS-169475
Joachims T (1997) A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of the fourteenth international conference on machine learning. ICML ’97, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 143–151
Sabbah T, Selamat A, Selamat MH, Al-Anzi FS, Herrera-Viedma EE, Krejcar O, Fujita H (2017) Modified frequency-based term weighting schemes for text classification. Appl Soft Comput 58:193–206. https://doi.org/10.1016/J.ASOC.2017.04.069
Kim S-W, Gil J-M (2019) Research paper classification systems based on TF-IDF and LDA schemes. Hum Centric Comput Inf Sci 9(1):30. https://doi.org/10.1186/s13673-019-0192-7
Jiang Z, Gao B, He Y, Han Y, Doyle P, Zhu Q (2021) Text classification using novel term weighting scheme-based improved TF-IDF for internet media reports. Math Probl Eng 2021:1–30. https://doi.org/10.1155/2021/6619088
Chawla S, Kaur R, Aggarwal P (2023) Text classification framework for short text based on TFIDF-FastText. Multimedia Tools Appl. https://doi.org/10.1007/s11042-023-15211-5
Bafna P, Pramod D, Vaidya A (2016) Document clustering: TF-IDF approach. In: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), pp 61–66. https://doi.org/10.1109/ICEEOT.2016.7754750
Marcińczuk M, Gniewkowski M, Walkowiak T, Bedkowski M (2021) Text document clustering: Wordnet vs. TF-IDF vs. word embeddings. In: Proceedings of the 11th Global Wordnet Conference, Global Wordnet Association, University of South Africa (UNISA), pp 207–214
Thielmann A, Weisser C, Kneib T, Säfken B (2023) Coherence based document clustering. In: 2023 IEEE 17th International Conference on Semantic Computing (ICSC), pp 9–16. https://doi.org/10.1109/ICSC56153.2023.00009
Firoozeh N, Nazarenko A, Alizon F, Daille B (2020) Keyword extraction: Issues and methods. Nat Lang Eng 26(3):259–291. https://doi.org/10.1017/S1351324919000457
Koloski B, Pollak S, Škrlj B, Martinc M (2021) Extending neural keyword extraction with TF–IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Association for Computational Linguistics, Online, pp 22–29
Qian Y, Jia C, Liu Y (2021) Bert-based text keyword extraction. J Phys Conf Ser 1992(4):042077. https://doi.org/10.1088/1742-6596/1992/4/042077
Magdy S, Abouelseoud Y, Mikhail M (2020) Privacy preserving search index for image databases based on SURF and order preserving encryption. IET Image Process 14(5):874–881. https://doi.org/10.1049/iet-ipr.2019.0575
Yang J, Jiang Y-G, Hauptmann AG, Ngo C-W (2007) Evaluating bag-of-visual-words representations in scene classification. In: Proceedings of the international workshop on workshop on multimedia information retrieval. MIR ’07, Association for Computing Machinery, New York, NY, USA, pp 197–206. https://doi.org/10.1145/1290082.1290111
Moulin C, Barat C, Ducottet C (2010) Fusion of tf.idf weighted bag of visual features for image classification. In: 2010 International Workshop on Content Based Multimedia Indexing (CBMI), pp 1–6. https://doi.org/10.1109/CBMI.2010.5529901
Suzuki Y, Mitsukawa M, Kawagoe K (2008) A image retrieval method using TFIDF based weighting scheme. In: 2008 19th International Workshop on Database and Expert Systems Applications, pp 112–116. https://doi.org/10.1109/DEXA.2008.106
Kondylidis N, Tzelepi M, Tefas A (2018) Exploiting Tf-Idf in deep convolutional neural networks for content based image retrieval. Multimedia Tools Appl 77(23):30729–30748
Chum O, Philbin J, Zisserman A (2008) Near duplicate image detection: min-hash and tf-idf weighting. In: Everingham M, Needham CJ, Fraile R (eds) Proceedings of the British Machine Vision Conference 2008, British Machine Vision Association, Leeds, UK, pp 1–10 https://doi.org/10.5244/C.22.50
Kaur G, Singh N, Kumar M (2022) Image forgery techniques: A review. Artif Intell Rev 56(2):1577–1625. https://doi.org/10.1007/s10462-022-10211-7
Shan W, Yi Y, Huang R, Xie Y (2019) Robust contrast enhancement forensics based on convolutional neural networks. Signal Process Image Commun 71:138–146. https://doi.org/10.1016/j.image.2018.11.011
Koul S, Kumar M, Khurana SS, Mushtaq F, Kumar K (2022) An efficient approach for copy-move image forgery detection using convolution neural network. Multimedia Tools Appl 81(8):11259–11277. https://doi.org/10.1007/s11042-022-11974-5
Walia S, Kumar K, Kumar M (2022) Unveiling digital image forgeries using markov based quaternions in frequency domain and fusion of machine learning algorithms. Multimedia Tools Appl 82(3):4517–4532. https://doi.org/10.1007/s11042-022-13610-8
Bansal M, Kumar M, Kumar M (2021) Performance comparison of various feature extraction methods for object recognition on Caltech-101 Image dataset. In: Choudhary A, Agrawal AP, Logeswaran R, Unhelkar B (eds) Applications of artificial intelligence and machine learning. Springer, Singapore, pp 289–303
Bansal M, Kumar M, Sachdeva M, Mittal A (2021) Transfer learning for image classification using VGG19: Caltech-101 image data set. J Ambient Intell Humanized Comput 14:3609–3620
Shaheed K, Mao A, Qureshi I, Kumar M, Hussain S, Ullah I, Zhang X (2022) DS-CNN: A pre-trained xception model based on depth-wise separable convolutional neural network for finger vein recognition. Expert Syst Appl 191(C). https://doi.org/10.1016/j.eswa.2021.116288
Arnesia PD, Madenda S (2012) Matching images with textual document using TFIDF method. In: 2012 5th International Congress on Image and Signal Processing, pp 1283–1289. https://doi.org/10.1109/CISP.2012.6469720
Schneider F, Biemann C (2022) Golden retriever: A real-time multi-modal text-image retrieval system with the ability to focus. In: Proceedings of the 45th International ACM SIGIR conference on research and development in information retrieval. SIGIR ’22, Association for Computing Machinery, New York, NY, USA, pp 3245–3250. https://doi.org/10.1145/3477495.3531666
Xie Z, Liu L, Wu Y, Li L, Zhong L (2022) Learning TFIDF enhanced joint embedding for recipe-image cross-modal retrieval service. IEEE Trans Serv Comput 15(6):3304–3316. https://doi.org/10.1109/TSC.2021.3098834
Pavlopoulos J, Kougia V, Androutsopoulos I (2019) A survey on biomedical image captioning. In: Proceedings of the second workshop on shortcomings in vision and language. Association for Computational Linguistics, Minneapolis, Minnesota, pp 26–36. https://doi.org/10.18653/v1/W19-1803
Krishnan A, Rajesh S, SS S (2021) Text-based image retrieval using captioning. In: 2021 Fourth International Conference on Electrical, Computer and Communication Technologies (ICECCT), pp 1–5. https://doi.org/10.1109/ICECCT52121.2021.9616897
Masciari E, Moscato V, Picariello A, Sperlí G (2020) Detecting fake news by image analysis. In: Proceedings of the 24th Symposium on International Database Engineering & Applications. IDEAS ’20. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3410566.3410599
Choraś M, Demestichas K, Giełczyk A, Herrero Á, Ksieniewicz P, Remoundou K, Urda D, Woźniak M (2021) Advanced machine learning techniques for fake news (online disinformation) detection: A systematic mapping study. Appl Soft Comput 101:107050. https://doi.org/10.1016/j.asoc.2020.107050
Mangolin RB, Pereira RM, Britto AS, Silla CN, Feltrim VD, Bertolini D, Costa YMG (2022) A multimodal approach for multi-label movie genre classification. Multimedia Tools Appl 81(14):19071–19096. https://doi.org/10.1007/s11042-020-10086-2
Rajput NK, Grover BA (2022) A multi-label movie genre classification scheme based on the movie’s subtitles. Multimedia Tools Appl 81(22):32469–32490. https://doi.org/10.1007/s11042-022-12961-6
Giveki D (2021) Scale-space multi-view bag of words for scene categorization. Multimedia Tools Appl 80(1):1223–1245. https://doi.org/10.1007/s11042-020-09759-9
Kannao R, Guha P, Chaudhuri BB (2022) Only overlay text: novel features for TV news broadcast video segmentation. Multimedia Tools Appl 81(21):30493–30517. https://doi.org/10.1007/s11042-022-12917-w
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186. https://doi.org/10.18653/v1/N19-1423
von der Mosel J, Trautsch A, Herbold S (2023) On the validity of pre-trained transformers for natural language processing in the software engineering domain. IEEE Trans Softw Eng 49(4):1487–1507. https://doi.org/10.1109/TSE.2022.3178469
Robertson S (1974) Specificity and weighted retrieval. J Doc 30(1):41–46
Wong SKM, Yao Y (1992) An information-theoretic measure of term specificity. J Am Soc Inf Sci 43(1):54–61. https://doi.org/10.1002/(SICI)1097-4571(199201)43:1<54::AID-ASI5>3.0.CO;2-A
Robertson S, Jones KS (1976) Relevance weighting of search terms. J Am Soc Inf Sci 27(3):129–146. https://doi.org/10.1002/asi.4630270302
Manning CD, Raghavan P, Schütze H (2008) Introduction to Information Retrieval, pp 227–228. Cambridge University Press, New York, NY, USA. Chap. 11. https://doi.org/10.1017/CBO9780511809071
Wu HC, Luk RWP, Wong KF, Kwok KL (2008) Interpreting TF-IDF term weights as making relevance decisions. ACM Trans Inf Syst 26(3):13:1–13:37. https://doi.org/10.1145/1361684.1361686
Dua D, Graff C (2023) UCI Machine Learning Repository. https://archive-beta.ics.uci.edu
Oussama BK (2022) Cranfield collection in TREC XML format. GitHub. https://github.com/oussbenk/cranfield-trec-dataset
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830
Irvine A, Callison-Burch C (2017) A comprehensive analysis of bilingual lexicon induction. Comput Linguist 43(2):273–310. https://doi.org/10.1162/COLI_a_00284
Cao J, Zhang S (2014) A bayesian extension of the hypergeometric test for functional enrichment analysis. Biometrics 70(1):84–94. https://doi.org/10.1111/biom.12122
Cao J (2017) Bayesian functional enrichment analysis for the Reactome database. Stat Theory Relat Fields 1(2):185–193. https://doi.org/10.1080/24754269.2017.1387444
Fan R, Cui Q (2021) Toward comprehensive functional analysis of gene lists weighted by gene essentiality scores. Bioinformatics 37(23):4399–4404. https://doi.org/10.1093/bioinformatics/btab475
Onsjö M, Sheridan P (2020) Theme enrichment analysis: A statistical test for identifying significantly enriched themes in a list of stories with an application to the Star Trek television franchise. Digit Studies/le champ numérique 10(1):1. https://doi.org/10.16995/dscn.316
Ethics declarations
Conflict of interest/Competing interests
The authors declare no conflict of interest/competing interests.
About this article
Cite this article
Sheridan, P., Onsjö, M. The hypergeometric test performs comparably to TF-IDF on standard text analysis tasks. Multimed Tools Appl 83, 28875–28890 (2024). https://doi.org/10.1007/s11042-023-16615-z
DOI: https://doi.org/10.1007/s11042-023-16615-z