Term frequency with average term occurrences for textual information retrieval

  • Focus
Soft Computing

Abstract

In information retrieval (IR) from text documents, the term-weighting scheme (TWS) is a key component of the matching mechanism in the vector space model. In this paper, we propose a new TWS that computes the average term occurrences of terms in documents and uses a discriminative approach, based on the document centroid vector, to remove less significant weights from the documents. We call our approach Term Frequency With Average Term Occurrence (TF-ATO). An analysis of commonly used document collections shows that test collections are not fully judged, because achieving that is expensive and may be infeasible for large collections. A document collection being fully judged means that every document in the collection acts as a relevant document to a specific query or a group of queries. The discriminative approach used in our proposal is a heuristic method for improving IR effectiveness and performance, and it has the advantage of not requiring prior knowledge of relevance judgements. We compare the performance of the proposed TF-ATO with the well-known TF-IDF approach and show that TF-ATO gives better effectiveness in both static and dynamic document collections. In addition, this paper investigates the impact that stop-word removal and our discriminative approach have on TF-IDF and TF-ATO. The results show that both stop-word removal and the discriminative approach have a positive effect on both term-weighting schemes. More importantly, the proposed discriminative approach improves IR effectiveness and performance with no information on the relevance judgements for the collection.
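
The abstract does not state the weighting formula, so the following Python sketch is only one plausible reading of it, not the authors' definition. It assumes that a term's weight is its count in a document divided by that document's average count per distinct term, and that the discriminative step drops any weight that falls below the matching component of the centroid (the per-term mean weight over all documents). The function names `tf_ato`, `centroid` and `discriminate` are hypothetical.

```python
from collections import Counter, defaultdict


def tf_ato(doc_tokens):
    """Assumed TF-ATO weights for one tokenised document.

    Assumption: weight = term count / average count per distinct term.
    """
    counts = Counter(doc_tokens)
    if not counts:
        return {}
    # Average term occurrence: total tokens over number of distinct terms.
    ato = sum(counts.values()) / len(counts)
    return {term: tf / ato for term, tf in counts.items()}


def centroid(weighted_docs):
    """Per-term mean weight over all documents (a missing term counts as 0)."""
    totals = defaultdict(float)
    for weights in weighted_docs:
        for term, w in weights.items():
            totals[term] += w
    n = len(weighted_docs)
    return {term: total / n for term, total in totals.items()}


def discriminate(weighted_docs):
    """Assumed discriminative step: keep only weights at or above the centroid."""
    c = centroid(weighted_docs)
    return [
        {term: w for term, w in weights.items() if w >= c[term]}
        for weights in weighted_docs
    ]


# Toy usage on two tiny documents.
docs = [["a", "a", "b"], ["a", "c"]]
weighted = [tf_ato(d) for d in docs]
filtered = discriminate(weighted)
```

Under these assumptions the filtering needs only the collection itself, which matches the abstract's point that no relevance judgements are required.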

Author information

Corresponding author

Correspondence to O. Ali Sadek Ibrahim.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by D. Neagu.

Appendix: Detailed experimental results of Section 4.2

See Tables 10, 11, 12, 13 and 14 in the appendix.

Table 12 Average recall-precision results obtained on the FBIS collection for each case in the experiments

Table 13 Average recall-precision results obtained on the Cranfield collection for each case in the experiments

Table 14 Average recall-precision results obtained on the CISI collection for each case in the experiments

The cases studied in these results are as follows:

  • Case 1: applying the term-weighting scheme with neither stop-word removal nor the discriminative approach.

  • Case 2: applying the term-weighting scheme with stop-word removal but without the discriminative approach.

  • Case 3: applying the term-weighting scheme without stop-word removal but with the discriminative approach.

  • Case 4: applying the term-weighting scheme with both stop-word removal and the discriminative approach.
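
The four cases form a 2×2 grid over two independent switches. As a minimal sketch, they can be written as a lookup table; the flag names `remove_stop_words` and `discriminative` and the helper `describe` are hypothetical and are not taken from the paper.

```python
# Case number -> experimental switches, following the list above.
# Flag names are illustrative assumptions.
CASES = {
    1: {"remove_stop_words": False, "discriminative": False},
    2: {"remove_stop_words": True, "discriminative": False},
    3: {"remove_stop_words": False, "discriminative": True},
    4: {"remove_stop_words": True, "discriminative": True},
}


def describe(case):
    """Return a short label for a case number."""
    flags = CASES[case]
    parts = [name for name, enabled in flags.items() if enabled]
    return " + ".join(parts) if parts else "baseline"
```

Each term-weighting scheme (TF-IDF and TF-ATO) is then evaluated once per case.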

Cite this article

Ibrahim, O.A.S., Landa-Silva, D. Term frequency with average term occurrences for textual information retrieval. Soft Comput 20, 3045–3061 (2016). https://doi.org/10.1007/s00500-015-1935-7

  • DOI: https://doi.org/10.1007/s00500-015-1935-7