Knowledge and Information Systems, Volume 39, Issue 2, pp 329–349

Combining compound and single terms under language model framework

  • Arezki Hammache
  • Mohand Boughanem
  • Rachid Ahmed-Ouamer
Regular Paper

Abstract

Most existing information retrieval (IR) models, including probabilistic and vector space models, are based on the term independence hypothesis. To go beyond this assumption and thereby capture the semantics of documents and queries more accurately, several works have incorporated phrases or other syntactic information into IR; such attempts have shown slight benefit at best. In language modeling approaches in particular, this extension is achieved through the use of bigram or n-gram models. However, in these models all bigrams/n-grams are considered and weighted uniformly. In this paper we introduce a new approach to select and weight the relevant n-grams associated with a document. Experimental results on three TREC test collections show an improvement over three strong state-of-the-art baselines: the original unigram language model, the Markov Random Field model, and the positional language model.
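To make the uniform weighting mentioned above concrete, the following is a minimal sketch of a generic query-likelihood scorer, not the method proposed in the paper. It assumes Dirichlet-smoothed unigram probabilities and a single interpolation weight applied to every adjacent query bigram; the function names and the parameters `mu` and `lam` are illustrative assumptions.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def score(query, doc, coll_tf, coll_len, mu=2000.0, lam=0.0):
    """Log query likelihood of `query` given `doc`.

    lam = 0 gives a Dirichlet-smoothed unigram model. lam > 0 mixes in a
    maximum-likelihood bigram estimate, with the SAME weight for every
    bigram (the uniform treatment the abstract criticises).
    All names and defaults here are hypothetical, chosen for illustration.
    """
    doc_tf = Counter(doc)
    doc_bf = Counter(bigrams(doc))
    total, prev = 0.0, None
    for term in query:
        # Dirichlet-smoothed unigram probability P(term | doc).
        p_coll = coll_tf.get(term, 0) / coll_len
        p = (doc_tf[term] + mu * p_coll) / (len(doc) + mu)
        if prev is not None and lam > 0.0:
            # Maximum-likelihood bigram probability P(term | prev, doc).
            p_bi = doc_bf[(prev, term)] / doc_tf[prev] if doc_tf[prev] else 0.0
            p = lam * p_bi + (1.0 - lam) * p
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
        prev = term
    return total


# Two toy documents with identical bags of words but different word order.
d1 = "information retrieval with language model".split()
d2 = "retrieval model with language information".split()
coll_tf = Counter(d1 + d2)
coll_len = len(d1) + len(d2)
query = ["information", "retrieval"]

print(score(query, d1, coll_tf, coll_len), score(query, d2, coll_tf, coll_len))
print(score(query, d1, coll_tf, coll_len, lam=0.5),
      score(query, d2, coll_tf, coll_len, lam=0.5))
```

With `lam = 0` the two toy documents are indistinguishable because they contain the same terms; with `lam > 0` the document containing the query bigram is preferred. Because every bigram receives the same weight `lam`, the sketch cannot favour informative compounds over incidental word pairs, which is the limitation the abstract identifies.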

Keywords

Compound term weighting, Term dominance, Information retrieval, Language model

Notes

Acknowledgments

We thank the editor and anonymous reviewers for their very useful comments and suggestions.

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Arezki Hammache (1)
  • Mohand Boughanem (3)
  • Rachid Ahmed-Ouamer (2)

  1. Department of Computer Science, University of Mouloud Mammeri, Tizi-Ouzou, Algeria
  2. LARI Laboratory, Department of Computer Science, University of Mouloud Mammeri, Tizi-Ouzou, Algeria
  3. IRIT Laboratory, University of Paul Sabatier, Toulouse Cedex 09, France
