Combining compound and single terms under language model framework

Hammache, Arezki; Boughanem, Mohand; Ahmed-Ouamer, Rachid

doi:10.1007/s10115-013-0618-x

Regular Paper
Published: 08 March 2013

Volume 39, pages 329–349, (2014)

Knowledge and Information Systems

Arezki Hammache¹,
Mohand Boughanem³ &
Rachid Ahmed-Ouamer²

Abstract

Most existing information retrieval (IR) models, including probabilistic and vector space models, are based on the term independence hypothesis. To go beyond this assumption and thereby capture the semantics of documents and queries more accurately, several works have incorporated phrases or other syntactic information into IR; such attempts have shown slight benefit at best. In language modeling approaches in particular, this extension is achieved through bigram or n-gram models. However, these models consider all bigrams/n-grams and weight them uniformly. In this paper we introduce a new approach to select and weight relevant n-grams associated with a document. Experimental results on three TREC test collections showed an improvement over three strong state-of-the-art baselines: the original unigram language model, the Markov Random Field model, and the positional language model.

Acknowledgments

We thank the editor and anonymous reviewers for their very useful comments and suggestions.

Author information

Authors and Affiliations

Department of Computer Science, University of Mouloud Mammeri, 15000 Tizi-Ouzou, Algeria
Arezki Hammache
LARI Laboratory, Department of Computer Science, University of Mouloud Mammeri, 15000 Tizi-Ouzou, Algeria
Rachid Ahmed-Ouamer
IRIT Laboratory, University of Paul Sabatier, 118 route de Narbonne, 31062 Toulouse Cedex 09, France
Mohand Boughanem

Corresponding author

Correspondence to Arezki Hammache.

About this article

Cite this article

Hammache, A., Boughanem, M. & Ahmed-Ouamer, R. Combining compound and single terms under language model framework. Knowl Inf Syst 39, 329–349 (2014). https://doi.org/10.1007/s10115-013-0618-x

Received: 07 March 2012
Revised: 01 February 2013
Accepted: 17 February 2013
Published: 08 March 2013
Issue Date: May 2014
DOI: https://doi.org/10.1007/s10115-013-0618-x
