Abstract
Few works in Information Retrieval (IR) have tackled the question of Information Retrieval System (IRS) effectiveness and efficiency in the context of scalability in corpus size.
We propose a general experimental methodology for studying the influence of scalability on IR models. The methodology is based on the construction of a collection in which a given characteristic C is the same regardless of the portion of the collection selected. This new collection, called uniform, can be split into sub-collections of growing size on which given properties can be studied.
We apply our methodology to WT10G (the TREC9 collection), taking the characteristic C to be the distribution of relevant documents over the collection. We build a uniform WT10G, sample it into sub-collections of increasing size, and use these sub-collections to study the impact of corpus volume increase on standard IRS evaluation measures (recall/precision, high precision).
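The construction described above can be sketched in code. This is an illustrative reading, not the authors' actual procedure: `build_uniform_collection` and its interleaving strategy are assumptions. The idea is to order documents so that the proportion of relevant documents (the characteristic C) stays roughly constant in every contiguous slice, so that nested sub-collections of growing size all share it.

```python
import random

def build_uniform_collection(docs, relevant_ids, n_slices, seed=0):
    """Order docs so every prefix keeps (roughly) the global proportion
    of relevant documents, then cut nested sub-collections of growing
    size. Hypothetical sketch of the 'uniform collection' idea."""
    rng = random.Random(seed)
    rel = [d for d in docs if d in relevant_ids]
    non = [d for d in docs if d not in relevant_ids]
    rng.shuffle(rel)
    rng.shuffle(non)
    ratio = len(rel) / len(docs)  # global proportion of relevant docs
    uniform, ri, ni = [], 0, 0
    for k in range(len(docs)):
        # emit a relevant doc whenever we are behind the target ratio
        if ri < len(rel) and (ri < ratio * (k + 1) or ni >= len(non)):
            uniform.append(rel[ri])
            ri += 1
        else:
            uniform.append(non[ni])
            ni += 1
    # nested sub-collections: first 1/n, first 2/n, ... of the docs
    step = len(uniform) // n_slices
    return [uniform[: step * (i + 1)] for i in range(n_slices)]
```

Because the sub-collections are prefixes of one ordering, each smaller sample is contained in the larger ones, so effectiveness measures can be compared across sizes without the relevant-document density changing.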
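The evaluation measures named above (recall/precision, high precision) can be computed per sub-collection with a routine like the following. This is a minimal standard-definition sketch; the function name and inputs are assumptions, not the paper's code.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall over the top-k of a ranked result list.
    'High precision' evaluation corresponds to small cutoffs k."""
    top = ranked_ids[:k]
    hits = sum(1 for d in top if d in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
```

Running this at several cutoffs on each sub-collection of growing size gives the curves needed to observe how effectiveness evolves with corpus volume.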
© 2005 Springer-Verlag Berlin Heidelberg
Imafouo, A., Beigbeder, M. (2005). Scalability Influence on Retrieval Models: An Experimental Methodology. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_28
Print ISBN: 978-3-540-25295-5
Online ISBN: 978-3-540-31865-1