
Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9626)

Abstract

The Open-Source IR Reproducibility Challenge brought together developers of open-source search engines to provide reproducible baselines of their systems in a common environment on Amazon EC2. The product is a repository that contains all code necessary to generate competitive ad hoc retrieval baselines, such that with a single script, anyone with a copy of the collection can reproduce the submitted runs. Our vision is that these results would serve as widely accessible points of comparison in future IR research. This project represents an ongoing effort, but we describe the first phase of the challenge that was organized as part of a workshop at SIGIR 2015. We have succeeded modestly so far, achieving our main goals on the Gov2 collection with seven open-source search engines. In this paper, we describe our methodology, share experimental results, and discuss lessons learned as well as next steps.
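To make the single-script workflow concrete, here is a minimal sketch of how such a reproduction driver could be organized, written in Python. It is an illustration under stated assumptions, not the repository's actual code: the engine list, the systems/<engine>/run.sh convention, and the file paths are all hypothetical, and trec_eval is the standard TREC evaluation tool assumed to be on the PATH.

    import subprocess
    from pathlib import Path

    # Illustrative list of participating open-source engines; the actual
    # repository defines its own layout and per-system scripts.
    ENGINES = ["atire", "galago", "indri", "jass", "lucene", "mg4j", "terrier"]

    COLLECTION = Path("/data/gov2")      # local copy of the Gov2 collection
    QRELS = Path("qrels/gov2.qrels")     # TREC Terabyte track relevance judgments

    def reproduce(engine: str) -> None:
        """Index Gov2, run the topics, and score the run for one engine."""
        run_file = Path(f"runs/{engine}.run")
        # Assumed convention: each system ships a self-contained script that
        # builds its index, runs the topics, and writes a TREC-format run file.
        subprocess.run(
            ["bash", f"systems/{engine}/run.sh", str(COLLECTION), str(run_file)],
            check=True,
        )
        # Score the run against the relevance judgments with trec_eval.
        subprocess.run(["trec_eval", str(QRELS), str(run_file)], check=True)

    if __name__ == "__main__":
        for engine in ENGINES:
            reproduce(engine)

In the actual repository (see note 1 below), each system contributes its own scripts; the point of the sketch is only that indexing, retrieval, and evaluation are driven end to end from a single entry point.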


Notes

  1. https://github.com/lintool/IR-Reproducibility/.

  2. We used the r3.4xlarge instance type (16 vCPUs, 122 GiB memory) running Ubuntu Server 14.04 LTS (HVM).
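As a concrete illustration of the environment described in note 2, the following is a minimal sketch of provisioning the same instance type with boto3, the AWS SDK for Python. The AMI ID, key pair name, and region are placeholders, not values taken from the paper.

    import boto3  # AWS SDK for Python

    ec2 = boto3.resource("ec2", region_name="us-east-1")

    instances = ec2.create_instances(
        ImageId="ami-xxxxxxxx",      # placeholder: Ubuntu Server 14.04 LTS (HVM) AMI
        InstanceType="r3.4xlarge",   # 16 vCPUs, 122 GiB memory
        KeyName="my-key-pair",       # placeholder SSH key pair
        MinCount=1,
        MaxCount=1,
    )
    print(instances[0].id)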


Acknowledgments

This work was supported in part by the U.S. National Science Foundation under award IIS-1218043 and by Amazon Web Services. Any opinions, findings, conclusions, or recommendations expressed are those of the authors and do not necessarily reflect the views of the sponsors.

Author information

Correspondence to Jimmy Lin.


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Lin, J. et al. (2016). Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. In: Ferro, N., et al. (eds.) Advances in Information Retrieval. ECIR 2016. Lecture Notes in Computer Science, vol. 9626. Springer, Cham. https://doi.org/10.1007/978-3-319-30671-1_30


  • DOI: https://doi.org/10.1007/978-3-319-30671-1_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30670-4

  • Online ISBN: 978-3-319-30671-1

  • eBook Packages: Computer Science, Computer Science (R0)
