LIA at INEX 2010 Book Track

  • Romain Deveaud
  • Florian Boudin
  • Patrice Bellot
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6932)

Abstract

In this paper we describe our participation in the INEX 2010 Book Track and present our contributions. Digitized books are now a common source of information on the Web; however, OCR sometimes introduces errors that can penalize information retrieval. We propose a method for correcting hyphenations in the books and analyse its impact on the Best Books for Reference task. The observed improvement is around 1%.
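To illustrate the kind of hyphenation correction discussed above, the following is a minimal sketch only. The abstract does not give the details of the actual method, so the vocabulary lookup, the function name `correct_hyphenation` and the regular expression are all assumptions made for illustration: a word split by a hyphen at a line break is rejoined when the joined form is a known word, and the hyphen is kept otherwise.

```python
import re

# A word fragment ending in a hyphen at the end of a line, followed by the
# continuation fragment at the start of the next line.
_HYPHEN_BREAK = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def correct_hyphenation(text, vocabulary):
    """Rejoin words split by end-of-line hyphens (illustrative assumption).

    `vocabulary` is a hypothetical set of known lowercase words. When the
    joined form is in the vocabulary the hyphen is dropped; otherwise the
    hyphen is kept, since the word may be a genuine compound.
    """

    def _join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in vocabulary:
            return joined
        return left + "-" + right

    return _HYPHEN_BREAK.sub(_join, text)


if __name__ == "__main__":
    page = "infor-\nmation retrieval and well-\nknown"
    print(correct_hyphenation(page, {"information"}))
```

In this sketch the line break inside the split word is removed in both cases, so the corrected token can be indexed as a single term.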

This year we also experimented with different query expansion techniques. The first one consists of selecting informative words from a Wikipedia page related to the topic. The second one uses a dependency parser to detect phrases and enriches the query with them using a Markov Random Field model. We show that there is a significant improvement over the state of the art when using a large weighted list of Wikipedia words, while hyphenation correction has an impact on their distribution over the book corpus.
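The first expansion technique can be sketched as follows, under stated assumptions. The abstract does not specify how words are scored or how the expanded query is expressed, so the relative-frequency weighting, the small stopword list, the function names and the use of Indri-style `#weight` and `#combine` operators are all hypothetical choices made for illustration, not the method evaluated in the paper.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "of", "and", "a", "an", "in", "to", "is", "for", "on",
             "by", "with", "as"}


def weighted_expansion_terms(page_text, k=5):
    """Return the k most frequent non-stopword terms with relative weights."""
    tokens = [t for t in re.findall(r"[a-z]+", page_text.lower())
              if t not in STOPWORDS and len(t) > 2]
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(term, count / total) for term, count in counts.most_common(k)]


def build_weighted_query(original_terms, expansion, original_weight=0.5):
    """Combine the original query and weighted expansion terms.

    The output uses Indri-style operators; the 0.5/0.5 interpolation is an
    arbitrary default chosen for this sketch.
    """
    original = "#combine(" + " ".join(original_terms) + ")"
    if not expansion:
        return original
    expanded = "#weight(" + " ".join(
        f"{weight:.3f} {term}" for term, weight in expansion) + ")"
    return (f"#weight({original_weight:.2f} {original} "
            f"{1 - original_weight:.2f} {expanded})")


if __name__ == "__main__":
    page = "Retrieval of books. Book retrieval uses retrieval models."
    terms = weighted_expansion_terms(page, k=1)
    print(build_weighted_query(["book", "search"], terms))
```

The phrase-based Markov Random Field expansion is not covered by this sketch.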

Keywords

Retrieval Model, Query Term, Query Expansion, Mean Average Precision, Original Query
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Romain Deveaud (1)
  • Florian Boudin (1)
  • Patrice Bellot (1)
  1. Laboratoire Informatique d’Avignon, University of Avignon (CERI-LIA), Avignon Cedex 9, France
