Information Retrieval, Volume 8, Issue 4, pp 631–654

The Importance of Length Normalization for XML Retrieval

  • Jaap Kamps
  • Maarten de Rijke
  • Börkur Sigurbjörnsson
Article

Abstract

XML retrieval departs from standard document retrieval in that each individual XML element, ranging from italicized words or phrases to full-blown articles, is a retrievable unit. The distribution of XML element lengths is unlike what we usually observe in standard document collections, prompting us to revisit the issue of document length normalization. We perform a comparative analysis of arbitrary elements versus relevant elements, and show the importance of element length as a parameter for XML retrieval. Within the language modeling framework, we investigate a range of techniques that deal with length either directly or indirectly. We observe a length bias introduced by the amount of smoothing, and show the importance of an extreme length bias for XML retrieval. We also show that simply removing shorter elements from the index (by introducing a cut-off value) does not create an appropriate element length normalization. Even after restricting the minimal size of XML elements occurring in the index, the importance of an extreme explicit length bias remains.
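To make the interplay of smoothing, explicit length bias, and an index cut-off concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linearly interpolated (Jelinek-Mercer style) unigram language model and a length prior proportional to the element length raised to a power; the function name `score` and the parameters `lam`, `beta`, and `min_len` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def score(query, element, collection, lam=0.5, beta=0.0, min_len=0):
    """Log-score of one element for a query (illustrative assumptions only).

    lam     -- weight on the element model; (1 - lam) goes to the collection model
    beta    -- exponent of the assumed length prior |element| ** beta
    min_len -- assumed index cut-off: shorter elements are not scored at all
    """
    if not element or len(element) < min_len:
        return None  # element is treated as absent from the index

    tf = Counter(element)      # term frequencies in the element
    cf = Counter(collection)   # term frequencies in the whole collection

    # Explicit length bias: beta = 0 means no prior, larger beta favors long elements.
    log_p = beta * math.log(len(element))

    for term in query:
        p_elem = tf[term] / len(element)
        p_coll = cf[term] / len(collection)
        p = lam * p_elem + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen everywhere
        log_p += math.log(p)
    return log_p


# Toy data: a one-word element (e.g. an italicized phrase) and a longer element.
short_el = ["xml"]
long_el = ["xml", "retrieval", "ranks", "xml", "elements",
           "of", "any", "length", "xml", "index"]
collection = short_el + long_el
query = ["xml"]

for b in (0.0, 1.0):
    print(b, score(query, short_el, collection, beta=b),
          score(query, long_el, collection, beta=b))
```

In this toy setting the one-word element outranks the longer one when `beta` is 0, and the order flips once a length prior is applied. Setting `min_len` removes very short elements from consideration but leaves the relative scores of the remaining elements unchanged, which is one way to read the abstract's remark that a cut-off alone is not a length normalization.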

Keywords

XML retrieval, language models, length normalization, smoothing

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  • Jaap Kamps (1, 2)
  • Maarten de Rijke (1)
  • Börkur Sigurbjörnsson (1)

  1. Informatics Institute, University of Amsterdam, Amsterdam
  2. Archives and Information Studies, Faculty of Humanities, University of Amsterdam, Amsterdam
