Information Retrieval, Volume 8, Issue 4, pp 571–600

Hybrid XML Retrieval: Combining Information Retrieval and a Native XML Database

  • Jovan Pehcevski
  • James A. Thom
  • Anne-Marie Vercoustre

Abstract

This paper investigates the impact of three approaches to XML retrieval: using Zettair, a full-text information retrieval system; using eXist, a native XML database; and using a hybrid system that takes full article answers from Zettair and uses eXist to extract elements from those articles. For the content-only topics, we undertake a preliminary analysis of the INEX 2003 relevance assessments in order to identify the types of highly relevant document components. Further analysis identifies two complementary sub-cases of relevance assessments (General and Specific) and two categories of topics (Broad and Narrow). We develop a novel retrieval module that, for a content-only topic, utilises the information from the resulting answer list of a native XML database and dynamically determines the preferable units of retrieval, which we call Coherent Retrieval Elements. The results of our experiments show that, when each of the three systems is evaluated against different retrieval scenarios (such as different cases of relevance assessments, different topic categories and different choices of evaluation metrics), the XML retrieval systems exhibit varying behaviour, and the best performance is reached for different values of the retrieval parameters. In the case of INEX 2003 relevance assessments for the content-only topics, our newly developed hybrid XML retrieval system is substantially more effective than either Zettair or eXist, and yields robust and very effective XML retrieval.
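The two-stage idea behind the hybrid system (rank whole articles first, then extract elements only from the top-ranked articles) can be illustrated with a minimal sketch. This is not the authors' implementation and it uses neither Zettair nor eXist: the term-frequency scorer, the leaf-element filter, the toy `ARTICLES` collection and all function names are assumptions made purely for illustration, and the selection of Coherent Retrieval Elements is not modelled.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy collection (hypothetical data, not taken from the INEX test collection).
ARTICLES = {
    "a1": "<article><sec><p>XML retrieval with a native XML database</p>"
          "<p>Unrelated text</p></sec></article>",
    "a2": "<article><sec><p>Cooking pasta at home</p></sec></article>",
}


def tokens(text):
    """Lower-case whitespace tokenisation; a deliberately naive assumption."""
    return text.lower().split()


def rank_articles(query, articles, top_k=1):
    """Stage 1, a stand-in for a full-text engine: score whole articles
    by how often the query terms occur in their text."""
    terms = set(tokens(query))
    scored = []
    for doc_id, xml_text in articles.items():
        words = Counter(tokens(" ".join(ET.fromstring(xml_text).itertext())))
        score = sum(words[t] for t in terms)
        if score > 0:
            scored.append((-score, doc_id))
    # Highest score first; ties broken by article id for determinism.
    return [doc_id for _, doc_id in sorted(scored)[:top_k]]


def extract_elements(query, xml_text):
    """Stage 2, a stand-in for an XML database: return the leaf elements
    of one article whose text contains at least one query term."""
    terms = set(tokens(query))
    hits = []
    for elem in ET.fromstring(xml_text).iter():
        text = elem.text or ""
        if len(elem) == 0 and terms & set(tokens(text)):
            hits.append((elem.tag, text))
    return hits


def hybrid_search(query, articles, top_k=1):
    """Extract elements only from the articles that stage 1 ranked highest."""
    results = []
    for doc_id in rank_articles(query, articles, top_k):
        for tag, text in extract_elements(query, articles[doc_id]):
            results.append((doc_id, tag, text))
    return results
```

Restricting element extraction to the top-ranked articles keeps the second stage small; how the real system orders elements and chooses the preferred unit of retrieval is described in the paper itself and is not reproduced here.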

Keywords

XML information retrieval; XML databases; eXist; Zettair; INEX

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  • Jovan Pehcevski (1)
  • James A. Thom (1)
  • Anne-Marie Vercoustre (2)
  1. RMIT University, Melbourne, Australia
  2. INRIA, Rocquencourt, France