The VLDB Journal, Volume 15, Issue 1, pp 53–83

Integrating document and data retrieval based on XML

Regular Paper

Abstract

For querying structured and semistructured data, data retrieval and document retrieval are two valuable and complementary techniques that have not yet been fully integrated. In this paper, we introduce integrated information retrieval (IIR), an XML-based retrieval approach that closes this gap. We introduce the syntax and semantics of an extension of the XQuery language called XQuery/IR. The extended language realizes IIR and thereby allows users to formulate new kinds of queries by nesting ranked document retrieval and precise data retrieval queries. Furthermore, we detail index structures and efficient query processing approaches for implementing XQuery/IR. Based on a new identification scheme for nodes in node-labeled tree structures, the extended index structures require only a fraction of the space of comparable index structures that only support data retrieval.

Keywords

Integrated information retrieval, Data retrieval, Document retrieval, XML, Index structures, Structural join

Copyright information

© Springer-Verlag 2006

Authors and Affiliations

  1. Department of Computer Science, University of California at Davis, Davis, USA
