Efficient Evaluation of Partial Match Queries for XML Documents Using Information Retrieval Techniques

  • Young-Ho Park
  • Kyu-Young Whang
  • Byung Suk Lee
  • Wook-Shin Han
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3453)

Abstract

We propose XIR, a novel method for processing partial match queries on heterogeneous XML documents using information retrieval (IR) techniques. A partial match query is defined as the one having the descendent-or-self axis “//” in its path expression. In its general form, a partial match query has branch predicates forming branching paths. The objective of XIR is to efficiently support this type of queries for large-scale documents of heterogeneous schemas. XIR has its basis on the conventional schema-level methods using relational tables and significantly improves their efficiency using two techniques: an inverted index technique and a novel prefix match join. The former indexes the labels in label paths as keywords in texts, and allows for finding the label paths matching the queries more efficiently than string match used in the conventional methods. The latter supports branching path expressions, and allows for finding the result nodes more efficiently than containment joins used in the conventional methods. We compare the efficiency of XIR with those of XRel and XParent using XML documents crawled from the Internet. The results show that XIR is more efficient than both XRel and XParent by several orders of magnitude for linear path expressions, and by several factors for branching path expressions.

Keywords

Query Processing Inverted Index Query Pattern Path Expression Very Large Data Base 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Aboulnaga, A., Alameldeen, A.R., Naughton, J.: Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. In: Proc. the 27th Int’l Conf. on Very Large Data Bases(VLDB), Rome, Italy, September 11-14 , pp. 591–600 (2001)Google Scholar
  2. 2.
    Amer-Yahia, S., Cho, S., Lakshmanan, L.V.S., Srivastava, D.: Minimization of Tree Pattern Queries. In: Proc. 2001 ACM SIGMOD Int’l Conf. on Management of Data, Santa Barbara, California, May 21-24 , pp. 497–508.(2001)Google Scholar
  3. 3.
    Altinel, M., Franklin, M.J.: Efficient Filtering of XML Documents for Selective Dissemination of Information. In: Proc. the 26th Int’l Conf. on Very Large Data Bases(VLDB), Cairo, Egypt, September 10-14, pp. 53–64 (2000)Google Scholar
  4. 4.
    Al-Khalifa, S., Jagadish, H.V., Koudas, N., Patel, J.M.: Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In: Proc. the 18th Int’l Conf. on Data Engineering (ICDE), San Jose, California, February 26 - March 1, pp. 141–152 (2002)Google Scholar
  5. 5.
    Bremer, J.-M., Gertz, M.: XQuery/IR: Integrating XML Document and Data Retrieval. In: Proc. the Fifth Int’l Workshop on the Web and Databases (WebDB 2002), Madison, Wisconsin, pp. 1–6 (2002)Google Scholar
  6. 6.
    Bruno, N., Koudas, N., Srivastava, D.: Holistic Twig Joins: Optimal XML Pattern Matching. In: Proc. 2002 ACM SIGMOD Int’l Conf. on Management of Data, Madison, Wisconsin, June 3-6, pp. 310–321 (2002)Google Scholar
  7. 7.
    Clark, J., DeRose, S.: XML Path Language (XPath), W3C Recommendation (November 1999), http://www.w3.org/TR/xpath
  8. 8.
    Cooper, B.F., Sample, N., Franklin, M.J., Hjaltason, G.R., Shadmon, M.: A Fast Index for Semistructured Data. In: Proc. the 27th Int’l Conf. on Very Large Data Bases(VLDB), Rome, Italy, September 11-14, pp. 341–350 (2001)Google Scholar
  9. 9.
    Florescu, D., Kossmann, D., Manolescu, I.: Integrating Keyword Search into XML Query Processing. In: Proc. the 9th WWW Conference/Computer Networks, Amsterdam, NL (May 2000), pp. 119–135 (2000)Google Scholar
  10. 10.
    Guo, L., Shao, F., Botev, C., Shanmugasundaram, J.: XRANK: Ranked Keyword Search over XML Documents. In: Proc. 2003 ACM SIGMOD Int’l Conf. on Management of Data, San Diego, California, June 9-12, pp. 16–27 (2003)Google Scholar
  11. 11.
    Goldman, R., Widom, J.: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In: Proc. the 23th Int’l Conf. on Very Large Data Bases(VLDB), Athens, Greece, August 26-29, pp. 436–445 (1997)Google Scholar
  12. 12.
    Halverson, A., Burger, J., Galanis, L., Kini, A., Krishnamurthy, R., Rao, A.N., Tian, F., Viglas, S., Wang, Y., Naughton, J.F., DeWitt, D.J.: Mixed Mode XML Query Processing. In: Proc. the 29th Int’l Conf. on Very Large Data Bases(VLDB), Berlin, Germany, September 9-12, pp. 225–236 (2003)Google Scholar
  13. 13.
    Ives, Z., Levy, A., Weld, D.: Efficient Evaluation of Regular Path Expressions on Streaming XML Data, Technical Report UW-CSE-2000-05-02, University of Washington (2000)Google Scholar
  14. 14.
    Jiang, H., Lu, H., Wang, W., Xu Yu, J.: Path Materialization Revisited: An Efficient Storage Model for XML Data. In: Proc. the 13th Australasian Database Conference (ADC), Melbourne, Australia, January 28 - February 1, pp. 85–94 (2002)Google Scholar
  15. 15.
    Jiang, H., Lu, H., Wang, W., Xu Yu, J.: XParent: An Efficient RDBMS-Based XML Database System. In: Proc. the 18th Int’l Conf. on Data Engineering (ICDE), San Jose, California, February 26 - March 1, pp. 335–336 (2002)Google Scholar
  16. 16.
    Jiang, H., Wang, W., Lu, H., Yu, J.X.: Holistic Twig Joins on Indexed XML Documents. In: Proc. the 29th Int’l Conf. on Very Large Data Bases(VLDB), Berlin, Germany, September 9-12, pp. 273–284 (2003)Google Scholar
  17. 17.
    Li, Q., Moon, B.: Indexing and Querying XML Data for Regular Path Expressions. In: Proc. the 27th Int’l Conf. on Very Large Data Bases(VLDB), Rome, Italy, September 11-14, pp. 361–370 (2001)Google Scholar
  18. 18.
    Mandreoli, F., Martoglia, R., Tiberio, P.: Searching Similar (Sub)Sentences for Example-Based Machine Translation. In: Proc. 2002 Italian Symposium on Sistemi Evoluti per Basi di Dati (SEBD 2002), Isola d’Elba, Italy, (June 2002)Google Scholar
  19. 19.
    McHugh, J., Widom, J.: Query Optimization for XML. In: Proc. the 25th Int’l Conf. on Very Large Data Bases (VLDB), Edinburgh, Scotland, UK, September 7-10, pp. 315–326 (1999)Google Scholar
  20. 20.
    Naughton, J., et al.: The Niagara Internet Query System. IEEE Data Engineering Bulletin 24(2), 27–33 (2001)Google Scholar
  21. 21.
    Polyzotis, N., Garofalakis, M.: Statistical Synopses for Graph-structured XML Databases. In: Proc. 2002 ACM SIGMOD Int’l Conf. on Management of Data, Madison, Wisconsin, June 3-6, pp. 358–369 (2002)Google Scholar
  22. 22.
    Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)MATHGoogle Scholar
  23. 23.
    Tatarinov, I., Viglas, S., Beyer, K.S., Shanmugasundaram, J., Shekita, E.J., Zhang, C.: Storing and Querying Ordered XML Using a Relational Database System. In: Proc. 2002 ACM SIGMOD Int’l Conf. on Management of Data, Madison, Wisconsin, June 3-6, pp. 204–215 (2002)Google Scholar
  24. 24.
    Yoshikawa, M., Amagasa, T., Shimura, T., Uemura, S.: XRel: A Path-based Approach to Storage and Retrieval of XML Documents using Relational Databases. ACM Trans. on Internet Technology(TOIT) 1(1), 110–141 (2001)CrossRefGoogle Scholar
  25. 25.
    Park, Y., Whang, K.-Y., Lee, B., Han, W.: Efficient Evaluation of Partial Match Queries for XML Documents Using Information Retrieval Techniques, Technical Report CS-TR-2004-212, Department of Computer Science, KAIST (December 2004), Also, available on AITrc Technical Report No. 04-11-048, http://aitrc.kaist.ac.kr/util_tr.htm (December 28, 2004)
  26. 26.
    Zhang, C., Naughton, J.F., DeWitt, D.J., Luo, Q., Lohmann, G.M.: On Supporting Containment Queries in Relational Database Management Systems. In: Proc. 2001 ACM SIGMOD Int’l Conf. on Management of Data, Santa Barbara, California, May 21-24, pp. 425–436 (2001)Google Scholar
  27. 27.
    ReGet Deluxe 3.3 Beta (build 173), http://deluxe.reget.com/en/
  28. 28.
  29. 29.
  30. 30.
    eXtensible Markup Language(XML), http://www.w3.org/XML/

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Young-Ho Park
    • 1
  • Kyu-Young Whang
    • 1
  • Byung Suk Lee
    • 2
  • Wook-Shin Han
    • 3
  1. 1.Department of Computer Science and Advanced Information Technology Research Center (AITrc)Korea Advanced Institute of Science and Technology (KAIST)Korea
  2. 2.Department of Computer ScienceUniversity of VermontBurlingtonUSA
  3. 3.Department of Computer EngineeringKyungpook National UniversityKorea

Personalised recommendations