Estimating Answer Sizes for XML Queries

  • Conference paper
Advances in Database Technology — EDBT 2002 (EDBT 2002)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2287))

Abstract

Estimating the sizes of query results, and intermediate results, is crucial to many aspects of query processing. In particular, it is necessary for effective query optimization. Even at the user level, predictions of the total result size can be valuable in “next-step” decisions, such as query refinement. This paper proposes a technique to obtain query result size estimates effectively in an XML database.

Queries in XML frequently specify structural patterns, requiring specific relationships between selected elements. Whereas traditional techniques can estimate the number of nodes (XML elements) that will satisfy a node-specific predicate in the query pattern, such estimates cannot easily be combined to provide estimates for the entire query pattern, since element occurrences are expected to have high correlation.
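The correlation problem can be seen on a toy document. The sketch below is an illustration only, not taken from the paper: the document, the tag names, and the independence-style estimate (per-tag counts scaled by the overall fraction of node pairs that stand in an ancestor-descendant relationship) are all assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document: every <name> sits under an <author>.
DOC = """
<lib>
  <book><author><name/></author><author><name/></author></book>
  <book><author><name/></author></book>
  <journal><editor/></journal>
</lib>
""".strip()


def ancestor_descendant_pairs(root):
    """Return every (ancestor, descendant) element pair in the tree."""
    pairs = []

    def walk(node, ancestors):
        for anc in ancestors:
            pairs.append((anc, node))
        for child in node:
            walk(child, ancestors + [node])

    walk(root, [])
    return pairs


root = ET.fromstring(DOC)
nodes = list(root.iter())
pairs = ancestor_descendant_pairs(root)

# Node-level statistics: how many elements satisfy each predicate.
n_author = sum(1 for n in nodes if n.tag == "author")
n_name = sum(1 for n in nodes if n.tag == "name")

# Independence-style estimate (an assumption of this sketch): treat the
# structural relationship as unrelated to the tags of the two nodes.
p_related = len(pairs) / (len(nodes) * (len(nodes) - 1))
naive_estimate = n_author * n_name * p_related

# True size of the pattern "author with a name descendant".
actual = sum(1 for a, d in pairs if a.tag == "author" and d.tag == "name")

print(f"naive estimate: {naive_estimate:.2f}, actual: {actual}")
```

Because the two tags always occur together in this document, the independence-style figure falls well short of the true count, which is the kind of error that per-node estimates cannot correct on their own.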

We propose a solution based on a novel histogram encoding of element occurrence position. With such position histograms, we are able to obtain estimates of sizes for complex pattern queries, as well as for simpler intermediate patterns that may be evaluated in alternative query plans, by means of a position histogram join (pH-join) algorithm that we introduce. We extend our technique to exploit schema information regarding allowable structure (the no-overlap property) through the use of a coverage histogram.
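As a rough illustration of the idea, the following sketch assumes that each element is labeled with a (start, end) pair from a depth-first numbering, and that a position histogram is a grid of counts over those pairs. These assumptions, the function names, and the toy document are hypothetical. The code does not reproduce the pH-join algorithm; it only brackets the size of an ancestor-descendant join using grid counts alone.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical toy document used only for this sketch.
DOC = """
<lib>
  <book><author><name/></author><author><name/></author></book>
  <book><author><name/></author></book>
  <journal><editor/></journal>
</lib>
""".strip()


def label_intervals(root):
    """Label each element (tag, start, end) using one depth-first counter.

    With this labeling, a is an ancestor of d exactly when
    a.start < d.start and d.end < a.end.
    """
    labels = []
    counter = 0

    def walk(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in node:
            walk(child)
        labels.append((node.tag, start, counter))
        counter += 1

    walk(root)
    return labels, counter  # counter is the number of positions used


def position_histogram(labels, tag, num_positions, grid):
    """Count elements with the given tag per (start bucket, end bucket) cell."""
    width = -(-num_positions // grid)  # ceiling division
    hist = Counter()
    for t, start, end in labels:
        if t == tag:
            hist[(start // width, end // width)] += 1
    return hist


def join_bounds(anc_hist, desc_hist):
    """Bracket the ancestor-descendant pair count from histograms only.

    Cells strictly to the right of and below an ancestor cell always
    qualify (lower bound); cells sharing a bucket might qualify (upper bound).
    """
    lower = upper = 0
    for (i, j), a in anc_hist.items():
        for (k, l), d in desc_hist.items():
            if k >= i and l <= j:
                upper += a * d
                if k > i and l < j:
                    lower += a * d
    return lower, upper


def exact_pairs(labels, anc_tag, desc_tag):
    """Exact pair count, computed from the interval labels for comparison."""
    return sum(
        1
        for ta, sa, ea in labels if ta == anc_tag
        for td, sd, ed in labels if td == desc_tag and sa < sd and ed < ea
    )


labels, num_positions = label_intervals(ET.fromstring(DOC))
authors = position_histogram(labels, "author", num_positions, grid=4)
names = position_histogram(labels, "name", num_positions, grid=4)
lower, upper = join_bounds(authors, names)
exact = exact_pairs(labels, "author", "name")
print(f"lower={lower}, exact={exact}, upper={upper}")
```

A practical estimator would turn the cells that share a bucket into a single number rather than a range, for example by assuming some distribution of positions within a cell. That choice is left open here because the abstract does not specify it.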

We present an extensive experimental evaluation using several XML data sets, both real and synthetic, with a variety of queries. Our results demonstrate that accurate and robust estimates can be achieved, with limited space, and at a minuscule computational cost. These techniques have been implemented in the context of the TIMBER native XML database [22] at the University of Michigan.

H.V. Jagadish and Yuqing Wu were supported in part by NSF under grants IIS-9986030 and DMI-0075447. Jignesh M. Patel was supported in part by a research gift donation from NCR Corporation.

References

  1. A. Aboulnaga, A.R. Alameldeen and J.F. Naughton. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. VLDB, 2001.

  2. T. Bray, J. Paoli, and C. M. Sperberg-McQueen. Extensible markup language (XML) 1.0. W3C Recommendation. Available at http://www.w3.org/TR/1998/REC-xml-19980210, Feb. 1998.

  3. A. Broder. On the Resemblance and Containment of Documents. IEEE SEQUENCES '97, pages 21–29, 1998.

  4. D. Chamberlin, J. Clark, D. Florescu, J. Robie, J. Siméon and M. Stefanescu. XQuery 1.0: An XML Query Language. W3C Working Draft, http://www.w3.org/TR/xquery/, June 7, 2001.

  5. Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R.T. Ng, D. Srivastava. Counting Twig Matches in a Tree. ICDE, 2001.

  6. Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan. Selectivity estimation for boolean queries. In Proceedings of the ACM Symposium on Principles of Database Systems, 2000.

  7. Yannis E. Ioannidis. Universality of Serial Histograms. In VLDB, pages 256–267, 1993.

  8. Y.E. Ioannidis, V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In SIGMOD Conference, pages 233–244, 1995.

  9. H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K.C. Sevcik, T. Suel. Optimal Histograms with Quality Guarantees. VLDB, pages 275–286, 1998.

  10. H. V. Jagadish, O. Kapitskaia, R. T. Ng, and D. Srivastava. One-dimensional and multi-dimensional substring selectivity estimation. In VLDB Journal, 9(3), pp. 214–230, 2000.

  11. H. V. Jagadish, L. V. S. Lakshmanan, T. Milo, D. Srivastava, and D. Vista. Querying network directories. In Proceedings of the ACM SIGMOD Conference on Management of Data, Philadelphia, PA, June 1999.

  12. R. J. Lipton and J. F. Naughton. Query size estimation by adaptive sampling. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, March 1990.

  13. J. McHugh and J. Widom. Query optimization for XML. In Proceedings of the International Conference on Very Large Databases, pages 315–326, 1999.

  14. M. Muralikrishna and D.J. DeWitt. Equi-Depth Histograms For Estimating Selectivity Factors For Multi-Dimensional Queries. In SIGMOD Conference, pages 28–36, 1988.

  15. A.R. Schmidt, F. Waas, M.L. Kersten, D. Florescu, I. Manolescu, M.J. Carey and R. Busse. The XML Benchmark Project. Technical Report INS-R0103, CWI, Amsterdam, The Netherlands, April 2001.

  16. M. Wang, J. S. Vitter, and B. Iyer. Selectivity estimation in the presence of alphanumeric correlations. In Proceedings of the IEEE International Conference on Data Engineering, pages 169–180, 1997.

  17. Yuqing Wu, Jignesh M. Patel, H.V. Jagadish. Histogram-based Result Size Estimation for XML Queries. University of Michigan Tech Report, 2002.

  18. C. Zhang, J.F. Naughton, D.J. DeWitt, Q. Luo and G.M. Lohman. On Supporting Containment Queries in Relational Database Management Systems. SIGMOD, 2001.

  19. IBM. XML generator. Available at http://www.alphaworks.ibm.com/tech/xmlgenerator.

  20. TIMBER Group. TIMBER Project at Univ. of Michigan. Available at http://www.eecs.umich.edu/db/timber/.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wu, Y., Patel, J.M., Jagadish, H.V. (2002). Estimating Answer Sizes for XML Queries. In: Jensen, C.S., et al. Advances in Database Technology — EDBT 2002. EDBT 2002. Lecture Notes in Computer Science, vol 2287. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45876-X_37

  • DOI: https://doi.org/10.1007/3-540-45876-X_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43324-8

  • Online ISBN: 978-3-540-45876-0

  • eBook Packages: Springer Book Archive