Estimating Answer Sizes for XML Queries

  • Yuqing Wu
  • Jignesh M. Patel
  • H. V. Jagadish
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2287)


Estimating the sizes of query results, and intermediate results, is crucial to many aspects of query processing. In particular, it is necessary for effective query optimization. Even at the user level, predictions of the total result size can be valuable in “next-step” decisions, such as query refinement. This paper proposes a technique to obtain query result size estimates effectively in an XML database.

Queries in XML frequently specify structural patterns, requiring specific relationships between selected elements. Whereas traditional techniques can estimate the number of nodes (XML elements) that will satisfy a node-specific predicate in the query pattern, such estimates cannot easily be combined to provide estimates for the entire query pattern, since element occurrences are expected to have high correlation.
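To make the correlation problem concrete, the following minimal sketch counts matches for a two-node structural pattern on a toy document. The document, the tag names, and the helper functions `count_nodes` and `count_pairs` are illustrative assumptions, not taken from the paper. It shows only that per-node counts alone do not determine how many ancestor-descendant pairs exist.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document; tags are chosen for illustration only.
DOC = """
<department>
  <faculty><name/><TA/><TA/></faculty>
  <faculty><name/></faculty>
  <staff><TA/></staff>
</department>
""".strip()


def count_nodes(root, tag):
    """Number of elements satisfying the node predicate 'tag == tag'."""
    return sum(1 for _ in root.iter(tag))


def count_pairs(root, anc_tag, desc_tag):
    """Exact number of (ancestor, descendant) pairs matching the pattern."""
    total = 0
    for anc in root.iter(anc_tag):
        # Element.iter() also yields the element itself, so skip it.
        total += sum(1 for d in anc.iter(desc_tag) if d is not anc)
    return total


root = ET.fromstring(DOC)
n_faculty = count_nodes(root, "faculty")
n_ta = count_nodes(root, "TA")
n_pairs = count_pairs(root, "faculty", "TA")
print(n_faculty, n_ta, n_pairs)
```

Moving the `TA` elements between `faculty` and `staff` changes the pair count while leaving both per-node counts unchanged, which is why the node-level estimates cannot simply be multiplied together.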

We propose a solution based on a novel histogram encoding of element occurrence position. With such position histograms, we are able to obtain estimates of sizes for complex pattern queries, as well as for simpler intermediate patterns that may be evaluated in alternative query plans, by means of a position histogram join (pH-join) algorithm that we introduce. We extend our technique to exploit schema information regarding allowable structure (the no-overlap property) through the use of a coverage histogram.
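The abstract does not spell out the histogram encoding or the pH-join formulas, so the sketch below is only an illustration under stated assumptions. It assumes each element is labeled with a (start, end) pair from a depth-first numbering, buckets those pairs into a two-dimensional grid, and estimates ancestor-descendant pairs with a crude uniform-within-cell rule. The functions `label_positions`, `build_histogram`, and `estimate_pairs`, and the 0.5 boundary weight, are hypothetical stand-ins rather than the authors' algorithm.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical toy document; tags are chosen for illustration only.
DOC = """
<department>
  <faculty><name/><TA/><TA/></faculty>
  <faculty><name/></faculty>
  <staff><TA/></staff>
</department>
""".strip()


def label_positions(root):
    """Assumed encoding: (start, end) from one depth-first counter."""
    labels = []  # entries are (tag, start, end)
    counter = 0

    def visit(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in node:
            visit(child)
        labels.append((node.tag, start, counter))
        counter += 1

    visit(root)
    return labels, counter  # counter == number of positions used


def build_histogram(labels, tag, max_pos, grid):
    """Count elements of one tag per (start bucket, end bucket) grid cell."""
    hist = defaultdict(int)
    for t, start, end in labels:
        if t == tag:
            hist[(start * grid // max_pos, end * grid // max_pos)] += 1
    return hist


def estimate_pairs(hist_anc, hist_desc):
    """Simplified join: a descendant needs a larger start and a smaller end.

    Cells strictly inside that region count fully; cells sharing a bucket
    with the ancestor cell get weight 0.5 per shared dimension (assumption).
    """
    total = 0.0
    for (i, j), n_anc in hist_anc.items():
        for (k, m), n_desc in hist_desc.items():
            w_start = 1.0 if k > i else 0.5 if k == i else 0.0
            w_end = 1.0 if m < j else 0.5 if m == j else 0.0
            total += n_anc * n_desc * w_start * w_end
    return total


def exact_pairs(labels, anc_tag, desc_tag):
    """Ground truth from the interval containment test."""
    ancs = [(s, e) for t, s, e in labels if t == anc_tag]
    descs = [(s, e) for t, s, e in labels if t == desc_tag]
    return sum(1 for sa, ea in ancs for sd, ed in descs if sa < sd and ed < ea)


labels, max_pos = label_positions(ET.fromstring(DOC))
exact = exact_pairs(labels, "faculty", "TA")
for grid in (3, max_pos):
    h_anc = build_histogram(labels, "faculty", max_pos, grid)
    h_desc = build_histogram(labels, "TA", max_pos, grid)
    print(grid, estimate_pairs(h_anc, h_desc), exact)
```

When the grid has one bucket per position and the two tags differ, no two elements share a bucket, so the estimate coincides with the exact count. Coarser grids trade accuracy for space, which is the trade-off the experimental evaluation addresses.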

We present an extensive experimental evaluation using several XML data sets, both real and synthetic, with a variety of queries. Our results demonstrate that accurate and robust estimates can be achieved, with limited space, and at a minuscule computational cost. These techniques have been implemented in the context of the TIMBER native XML database [22] at the University of Michigan.


Keywords: Grid Cell, Query Pattern, Result Size, Faculty Node, Descendant Node
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




  1. A. Aboulnaga, A.R. Alameldeen and J.F. Naughton. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. VLDB, 2001.
  2. T. Bray, J. Paoli, and C. M. Sperberg-McQueen. Extensible markup language (XML) 1.0. W3C Recommendation. Available at, Feb. 1998.
  3. A. Broder. On the Resemblance and Containment of Documents. IEEE SEQUENCES '97, pages 21–29, 1998.
  4. D. Chamberlin, J. Clark, D. Florescu, J. Robie, J. Siméon and M. Stefanescu. XQuery 1.0: An XML Query Language. W3C Working Draft, June 7, 2001.
  5. Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R.T. Ng, D. Srivastava. Counting Twig Matches in a Tree. ICDE, 2001.
  6. Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan. Selectivity estimation for boolean queries. In Proceedings of the ACM Symposium on Principles of Database Systems, 2000.
  7. Yannis E. Ioannidis. Universality of Serial Histograms. In VLDB, pages 256–267, 1993.
  8. Y.E. Ioannidis, V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In SIGMOD Conference, pages 233–244, 1995.
  9. H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K.C. Sevcik, T. Suel. Optimal Histograms with Quality Guarantees. VLDB, pages 275–286, 1998.
  10. H. V. Jagadish, O. Kapitskaia, R. T. Ng, and D. Srivastava. One-dimensional and multi-dimensional substring selectivity estimation. In VLDB Journal, 9(3), pp. 214–230, 2000.
  11. H. V. Jagadish, L. V. S. Lakshmanan, T. Milo, D. Srivastava, and D. Vista. Querying network directories. In Proceedings of the ACM SIGMOD Conference on Management of Data, Philadelphia, PA, June 1999.
  12. R. J. Lipton and J. F. Naughton. Query size estimation by adaptive sampling. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, March 1990.
  13. J. McHugh and J. Widom. Query optimization for XML. In Proceedings of the International Conference on Very Large Databases, pages 315–326, 1999.
  14. M. Muralikrishna and D.J. DeWitt. Equi-Depth Histograms For Estimating Selectivity Factors For Multi-Dimensional Queries. In SIGMOD Conference, pages 28–36, 1988.
  15. A.R. Schmidt, F. Waas, M.L. Kersten, D. Florescu, I. Manolescu, M.J. Carey and R. Busse. The XML Benchmark Project. Technical Report INS-R0103, CWI, Amsterdam, The Netherlands, April 2001.
  16. M. Wang, J. S. Vitter, and B. Iyer. Selectivity estimation in the presence of alphanumeric correlations. In Proceedings of the IEEE International Conference on Data Engineering, pages 169–180, 1997.
  17. Yuqing Wu, Jignesh M. Patel, H.V. Jagadish. Histogram-based Result Size Estimation for XML Queries. University of Michigan Tech Report, 2002.
  18. C. Zhang, J.F. Naughton, D.J. DeWitt, Q. Luo and G.M. Lohman. On Supporting Containment Queries in Relational Database Management Systems. SIGMOD, 2001.
  21. IBM. XML generator. Available at
  22. TIMBER Group. TIMBER Project at Univ. of Michigan. Available at

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Yuqing Wu
    • 1
  • Jignesh M. Patel
    • 1
  • H. V. Jagadish
    • 1
  1. Univ. of Michigan, Ann Arbor, USA
