Estimating Answer Sizes for XML Queries
Estimating the sizes of query results, and intermediate results, is crucial to many aspects of query processing. In particular, it is necessary for effective query optimization. Even at the user level, predictions of the total result size can be valuable in “next-step” decisions, such as query refinement. This paper proposes a technique to obtain query result size estimates effectively in an XML database.
Queries in XML frequently specify structural patterns, requiring specific relationships between selected elements. Whereas traditional techniques can estimate the number of nodes (XML elements) that will satisfy a node-specific predicate in the query pattern, such estimates cannot easily be combined to provide estimates for the entire query pattern, since element occurrences are expected to have high correlation.
We propose a solution based on a novel histogram encoding of element occurrence position. With such position histograms, we are able to obtain estimates of sizes for complex pattern queries, as well as for simpler intermediate patterns that may be evaluated in alternative query plans, by means of a position histogram join (pH-join) algorithm that we introduce. We extend our technique to exploit schema information regarding allowable structure (the no-overlap property) through the use of a coverage histogram.
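To make the idea of a position histogram concrete, the following is a minimal sketch, not the pH-join algorithm of the paper. It assumes that each element is labeled with a (start, end) interval, that an ancestor's interval strictly contains those of its descendants, and that positions are spread uniformly and independently within each grid cell. The function names, the grid layout, and the 0.5 weight for elements sharing a bucket are all illustrative assumptions.

```python
from collections import Counter


def build_position_histogram(intervals, grid, max_pos):
    """Count (start, end) intervals per grid cell.

    A cell is keyed by (start bucket, end bucket). The equal-width grid
    is an assumption made for this sketch.
    """
    width = max_pos / grid
    hist = Counter()
    for start, end in intervals:
        i = min(int(start / width), grid - 1)
        j = min(int(end / width), grid - 1)
        hist[(i, j)] += 1
    return hist


def bucket_weight(lower, upper):
    """Assumed chance that a value in bucket `lower` is below one in bucket `upper`."""
    if lower < upper:
        return 1.0
    if lower == upper:
        return 0.5  # crude uniformity assumption within a shared bucket
    return 0.0


def estimate_ancestor_descendant(anc_hist, desc_hist):
    """Estimate the number of (ancestor, descendant) pairs from two histograms.

    A pair qualifies when anc.start < desc.start and desc.end < anc.end.
    """
    total = 0.0
    for (ai, aj), a_count in anc_hist.items():
        for (di, dj), d_count in desc_hist.items():
            weight = bucket_weight(ai, di) * bucket_weight(dj, aj)
            total += weight * a_count * d_count
    return total


def exact_pairs(ancestors, descendants):
    """Brute-force count of containing pairs, for comparison with the estimate."""
    return sum(
        1
        for a_start, a_end in ancestors
        for d_start, d_end in descendants
        if a_start < d_start and d_end < a_end
    )


# Hypothetical toy data: two candidate ancestors and three candidate descendants.
ancestors = [(0, 9), (10, 19)]
descendants = [(1, 2), (3, 4), (11, 12)]

anc_hist = build_position_histogram(ancestors, grid=4, max_pos=20)
desc_hist = build_position_histogram(descendants, grid=4, max_pos=20)

print("estimated pairs:", estimate_ancestor_descendant(anc_hist, desc_hist))
print("exact pairs:", exact_pairs(ancestors, descendants))
```

With a grid this coarse the estimate can differ noticeably from the exact count. A finer grid reduces the number of pairs that fall into shared buckets, at the cost of more space.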
We present an extensive experimental evaluation using several XML data sets, both real and synthetic, with a variety of queries. Our results demonstrate that accurate and robust estimates can be achieved with limited space and at minuscule computational cost. These techniques have been implemented in the context of the TIMBER native XML database at the University of Michigan.