Abstract
Distributing data collections by fragmenting them is an effective way of improving the scalability of a database system. While the distribution of relational data is well understood, the unique characteristics of the XML data and query model present challenges that require different distribution techniques. In this paper, we show how XML data can be fragmented horizontally and vertically. Based on this, we propose solutions to two of the problems encountered in distributed query processing and optimization on XML data, namely localization and pruning. Localization takes a fragmentation-unaware query plan and converts it to a distributed query plan that can be executed at the sites that hold XML data fragments in a distributed system. We then show how the resulting distributed query plan can be pruned so that only those sites are accessed that can contribute to the query result. We demonstrate that our techniques can be integrated into a real-life XML database system and that they significantly improve the performance of distributed query execution.
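To make localization and pruning concrete, the following is a minimal sketch under simplifying assumptions and is not the algorithm used in the paper. It assumes a horizontal fragmentation in which each site holds the `book` documents whose `year` attribute falls in a given range. The names `Fragment`, `localize`, `prune`, and `execute`, the site names, and the `book`/`year`/`title` schema are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Fragment:
    """Hypothetical horizontal fragment: a site plus the @year range it covers."""
    site: str
    year_lo: int  # inclusive lower bound of the fragment predicate
    year_hi: int  # inclusive upper bound of the fragment predicate
    docs: list    # parsed XML document roots stored at this site


def localize(fragments):
    """Localization (simplified): a fragmentation-unaware scan of the whole
    collection becomes one sub-plan per fragment, to be unioned later."""
    return list(fragments)


def prune(plan, q_lo, q_hi):
    """Pruning (simplified): keep only fragments whose year range overlaps
    the query range, so other sites are never contacted."""
    return [f for f in plan if f.year_lo <= q_hi and q_lo <= f.year_hi]


def execute(plan, q_lo, q_hi):
    """Evaluate the range query on each remaining fragment and union results."""
    titles = []
    for frag in plan:
        for root in frag.docs:
            for book in root.iter("book"):
                if q_lo <= int(book.get("year")) <= q_hi:
                    titles.append(book.findtext("title"))
    return titles


def make_book(year, title):
    """Build a tiny illustrative document (assumed schema)."""
    return ET.fromstring(f'<book year="{year}"><title>{title}</title></book>')


fragments = [
    Fragment("site1", 1990, 1999, [make_book(1995, "A")]),
    Fragment("site2", 2000, 2009, [make_book(2003, "B"), make_book(2008, "C")]),
    Fragment("site3", 2010, 2019, [make_book(2012, "D")]),
]

# Query: titles of books with 2005 <= year <= 2012.
plan = prune(localize(fragments), 2005, 2012)
accessed_sites = [f.site for f in plan]
result = execute(plan, 2005, 2012)
print(accessed_sites)  # ['site2', 'site3']
print(result)          # ['C', 'D']
```

In this toy setting the pruning test is a simple interval overlap check; `site1` is skipped because its fragment predicate contradicts the query predicate. Fragmentation predicates over tree-structured data and vertical fragments would need a more general test than the one sketched here.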
Additional information
Communicated by Ahmed K. Elmagarmid.
Cite this article
Kling, P., Özsu, M.T. & Daudjee, K. Scaling XML query processing: distribution, localization and pruning. Distrib Parallel Databases 29, 445–490 (2011). https://doi.org/10.1007/s10619-011-7085-8
DOI: https://doi.org/10.1007/s10619-011-7085-8