Abstract
Several web applications, such as those that process RSS feeds or web service messages, rely on XPath-based data manipulation tools. Web developers need to use XPath queries effectively on increasingly large web collections containing hundreds of thousands of XML documents. Even when a task only deals with a single document at a time, developers benefit from understanding the behaviour of XPath expressions across multiple documents (e.g., what will a query return when run over the thousands of hourly feeds collected during the last few months?). The highly variable structure of such web collections poses additional challenges.
This paper introduces DescribeX, a framework for describing arbitrarily complex XML summaries of web collections. These summaries enable the efficient evaluation of XPath workloads, with support for all the axes and language constructs in XPath. Experiments validate that DescribeX enables existing document-at-a-time XPath tools to scale up to multi-gigabyte XML collections.
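To give a feel for how a structural summary can help a document-at-a-time tool scale, the following is a minimal sketch that is not DescribeX itself. It assumes a much simpler summary (a map from root-to-element label paths to the documents containing them) and handles only plain child-axis paths. The function names `label_paths`, `build_summary`, and `candidate_docs`, as well as the three tiny sample documents, are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def label_paths(xml_text):
    """Return the set of root-to-element label paths in one document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def build_summary(collection):
    """Map each label path to the ids of the documents that contain it."""
    summary = defaultdict(set)
    for doc_id, xml_text in collection.items():
        for path in label_paths(xml_text):
            summary[path].add(doc_id)
    return summary


def candidate_docs(summary, path):
    """Documents that can possibly match a simple child-axis path."""
    return summary.get(path, set())


# Hypothetical sample collection (illustrative data only).
collection = {
    "feed1": "<rss><channel><item><title>a</title></item></channel></rss>",
    "feed2": "<rss><channel><title>b</title></channel></rss>",
    "feed3": "<feed><entry><title>c</title></entry></feed>",
}

summary = build_summary(collection)
hits = candidate_docs(summary, "/rss/channel/item/title")

# Only the candidate documents are handed to the per-document XPath tool.
titles = [
    el.text
    for doc_id in sorted(hits)
    for el in ET.fromstring(collection[doc_id]).findall("./channel/item/title")
]
```

In this sketch the summary is consulted first, so the per-document evaluation runs on one document instead of three. Handling the remaining XPath axes and predicates would require a richer summary than the label-path map assumed here.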
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Consens, M.P., Rizzolo, F. (2007). Fast Answering of XPath Query Workloads on Web Collections. In: Barbosa, D., Bonifati, A., Bellahsène, Z., Hunt, E., Unland, R. (eds) Database and XML Technologies. XSym 2007. Lecture Notes in Computer Science, vol 4704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75288-2_4
DOI: https://doi.org/10.1007/978-3-540-75288-2_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75287-5
Online ISBN: 978-3-540-75288-2
eBook Packages: Computer Science, Computer Science (R0)