Fast Answering of XPath Query Workloads on Web Collections

Consens, Mariano P.; Rizzolo, Flavio

doi:10.1007/978-3-540-75288-2_4


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4704)

Abstract

Several web applications (such as processing RSS feeds or web service messages) rely on XPath-based data manipulation tools. Web developers need to use XPath queries effectively on increasingly large web collections containing hundreds of thousands of XML documents. Even when tasks only need to deal with a single document at a time, developers benefit from understanding the behaviour of XPath expressions across multiple documents (e.g., what will a query return when run over the thousands of hourly feeds collected during the last few months?). Dealing with the (highly variable) structure of such web collections poses additional challenges.
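To make the multi-document question concrete, the following is a minimal sketch of the document-at-a-time baseline: one path expression evaluated against every document in a small collection. The feed names and contents are invented for illustration, and Python's `xml.etree.ElementTree` is used only because it ships with the standard library; it supports a limited subset of XPath, not the full language the paper targets.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-document "collection"; a real one would be files on disk.
FEEDS = {
    "feed-01.xml": "<rss><channel><item><title>A</title></item>"
                   "<item><title>B</title></item></channel></rss>",
    "feed-02.xml": "<rss><channel><title>Empty feed</title></channel></rss>",
    "feed-03.xml": "<rss><channel><item><title>C</title>"
                   "<author>x</author></item></channel></rss>",
}


def run_over_collection(docs, path):
    """Evaluate one path expression against every document, one at a time."""
    results = {}
    for name, text in docs.items():
        root = ET.fromstring(text)
        # The path is relative to the root element of each document.
        results[name] = [el.text for el in root.findall(path)]
    return results


hits = run_over_collection(FEEDS, "./channel/item/title")
for name, titles in hits.items():
    print(name, titles)
```

Every document is parsed even when it cannot contribute a result (here, `feed-02.xml` has no `item` elements), which is the cost that grows with the size of the collection.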

This paper introduces DescribeX, a powerful framework that is capable of describing arbitrarily complex XML summaries of web collections, enabling the efficient evaluation of XPath workloads (supporting all the axes and language constructs in XPath). Experiments validate that DescribeX enables existing document-at-a-time XPath tools to scale up to multi-gigabyte XML collections.
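The abstract does not define DescribeX's summaries, so the sketch below is not the paper's method. It assumes a much simpler structure, a label-path summary that records which documents contain each root-to-node path, to illustrate how a summary could let a document-at-a-time tool skip documents that cannot match. The function names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample collection, invented for this illustration.
DOCS = {
    "feed-01.xml": "<rss><channel><item><title>A</title></item></channel></rss>",
    "feed-02.xml": "<rss><channel><title>Empty feed</title></channel></rss>",
    "feed-03.xml": "<rss><channel><item><title>C</title>"
                   "<author>x</author></item></channel></rss>",
}


def label_paths(root):
    """Return the set of root-to-node label paths occurring in one document."""
    paths = set()
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def build_summary(docs):
    """Map each label path to the names of the documents that contain it."""
    summary = defaultdict(set)
    for name, text in docs.items():
        for path in label_paths(ET.fromstring(text)):
            summary[path].add(name)
    return summary


summary = build_summary(DOCS)
# Only documents listed under the queried path need to be opened at all.
candidates = sorted(summary.get("/rss/channel/item/author", set()))
print(candidates)
```

A summary of this kind only handles simple child-axis paths; supporting all XPath axes and constructs, as the paper claims for DescribeX, would require a richer summary than this assumption provides.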



Author information

Authors and Affiliations

University of Toronto
Mariano P. Consens & Flavio Rizzolo


Editor information

Denilson Barbosa, Angela Bonifati, Zohra Bellahsène, Ela Hunt, Rainer Unland


About this paper

Cite this paper

Consens, M.P., Rizzolo, F. (2007). Fast Answering of XPath Query Workloads on Web Collections. In: Barbosa, D., Bonifati, A., Bellahsène, Z., Hunt, E., Unland, R. (eds) Database and XML Technologies. XSym 2007. Lecture Notes in Computer Science, vol 4704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75288-2_4


DOI: https://doi.org/10.1007/978-3-540-75288-2_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75287-5
Online ISBN: 978-3-540-75288-2
eBook Packages: Computer Science, Computer Science (R0)
