OXPath: A language for scalable data extraction, automation, and crawling on the deep web
DOI: 10.1007/s00778-012-0286-6
 Cite this article as:
 Furche, T., Gottlob, G., Grasso, G. et al. The VLDB Journal (2013) 22: 47. doi:10.1007/s00778-012-0286-6
Abstract
The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.
Keywords
Web extraction · Crawling · Data extraction · Automation · XPath · DOM · AJAX · Web applications
1 Introduction
The dream that the wealth of information on the web is easily accessible to everyone is at the heart of the current evolution of the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed, many invaluable web services, such as Amazon, Facebook, or Pandora, already offer limited automation, focusing on filtering or recommendation. But in many cases, we cannot expect data providers to comply with yet another interface designed for automatic processing. Neither can we afford to wait another decade for publishers to implement these interfaces. Rather, data should be extracted from existing human-oriented user interfaces. This lessens the burden for providers, yet allows automated processing of everything accessible to human users, not just arbitrary fragments exposed by providers. This approach complements initiatives, such as Linked Open Data, which push providers toward publishing in open, interlinked formats.
For automation, data accessible to humans through existing interfaces must be transformed into structured data, for example, each gray span with class source on Google News should be recognized as a news source. These observations call for a new generation of web extraction tools, which (1) can interact with rich interfaces of (scripted) web applications by simulating user actions, (2) provide extraction capabilities sufficiently expressive and precise to specify the data for extraction, (3) scale well even if the number of relevant web sites is very large, and (4) are embeddable in existing programming environments for servers and clients.
Previous approaches to web extraction languages [34, 46] use a declarative approach akin to OXPath; however, mainly due to their age, they often do not adequately facilitate deep web interaction, for example, form submissions. Also, they do not provide native constructs for page navigation, apart from retrieving pages for given URIs. Where scripting is addressed [12, 47], the simulation of user actions is neither declarative nor succinct, but rather relies on imperative scripts and standalone, heavyweight extraction interfaces. Lixto [12], Web Content Extractor [2], and Visual Web Ripper [3] are moving towards interactive wrapper generator frameworks, recording user actions in a browser and replaying these actions for extracting data. As large commercial extraction environments, their feature set goes beyond the scope of OXPath. They all emphasize the visual aspect of wrapper generation, which eases the design of extraction tasks by requiring only a few examples, mainly selecting layout elements on the rendered page. Further, Lixto adopts tree generalization techniques to produce more robust wrappers, whereas Web Content Extractor allows writing complex user-defined text manipulation scripts. However, despite their feature richness, none of these systems addresses memory management, and our experimental evaluation (Sect. 6) demonstrates that such systems indeed take memory linear in the number of accessed pages.
OXPath restricts its focus to data extraction in the context of deep web crawling. This sets OXPath apart from information extraction (IE) systems that aim at extracting structured data (entities and relations) from unstructured text. Systems like [11, 23] and [19] extract factual information from textual descriptions (mainly by lexico-syntactic patterns) and HTML tables on the web, respectively. Though OXPath could be extended to support libraries of user-defined functions to refine extraction from text (along the lines of procedural predicates in [46]), this is out of the scope of this paper. Whereas OXPath takes advantage of the structures on the web (possibly revealed through simulating user interaction) to extract, for example, infoboxes on Wikipedia or products on Amazon along with all their reviews, the task of extracting, for example, named entities from these reviews, is not addressed by OXPath, but delegated to post-processing of the extracted information, for example, using an IE system.
As far as web automation tools are concerned, though some of them [17, 31] can deal with scripted web applications, they are tailored to single action sequences and prove to be inconvenient and inefficient in large-scale extraction tasks requiring multi-way navigation (Sect. 6).
It is against this backdrop that we introduce OXPath, a careful, declarative extension of XPath for interacting with web applications to extract information revealed during such interactions. It extends XPath with a few concise extensions, yet addresses all the above desiderata:
1—Interaction. OXPath allows the simulation of user actions to interact with the scripted multi-page interfaces of web applications: (I) Actions are specified declaratively with action types and context elements, such as the links to click on, or the form field to fill. (II) In contrast to most previous web extraction and automation tools, actions have a formal semantics (Sect. 4.4) based on a (III) novel multi-page data model for web applications that captures both page navigation and modifications to a page (Sect. 4.3).
2—Expressive and precise. OXPath inherits the precise selection capabilities of XPath (rather than heuristics for element selection as in [17]) and extends them: (I) OXPath allows selection based on visual features by exposing all CSS properties via a new axis. (II) OXPath deals with navigation through page sequences, including multi-way navigation, for example, following multiple links from the same page, and unbounded navigation sequences, for example, following next links on a result page until there is no further such link. (III) OXPath provides intensional axes to relate nodes through multiple conditions, for example, to select all nodes that are at the same vertical position and have the same color as the current node. (IV) OXPath enables the identification of data for extraction, which can be assembled into (hierarchical) records, regardless of its original structure. (V) Based on the formal semantics of OXPath (Sect. 4.4), we show that its extensions considerably increase the language’s expressiveness (Sect. 4.10).
3—Scale. OXPath scales well both in time and in memory: (I) We show that OXPath’s memory requirements are independent of the number of pages visited (Sect. 5). To the best of our knowledge, OXPath is the first web extraction tool with such a guarantee, as confirmed by a comparison with five commercial and academic web extraction tools. (II) We show that the combined complexity of evaluating OXPath remains polynomial (Sect. 4.10) and is only slightly higher than that of XPath (Sect. 5). (III) We also show that OXPath is highly parallelizable (Sect. 4.10). (IV) We provide a normal form that reduces the size of the memoization tables during evaluation and rewriting rules to normalize arbitrary expressions (Sect. 4.9). (V) We verify these theoretical results in an extensive experimental evaluation (Sect. 6), showing that OXPath outperforms existing extraction tools on large-scale experiments by at least one order of magnitude.
4—Embeddable, standard API. OXPath is designed to integrate with other technologies, such as Java, XQuery, or JavaScript. Following the spirit of XPath, we provide an API and host language to facilitate OXPath’s interoperation with other systems.
Bonus: Open Source. We provide our OXPath implementation and API at http://diadem.cs.ox.ac.uk/oxpath, for distribution under the new BSD license.
OXPath has been employed within DIADEM [24], a domaindriven, largescale data extraction framework developed at Oxford University, proving to be a practically viable tool for (1) succinctly describing web interaction and extraction tasks on sophisticated web interfaces, for (2) generating and processing such task descriptions, and for (3) efficiently executing these wrappers on the cloud.
This article extends the conference version of OXPath [25] in three ways: (1) It clarifies the design of OXPath and introduces intensional axes (Sect. 4.7) as a further instrument for extracting data from rich web applications. With intensional axes, the OXPath user can specify, on the fly, new types of relations between elements of a web page, for example, to select all paragraphs with the same color and dimension.
(2) The page-at-a-time evaluation algorithm (Sect. 5) has been further refined to cater to these additions and to improve the complexity bounds from [25]: By splitting and specializing the memoization table, we achieve a reduction by up to a factor of \(n\) in time and memory. An additional factor of \(n\) reduction (at a slight increase in expression size) can be achieved by applying a new normalization rewriting (Sect. 4.9).
(3) An extensive study (Sect. 5.7) of the impact of the main features of OXPath (extraction markers, actions, and Kleene-star iteration) on evaluation performance is used to define a normal form for OXPath (Sect. 4.9) together with a sound and complete rewriting.
1.1 A gentle introduction to OXPath
OXPath extends XPath with five concepts: actions to navigate the interface of web applications, means for interacting with highly visual pages, intensional axes to identify nodes by multiple relations, extraction markers to specify data to extract, and the Kleene star to extract from a set of pages with unknown extent.
Actions. For simulating user actions such as clicks or mouseovers, OXPath introduces (i) contextual action steps, as in {click}, and (ii) absolute action steps with a trailing slash, as in {click /}. Since actions may modify or replace the DOM, we assume that they always return a new DOM. Absolute actions continue at DOM roots, and contextual actions continue at those nodes in the new DOM matched by the action-free prefix (Sect. 4.4) of the performed action. This prefix is obtained from the segment starting at the previous absolute action by dropping all intermediate contextual actions and extraction markers.
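To make the distinction concrete, consider the following minimal sketch (the URL and field titles are illustrative assumptions, not from the paper): the first action is contextual and continues at the select box as re-matched on the new DOM, while the second is absolute and continues at the root of the newly retrieved page.

```
doc("http://example.com/search")
  //field()[@title='Category']/{"Books"}
  /following::field()[@type='submit']/{click /}
```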
Style Axis and Visible Field Access. For lightweight visual navigation, we expose the computed style of rendered HTML pages with (i) a new axis for accessing CSS DOM node properties and (ii) a new node test for selecting only visible form fields. The style axis navigates the actual CSS properties of the DOM style object, for example, it is possible to select nodes by their (rendered) color or font size. To ease field navigation, we introduce the node test field() that relies on the style axis to access the computed CSS style to exclude non-visible fields, for example, /descendant::field()[1] selects the first visible field in document order.
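As a hedged illustration of both features (the URL is an assumption; property names follow the CSS DOM), the first expression below selects all elements rendered in red at font size 16px, and the second selects the first visible form field:

```
doc("http://example.com")//*[style::color='red'][style::font-size='16px']

doc("http://example.com")/descendant::field()[1]
```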
Intensional axes allow comparing node sets by more than one relation, using the reserved variables $lhs and $rhs.
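A sketch of such an intensional axis step follows (the precise operator syntax is given in Sect. 4.7; we write the subset test as ⊆): it selects every node bound to $rhs that follows the current node $lhs in document order and shares its font size.

```
/[$rhs ⊆ $lhs/following::* and
  $lhs/style::font-size = $rhs/style::font-size]::*
```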
The nesting in the result mirrors the structure of the OXPath expression: extraction markers in a particular predicate, such as title and source, yield attributes associated with the last marker outside this predicate, in our example the story marker.
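Such nesting arises, for example, in a Google News wrapper along the following lines (a sketch only; the class names story, title, and source are illustrative assumptions): title and source are extracted inside predicates and thus become attributes of the enclosing story record.

```
doc("http://news.google.com")//div.story:<story>
   [.//h2.title:<title=string(.)>]
   [.//span.source:<source=string(.)>]
```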
To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, for example, (...)*{3,8}.
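For instance, the following sketch (URL and class names are assumptions) follows “Next” links at most eight times, extracting an item record with its title from every visited result page:

```
doc("http://example.com/results")
  /(//a[.~'Next']/{click /})*{0,8}
  //div.result:<item>[.//h3:<title=string(.)>]
```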
2 Application scenario
History Books on Seattle. To extract data about history books on Seattle as offered on amazon.co.uk, a user has to perform the following sequence of actions to retrieve the page listing these books (see Fig. 1): (1) Select “Books” from the “Search in” select box, (2) enter “Seattle” into the “Search for” text field, and (3) press the “Go” button to start the search. On the returned page, (4) refine the search to only “History” books, (5) and open the details page for retrieving further details. Figure 1 shows an OXPath expression that realizes this extraction (each action is numbered according to the involved step). Lines 1–5 implement the above steps: To select the two input fields, we use OXPath’s field() node test (matching only visible form elements) and each node’s title attribute (@title). A contextual action (enclosed in {}) selects “Books” from the select box and continues the navigation from that field. The other actions are not contextual but absolute (with an added / before the closing brace) to continue at the root of the page retrieved by the action. To select the “History” link, we adopt the . notation from CSS for selecting elements with a class attribute refinementLink and use OXPath’s \(\sim \) shorthand for XPath’s contains() function to match the “History” text.
For the obtained books, we extract their title, price, and publisher in step (5), as shown in Lines 6–8 of Fig. 1: The element with class result serves as indicator of book records, denoted by the record extraction marker: \(<\)book\(>\). From there, we navigate to the contained title links, extract their value as a title attribute, and click on the link to obtain the page for the individual book, where we find and extract the publisher. Finally, we extract the price from the previous page—without caring for the order in which the pages are visited during extraction. OXPath buffers pages when necessary, yet guarantees that the number of buffered pages is independent of the number of visited pages.
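An expression in the spirit of Fig. 1 can be sketched as follows (the class names result, title, and price, and the structure of the details page, are illustrative assumptions): the book record marker encloses value markers for title, publisher (reached via a contextual click), and price.

```
doc("http://www.amazon.co.uk")
   //field()[@title='Search in']/{"Books"}
   //field()[@title='Search for']/{"Seattle"}
   /following::field()[@value='Go']/{click /}
   //a.refinementLink[.~'History']/{click /}
   //div.result:<book>
      [.//a.title:<title=string(.)>/{click /}
         //td[.~'Publisher']:<publisher=string(.)>]
      [.//span.price:<price=string(.)>]
```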
Figure 2 shows an OXPath solution for this extraction task: Lines 1–3, navigate to Google Scholar, fill and submit the search form. Line 4 realizes the iteration over the set of result pages by repeatedly clicking the “Next” link, denoted with a Kleene star. Lines 5–6 identify a result record and its author and title, lines 7–9 navigate to the citedby page and extract the papers. The expression yields nested records of the following shape:
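A hedged sketch in the spirit of Fig. 2 follows (the class names gs_r and gs_a and the field positions are assumptions): the Kleene-starred step iterates over result pages, and the cited-by papers nest under each paper record.

```
doc("http://scholar.google.com")
   /descendant::field()[1]/{"OXPath"}
   /following::field()[1]/{click /}
   /(//a[.~'Next']/{click /})*
   //div.gs_r:<paper>
      [.//h3:<title=string(.)>]
      [.//div.gs_a:<authors=string(.)>]
      [.//a[.~'Cited by']/{click /}
         //div.gs_r:<citedby>[.//h3:<title=string(.)>]]
```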
Stock Quotes from Yahoo Finance. Figure 4 illustrates an OXPath expression extracting stock quotes from Yahoo Finance. In particular, note the use of optional predicates ([? ]) for conditional extraction: If the change is formatted in red, it is prefixed with a minus, otherwise with a plus.
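In the spirit of Fig. 4, such conditional extraction can be sketched as follows (URL and class names are illustrative assumptions): each optional predicate applies only where its condition holds, so exactly one of the two change attributes is produced per record.

```
doc("http://finance.yahoo.com/q?s=GOOG")
   //div.quote:<stock>
      [? .//span.change[style::color='red']:<change=concat('-', string(.))>]
      [? .//span.change[not(style::color='red')]:<change=concat('+', string(.))>]
```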
3 Preliminaries: XPath data model
XPath is used to query XML documents modeled as unranked and ordered trees of nodes. The set of nodes within a document are given as DOM, with nodes of seven types, namely root, element, text, comment, attribute, namespace, and processing instruction. Documents begin at a unique root node with elements as the most common nonterminal. All nodes except comment and text have an associated name (or label). Within a document, nodes \(x\) and \(y\) are ordered by document order, which is defined as the binary relation \(x <_{doc} y\), iff the opening tag of \(x\) occurs before the opening tag of \(y\) in the well-formed XML document. Node types and labels are formally represented in this paper through a set of unary relations, \((\mathsf{{unary}} _\nu )_{\nu \in \mathsf{{Unary}}}\) with, for example, \(\mathsf{{text}}, \mathsf{{element}}, \mathtt{a}\, \in \mathsf{{Unary}} \) (all text nodes, elements, and \(\mathtt{a}\)-labeled nodes).
An XPath query result has one of four possible data types: either (1) an unordered collection of distinct document nodes, called a node set, or a scalar value of type (2) Boolean, (3) string, or (4) number.
Axis navigation is further refined by node tests. In addition to a wildcard node test (node()) covering all document nodes, XPath defines node tests to filter by each of the unnamed node types (text(), processinginstruction(), and comment()). Node tests can also filter elements by name or by \(*\), which is a wildcard for all elements.
Finally, steps can be filtered further with an arbitrary number of predicate expressions. Predicates contain a single input expression and return each node from the context that evaluates to true for its input expression. We leave further details on XPath to Sect. 4, where we discuss XPath implicitly as OXPath’s sublanguage.
4 Language
4.1 Design principles
(1) Spirit of XPath. OXPath maintains, where possible, the principles upon which XPath is built, in particular the use of a single, navigational expression, polynomial time evaluation, and concise syntax. Where we extend XPath, we do so using existing web standards such as CSS or DOM events. (I) Single expression. OXPath expressions are path expressions just like plain XPath. We choose not to extend OXPath with a separate “construct” clause specifying the result of the expression (as in SPARQL, SQL, or XQuery), but rather to embed the specification of the result of the extraction into the path expression through extraction markers. This requires the shape of the expression to mirror the shape of the extracted result, a limitation, however, that is insignificant due to XPath’s flexible axes.^{1} (II) Tree result. The result of an OXPath expression is a tree constructed from matches for the record extraction markers and their attributes. This poses only a small limitation, as data on the web are usually presented in a hierarchical way. (III) Polynomial time. We design OXPath to remain polynomial and in two-variable logic for both selection and construction (see Sect. 4.10). Though there are cases where full first-order logic is necessary for extraction (e.g., to refer back to values encountered on other pages), we believe that OXPath presents a more useful tradeoff in most cases (see Sect. 2).
(2) Low memory. The second core principle is that OXPath should not require an unbounded buffer for pages, but rather be able to extract from hundreds of thousands of pages with very little memory use. This is necessary for large-scale data extraction. (I) No page identity. OXPath does not manage page “identity”: If two links lead to the same URL, OXPath considers the pages reached by clicking on those links as distinct. This avoids issues with server state where the same URL returns different results at different times or points in an interaction. It also avoids the need to maintain pages in OXPath in case they are later encountered again. (II) No back. OXPath does not allow the reverse (or “back”) navigation over pages. Once we have moved from page \(A\) to \(B\), there is no way back to \(A\). This is a limitation as it (together with the lack of variables) prohibits a class of wrappers that refer back to values encountered on earlier pages. However, it is essential to maintain the low memory profile of OXPath.
4.2 Syntax
Even with the added selection capabilities of CSS, OXPath’s text extraction capabilities are still rather weak compared to full IE systems. We are currently extending OXPath with rich text processing operators (e.g., regular expression matching) inspired by XPath 2.0. However, proper entity or relation extraction is beyond the scope of OXPath and, if necessary, can be performed in a post-processing step.
Finally, we define the main path to a step \(s\) in an expression \(e\) as the sequence of steps occurring on the path from the root of \(e\)’s expression tree to the step \(s\). For example, given \(e=\)a[b[c]/d]/e, we obtain for \(s=\)d and \(s=\)e, respectively, the main paths a/b/d and a/e.
The grammar in Fig. 6 omits a few restrictions necessary to avoid an undesirable interplay of our new language features with functions, predicates, and sorting operators, as well as to avoid unintuitive extraction: (R1) Actions and extraction markers may not occur in other extraction markers, function, or operator arguments. Further, extraction markers may not occur inside intensional axes (see Sect. 4.7). This limitation is mainly for simplicity, although actions in operator arguments can also affect the low-memory principle of OXPath (see Sect. 4.1). (R2) We disallow position() on the node set obtained from (Kleene-starred) bracketed expressions having actions or extraction markers in their main paths, to avoid sorting nodes that originate from different documents or relate to different extraction markers. This restriction is necessary to maintain the low-memory goal of OXPath, as allowing it would require storing all these nodes from different pages. Expressions with extractions inside Kleene stars are rewritable into expressions satisfying this restriction, as shown in Theorem 2. (R3) Extraction markers may only occur in a predicate if there is a marker on the path leading to the predicate and the last such marker does not extract a value. Extraction markers extracting a value may not occur outside a predicate. The value of an extraction marker must yield a scalar value. These three restrictions ensure a sane use of extraction markers such that expressions always produce a proper output tree (“tree result” principle). (R4) All predicates that are followed by a contextual action with no absolute action in between must be free of actions. This ensures that the action-free prefix (see Sect. 4.4) of a contextual action does not contain any actions in predicates and thus can safely be executed multiple times on different pages without violating the “no page identity” principle.
4.3 Data model
The remaining XPath axes, such as \(\mathsf{{descendant}}\), are derived from the basic relations, as in Sect. 3. We also add for each OXPath action \(\alpha \) a binary relation \((\mathsf{{action}} _{\alpha })_{\alpha \in \mathsf{{Action}}}\). Herein, \(\mathsf{{action}} _\alpha (x,y)\) indicates that action \(\alpha \) triggered on node \(x\) yields the page rooted at \(y\). The actions \((\mathsf{{action}} _{\alpha })_{\alpha \in \mathsf{{Action}}}\) and the \(\mathsf{{child}}\) axis together form a tree: Each page has a unique parent node connected by a single action edge, that is, if two links point to the same URI, they yield two different pages in the page tree.
4.4 Semantics
The semantics of OXPath is defined with its extraction semantics \(\left[\![\, { expr}\,\right]\!]_{\mathsf{E}}^\beta (c)\), specifying the result tree for expression \(expr\), context tuple \(c\), and variable assignment \(\beta \). Each context tuple \(c=\left<n,p,l\right>\) consists of an input node \(n\), the parent output node \(p\), and the last sibling output node \(l\), accessed through the notation \(c.n\), \(c.p\), \(c.l\). We define the extraction semantics atop the value semantics \(\left[\![\, { expr}\,\right]\!]_{\mathsf{V}}^\beta (c)\), which matches nodes from the page tree, akin to XPath’s semantics in [13]. The latter computes the value reached via \(expr\), which is either a set of context tuples, a Boolean, integer, or string value. For lucidity’s sake, we first develop the semantics for the OXPath language without the intensional axes, which we add to the semantics in Sect. 4.7.
The OXPath semantics deviates from XPath in the contents of its context tuples. We maintain both the preceding parent and sibling match to organize the extracted nodes hierarchically: At context \(c=\left<n,p,l\right>\), an extraction match outside predicates (1) yields an output tuple \(\left<o,c.p,M,v\right>\), with a fresh output node \(o\), becoming descendant of \(c.p\) and sibling of \(c.l\), and (2) changes the context to \(c^{\prime }=\left<c.n,c.p,o\right>\). On entering a predicate, \(c.l\) replaces \(c.p\) as parent output node, such that further extraction matches yield nodes that are nested as descendants of \(c.l\) instead of \(c.p\).
Table 1 Value semantics of OXPath
Table 2 Extraction semantics of OXPath (partial)
4.5 Value semantics
In Table 1, Rule V1, \(\left[\![\, path\,\right]\!]_{\mathsf{V}}\) delegates the evaluation of a \(path\) to \(\left[\![\, path\,\right]\!]_{\mathsf{N}}\), which handles expressions computing node sets. The Rules V2–V4 deal with functions and operators and apply the corresponding semantic functions on the evaluated subexpressions and operands. Rule V5 handles literals and Rule V6 maps a variable \(\$var\) to its value \(\beta (\$var)\).
The major part of the value semantics, consisting of Rules N1–N10, deals with \(path\) expressions. At first, Rule N1 decomposes a given \(path\) into its first element, which is either a \(step\) or an \(istep\), and the remaining tail path. Then, the semantics evaluates the \((i)step\) with Rules N2–N10 and recursively evaluates the tail path on the resulting node set.
N2–N5: Axes, node tests, and predicates. In Rules N2 and N3, we handle OXPath axes and node tests as in standard XPath. The case of predicates in Rules N4 and N5 follows XPath as well, but requires additional provisions: To manage the nesting of the extracted data, upon entering a predicate, OXPath takes the last sibling output node as new parent output node, both in Rule N4 and in N5. Expressions in predicates are cast to Booleans by means of \(\left[\![\, expr\,\right]\!]_{\mathsf{B}}\) with Rule B1, as in XPath. Rule N5, for positional predicates, relies on two new functions, rewrite\(_{+}\) and rewrite\(_{-}\), of the form \(expr \times C \times c \rightarrow expr^{\prime }\), where \(expr\) is an input expression, \(C\) is a context set, \(c \in C\), and \(expr^{\prime }\) is a rewritten expression as follows: Both functions replace in the input expression each non-nested occurrence of last() with the size of \(C\) and of position() with the position of \(c\) within \(C\) according to document order. This order is well-defined within all possible context sets at this point, since all such context sets contain only tuples with nodes from the same page with identical parent and sibling matches: If \((i)step\) starts with an axis/node navigation or an action, this property holds. Otherwise, \((i)step\) must be a bracketed (Kleene star) expression, and in this case, the expression is not allowed to contain actions or extraction markers (Restriction (R2) on page 7), preventing any changes in the underlying page or context tuples. We write rewrite\(_{\pm }\) for conciseness, where the specific function applied depends on the \((i)step\): For axis navigation, we choose rewrite\(_{+}\) for forward and rewrite\(_{-}\) for reverse axes. Otherwise, for (Kleene-starred) bracketed or action expressions, we always select rewrite\(_{+}\).^{2}
N6–N7: Actions. Actions map the context node \(c.n\) to a node in a different page. Absolute actions in Rule N6 map \(c.n\) to the root \(n^{\prime }\) of some other page with \((\mathsf{{action}} _\alpha )_{\alpha \in \mathsf{{Action}}}\). By our data model, we assume that the node \(n^{\prime }\) is different from any other node previously reached. For contextual actions, Rule N7 first also evaluates the absolute action to obtain the root of the new page, but then attempts to move to a node that corresponds closely to the node \(c.n\) on the original page. We select the corresponding node by applying the same OXPath expression to the new page root as we used to select \(c.n\) in the original page. The action-free prefix afp\((action,c.n)\) of \(action\) and \(c.n\) in OXPath expression \(expr\) returns the following OXPath expression: Let \(base\) be the subexpression of \(expr\) between \(action\) and its last preceding absolute action, stripped of all extraction markers and all contextual actions occurring in the main path of \(expr\). Then, afp\((action,c.n) = (base)[i]\) where \(i\) is the position of \(c.n\) in document order among all nodes in the current page matching \(base\). In the original page, this expression uniquely identifies \(c.n\). When evaluated from the root returned by the absolute action on \(c.n\), it selects a unique node reached by the same path and in the same relative position.
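To illustrate, consider the hypothetical expression below: the action-free prefix of the contextual action {mouseover} starts after the preceding absolute action {click /}, drops the extraction marker :<d>, and fixes the relative position of the context node.

```
expression:   doc("u")//a[1]/{click /}//div:<d>//b/{mouseover}
afp:          (//div//b)[i]
```

Here \(i\) is the position of \(c.n\) in document order among all nodes on the current page matching //div//b.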
N8: Extraction markers. For extraction markers, the context set of the corresponding step is modified by replacing the last sibling output node \(c.l\) with the one generated from this marker \(M\) and the current node \(c.n\). OXPath computes the new node with out\((c.n, M)\), where out injectively maps an input node \(c.n\) and extraction marker \(M\) to an output node.
N9–N10: Kleene star. Unbounded and bounded Kleene-starred expressions match the Kleene-repeated path multiple times, enforcing optional iteration bounds.
N11: Empty expressions. Return the context node.
4.6 Extraction semantics
The extraction semantics \(\left[\![\, expr\,\right]\!]_{\mathsf{E}}(c)\) for OXPath in Table 2 takes context tuple \(c\) and extracts an output tree from the input page tree. For the sake of brevity, we omit those rules that recursively decompose the expression but have no other effect. Except for extraction markers, the extraction semantics is straightforward: For expressions with subexpressions, we compute the extraction semantics for each subexpression and take the union of all extracted results. To compute the extraction semantics for a subexpression \(expr\), we first compute with the value semantics \(\left[\![\, \cdot \,\right]\!]_{\mathsf{V}}\) the context set on which to apply \(expr\), and then apply \(\left[\![\, expr\,\right]\!]_{\mathsf{E}}\) recursively on each obtained context node. Rules E1 and E2 exemplify this case for paths and predicates. For all other expressions not shown here, we collect all extraction markers returned by their subexpressions (if any), regardless of the (value) semantics of the involved expressions.
Extraction markers are treated in Rules E3 and E4: For markers without extracted values (Rule E3), OXPath extracts the tuple \(\left<{out}(c.n,M),c.p,M,\mathsf{{null}}\right>\). The resulting node is thus a child of the parent match \(c.p\). For markers with values (Rule E4), we evaluate additionally the value expression \(v\): We take the value returned by \(\left[\![\, v\,\right]\!]_{\mathsf{V}}(c)\) (a string or other scalar) and output \(\left<{out}(c.n,M),c.p,M,\left[\![\, v\,\right]\!]_{\mathsf{V}}(c)\right>\).
4.7 Intensional axis
In XPath, axes relate nodes through a fixed set of relations such as child or following. Together with functions and operators, these are the only means in XPath for relating two nodes. Unfortunately, these means are rather limited, for example, we cannot identify all nodes that follow the current node in document order and are displayed in the same font size as the current node. In general, XPath is not able to express queries where nodes from two node sets are related by more than one relation.
Theorem 1
Let \(\fancyscript{I}\) be the set of firstorder queries of the form \(Q(x,y) \Leftarrow \phi (x,y) \wedge \psi (x,y)\) where \(\phi (x,y)\) and \(\psi (x,y)\) are nonempty firstorder formulas expressible in XPath. Then there are queries in \(\fancyscript{I}\) that cannot be expressed in XPath.
Proof
(sketch) From [13] and [36], this follows for navigational XPath, which expresses exactly all XPNF queries. An XPNF query is a FO\(^2\) query over page trees (without action relations) built from relations between two node sets, which are limited to disjunctions of binary atomic formulas.
For full XPath, we need to show that relational operators, functions, aggregation, and positional arithmetic do not allow us to express multiple relations between two node sets. All queries in \(\fancyscript{I}\) relate two nodes, and thus we can ignore boolean and other value queries.
First, outside predicates, id() is the only functional operator allowed in such queries. As id() returns the same result for any context node, it does not relate multiple nodes.
Second, inside predicates, XPath allows conjunctions of functions, relational operators, and aggregations. However, the only “shared variable” between such conjuncts is the context node (see V1–V6 in Table 1). Though it is possible to build up several node sets originating from the same context node, once constructed, their individual elements are only accessible via quantification. For example, [.//a=.//b and .//a=.//c] does not require the existence of a single \(a\) node having the same value as some \(b\) and some \(c\): The predicate is already satisfied if there exist two \(a\) nodes that match, respectively, the values of some \(b\) and \(c\).
Though the same context set can be matched by multiple conjuncts (by using two equivalent subexpressions), the individual nodes in these node sets cannot be related. This even holds for count(), which can be used to relate entire node sets, but not individual nodes. \(\square \)
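The existential reading of such predicates, as used in the proof, can be simulated over toy value sets; the sets `a_values`, `b_values`, and `c_values` below are illustrative stand-ins for the string values of the respective nodes, not part of XPath.

```python
# Sketch of the existential semantics: [.//a = .//b and .//a = .//c]
# holds if SOME a matches some b and SOME (possibly different) a matches
# some c. Toy value sets, not a real XPath evaluator.
a_values = {"x", "y"}   # two distinct a descendants
b_values = {"x"}
c_values = {"y"}

conj1 = any(a == b for a in a_values for b in b_values)
conj2 = any(a == c for a in a_values for c in c_values)
predicate = conj1 and conj2   # satisfied, by two *different* a nodes

# no single a node matches both some b and some c:
single_a = any(a in b_values and a in c_values for a in a_values)
```

The predicate is satisfied even though `single_a` is false, matching the argument in the proof.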
Given an intensional axis expression (defined below), we obtain the context set \(C\) containing all nodes \(c^{\prime }\) such that the expression inside the brackets evaluates to true for the variable assignment \(\beta ^{\prime } = \beta [\textit{\$lhs}\leftarrow c.n, \textit{\$rhs}\leftarrow c^{\prime }.n]\). We bind \(\$lhs\) to the current node \(c.n\), try for \(\$rhs\) every node in the current page, and return those for which the axis expression is satisfied. Thus, such an expression solves the query from the beginning of this section: It returns all nodes that follow the current node in document order and are displayed in the same font size. The expression uses OXPath’s subset operator to test whether \(\$rhs\) is among the nodes following \(\$lhs\) (see Sect. 4).
Definition 1
An intensional step [\(\phi \)]::n\(\psi \) consists of an intensional axis [\(\phi \)], a node test \(n\), an arbitrary number of predicates \(\psi \), and an arbitrary OXPath expression \(\phi \) that may use the reserved variables \(\$lhs\) and \(\$rhs\). An intensional axis [\(\phi \)] returns, for a context node \(c\), all nodes \(m\) such that \(\left[\![\, \phi \,\right]\!]_{\mathsf{B}}\) is true if \(\$lhs\) is bound to \(c\) and \(\$rhs\) to \(m\).
It should be noted that an intensional axis that does not refer to \(\$lhs\) or the context node relates all context nodes to the same bindings for \(\$rhs\) (since \(\left[\![\, \phi \,\right]\!]_{\mathsf{B}}\) does not depend on \(\$lhs\)). If it does not refer to \(\$rhs\), it acts as a filter on the context nodes, but each context node that passes the filter is related to all nodes in the DOM. If neither is referenced, the axis yields either \(\emptyset \) or the set of all pairs of DOM nodes, depending on whether \(\left[\![\, \phi \,\right]\!]_{\mathsf{B}}\) holds (which is then absolute and independent of the context node).
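Definition 1 and these degenerate cases can be illustrated with a small Python sketch, in which a hypothetical boolean oracle `phi(lhs, rhs)` stands in for \(\left[\![\, \phi \,\right]\!]_{\mathsf{B}}\); this is an illustration of the semantics, not the PAAT evaluation of intensional axes.

```python
# Sketch of Definition 1: an intensional axis returns all m for which
# phi holds with $lhs bound to the context node and $rhs bound to m.
def intensional_axis(phi, context_node, dom_nodes):
    return {m for m in dom_nodes if phi(context_node, m)}

dom = {1, 2, 3, 4}

# phi ignoring $lhs: every context node gets the same $rhs bindings
same = all(intensional_axis(lambda l, r: r % 2 == 0, c, dom) == {2, 4}
           for c in dom)

# phi ignoring $rhs: acts as a filter on the context nodes
passes = intensional_axis(lambda l, r: l == 1, 1, dom)  # related to all of dom
fails = intensional_axis(lambda l, r: l == 1, 2, dom)   # related to no node
```

The three results correspond exactly to the three cases discussed above.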
Value semantics for intensional axes
Intensional axis examples
Examples. Table 4 lists five applications of intensional axes in OXPath. Expression (1) selects all nodes that have the same font color and size as an a. In (2), we select all divs that are rendered northwest of an a. Expression (3) selects all books having a common author with another book that they cite, that is, it selects books containing self-citations. Example (4) shows a nested intensional axis: It selects all em children of elements that have the same font family as a div and that are to the north of an a of that div with the same value. This case requires a nested intensional axis, as the div is related to the em by more than one relation, but so is the a to the em’s parent. Expression (5) selects all elements that are visually contained in some div.
4.8 OXPath properties
As discussed in Sect. 4.1, OXPath is designed to avoid buffering many pages or result tuples at the same time. This design goal is expressed in two formal properties:
OXPath avoids sorting context sets that contain nodes from different pages, since it is unclear how to order nodes from different pages without first retrieving (and thus buffering) those pages.
Proposition 1
(No Node Sorting across Pages) The evaluation of an OXPath expression never requires sorting context sets which contain nodes from different pages.
Proof
OXPath requires sorting only for positional qualifiers in Rule N5, where the function rewrite\(_\pm (q,C,c^{\prime })\) sorts the tuples in the context set \(C\) and determines the position of \(c^{\prime }\) within \(C\). Thus, it suffices to show that \(C\) in Rule N5 never contains nodes from different pages.
This holds true, since in Rule N5, \(C=\left[\![\, (i)step_\pm \,\right]\!]_{\mathsf{V}}(c)\) is computed from a single axis navigation \(axis::nodes\) (N3), followed by a sequence of (positional) qualifiers (N4 and N5) and markers (N8). Our language restriction (4) from Sect. 4 ensures that actions cannot occur in \((i)step_\pm \), as they are disallowed in positionally qualified steps. Since Rule N3 always results in a context set with nodes from the same page, and since Rules N4, N5, and N8 can only remove nodes from the context set, \(C\) must contain nodes from a single page only. \(\square \)
OXPath’s semantics does not require any further processing on result tuples and hence allows them to be streamed out as they are extracted. Extracted tuples are never modified, deleted, or accessed again.
Proposition 2
(No Output Buffer) The evaluation of OXPath expressions requires no output buffer.
Proof
Only Rules E3 and E4 in Table 2 create output tuples. We visit each input node \(c.n\) at most once with extraction marker \(M\), and thus, each created output tuple is unique (and is never overridden or altered afterwards). The output tuples are immediately added to the output relation \(O\), regardless of whether the current expression eventually evaluates to true or not. Also, when the output tuples are created, the parent output nodes are known by construction, and thus no buffering is needed to obtain all tuple values. All other rules in the extraction semantics (as in E1 and E2) only collect the tuples returned by their subexpressions. Since no duplicate tuples are created, this requires no buffering. \(\square \)
More intuitively, this holds as the structure of the output tree reflects the structure of the OXPath expression and parent nodes are therefore always created before their children nodes. Also, tuples are extracted immediately upon creation, regardless of whether the current subexpression matches or not. For example, consider expr[p\(_1\)][p\(_2\)]. If \(p_1\) contains extraction markers, the extracted tuples are returned whether \(p_2\) matches or not.
Requiring that tuples extracted by \(p_1\) are returned only if \(p_2\) matches would require an unbounded buffer, as the visited pages and the results extracted for \(p_1\) are both unbounded. Furthermore, we can achieve the same effect in the existing OXPath semantics at the cost of an increased query size: We can rewrite the above expression into expr[p\(_2^{\prime }\)][p\(_1\)][p\(_2\)] where \(p_2^{\prime }\) is obtained from \(p_2\) by removing all extraction markers. Then \(p_2^{\prime }\) matches if and only if \(p_2\) matches, as extraction markers do not affect matching, and tuples extracted by \(p_1\) are only returned if \(p_2\) matches.
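This rewriting can be sketched on a purely textual toy representation of OXPath expressions; the helper names below are ours, and the regex only covers the marker syntax :<...> used in this section.

```python
import re

# Sketch: rewrite expr[p1][p2] into expr[p2'][p1][p2], where p2' is p2
# with all extraction markers :<...> removed. Textual toy only.
def strip_markers(p):
    """Derive p' from p by removing all extraction markers :<...>."""
    return re.sub(r":<[^>]*>", "", p)

def guard_extraction(expr, p1, p2):
    """Tuples from p1 are then extracted only if p2 matches,
    since the marker-free p2' filters the context first."""
    return f"{expr}[{strip_markers(p2)}][{p1}][{p2}]"

rewritten = guard_extraction("//div", ".//a:<L=string(.)>", ".//b:<S>")
```

Since markers never affect the value semantics (Fact 3 below in the text), `strip_markers(p2)` matches exactly where `p2` matches.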
4.9 Normalizing OXPath
We can reduce the size of the necessary memoization tables by a factor of \(n\), without restricting the language, at the potential expense of longer queries, by rewriting general OXPath into normalized OXPath, a fragment denoted OXPath\(_{\mathsf{{norm}}}\).
We introduce two normalization properties for OXPath expressions: Property (A) does not allow any extraction markers within Kleene star expressions, and Property (B) disallows any two extraction markers on the same expression branch. The latter property means that all extraction markers after any given marker must be nested within predicates. For example, Property (B) is violated by a:\(<\)R\(>\)b:\(<\)S\(>\) but satisfied by a:\(<\)R\(>\)[b:\(<\)S\(>\)]. In this section, we do not distinguish between record markers, such as :\(<\)R\(>\), and attribute extraction markers of the form :\(<\)R=\(\dots >\).
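Both properties can be checked mechanically. The following sketch uses a toy AST of our own design (an expression is a list of steps; a step is a dict with optional `marker`, `star`, and `predicates` entries), not the actual OXPath parser.

```python
# Sketch: checking normalization Properties (A) and (B) on a toy AST.
def has_marker(expr):
    """True if any step of expr carries a marker, at any nesting depth."""
    return any(s.get("marker")
               or any(map(has_marker, s.get("predicates", [])))
               or (s.get("star") and has_marker(s["star"]))
               for s in expr)

def property_a(expr):
    """(A): no extraction markers inside Kleene star expressions."""
    for s in expr:
        if s.get("star") and has_marker(s["star"]):
            return False
        if not all(map(property_a, s.get("predicates", []))):
            return False
    return True

def property_b(expr):
    """(B): markers after a given marker must be nested in predicates."""
    seen = False
    for s in expr:
        if not all(map(property_b, s.get("predicates", []))):
            return False
        if s.get("marker"):
            if seen:
                return False
            seen = True
    return True

ab = [{"marker": "R"}, {"marker": "S"}]                      # a:<R>b:<S>
a_b = [{"marker": "R", "predicates": [[{"marker": "S"}]]}]   # a:<R>[b:<S>]
```

The two sample expressions reproduce the violating and the satisfying case from the text.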
To normalize general OXPath expressions, we apply two rewriting steps. The first rewriting, shown in Theorem 2, produces expressions that meet Property (A). When we cannot apply Theorem 2 anymore, we rewrite the obtained expression following Theorem 3, again until inapplicable, to meet Property (B) as well.
The proof of Theorems 2 and 3 relies on the loose coupling of value and extraction semantics in OXPath: The extraction semantics does not influence the value semantics at all, as stated in Fact 3. On the other hand, the extraction semantics depends on the value semantics, as the value semantics determines which nodes to extract. But extractions take place immediately, independently of whether the tailing expression matches any values or not, as stated in Fact 4.
Fact 3
(Extraction agnostic value semantics) The introduction or removal of extraction markers does not affect the value semantics of an OXPath expression.
Fact 4
(Tail agnostic extraction) When extraction marker \(<\)R\(>\) in a:\(<\)R\(>\)b is evaluated after having matched \(a\), the corresponding pairs \(\left<n,R\right>\) are extracted immediately, regardless of whether \(b\) matches subsequently.
4.9.1 Extraction-free Kleene star
For the next theorem, we rely on OXPath not being short-circuited: For example, the evaluation of [a or b] cannot be aborted once \(a\) evaluates to true, since \(b\) may contain extraction markers which must be matched, even though they do not affect the value semantics.
Theorem 2
Proof
First, we show the claim for the value semantics and subsequently for the extraction semantics. Using \(\equiv _V\) to denote equivalence with respect to value semantics, we obtain p* \(\equiv _V \)q* \(\equiv _V \)self[?x]q* \(\equiv _V \)self[?q*p]q*. Therein, Step (1) holds by Fact 3, (2) holds for every expression \(x\), since optional predicates are not required to evaluate to true (in fact, they have been introduced for conditional extraction), and (3) instantiates \(x\) with q*p. Similarly, we have p* \(\equiv _V \)self | p*p \(\equiv _V \)self | q*p, where Step (1) unrolls the Kleene star expression, and (2) holds again by Fact 3. Taken together, these identities show the claim for value semantics, that is, for \(\equiv _V\).
yielding the sought-for equivalences after Steps (2) and (4). Step (1) unrolls the first iteration of the Kleene star. In (2), we replace an instance of \(p\) with \(q\), and hence we need to show that every pair extracted by an instance of \(p\) in p*p is also extracted by q*p. But this is the case: If p*p extracts some pair, then there must exist a minimal \(i\ge 0\) such that p\(^{i}\)p extracts this pair. Because of Fact 4, we only consider the prefix leading to the extraction, while we ignore the subsequent expressions to be matched. Since \(i\) is minimal, the extraction does not occur within p\(^i\) but in the tailing \(p\), and therefore, q\(^i\)p produces the same pair. Hence, q*p does so as well, proving the soundness of this step. Step (3) holds for any \(y\) without extraction markers: All pairs extracted by q*p are also extracted by the conditional predicate self[?q*p], regardless of the tailing \(y\). On the other hand, since \(y\) does not contain extraction markers, self[?q*p]y cannot extract more than q*p. Step (4) instantiates \(y\) with q*, which is valid since \(q\) contains no extraction markers. \(\square \)
Both rewriting options in Theorem 2 have exponential upper bounds: If we rewrite hp*t with ht | hq*pt, we need to duplicate the entire expression, with additional occurrences of \(h\), \(t\), and \(p\) (in terms of \(q\)). On the other hand, if we use h[?q*p]q*t, we triplicate \(p\) with two additional copies of \(q\). In our experience, if extraction markers occur in Kleene star expressions, then the Kleene-starred expressions are rather short, that is, the second option is usually the better choice. Finally, if the Kleene star must match at least once (e.g., when using Kleene + instead of *), then we can use an even more efficient rewriting:
Corollary 1
(Rewriting for Kleene \(+\)) Let \(e\) be an OXPath expression hp+t, with \(h, p\), and \(t\) arbitrary. Then the expression \(e\) is rewritable with hp+t\(\equiv \)hq*pt where \(q\) denotes the expression derived from \(p\) by removing all extraction markers.
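Corollary 1 can likewise be sketched on a textual toy representation of OXPath, where \(q\) is derived from \(p\) by stripping the markers :<...> with a regex; the helper names are ours.

```python
import re

# Sketch of Corollary 1: rewrite h p+ t into h q* p t, where q is p
# with all extraction markers :<...> removed. Textual toy only.
def rewrite_plus(h, p, t):
    q = re.sub(r":<[^>]*>", "", p)
    return f"{h}({q})*{p}{t}"

e = rewrite_plus("//div/", "(a:<R>/b)", "/c")
```

Since p+ guarantees at least one iteration, the single trailing marker-carrying \(p\) suffices, avoiding the duplication needed for p*.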
In general, all these rewritings are exponential in the expression size. However, the overhead introduced by rewriting Kleene star expressions with bounded Kleene nesting depth is polynomial. This bound is practically relevant, as we never encountered natural OXPath expressions with a nesting depth larger than 2.
4.9.2 Sibling-free extraction
Theorem 3
Proof
where Step (1) holds because of Fact 3, and Step (2) holds since [self:\(<\)R\(>\)] evaluates to true under value semantics for every node.
For the extraction semantics, a:\(<\)R\(>\) and a[self:\(<\)R\(>\)] must extract the same pairs, since \(<\)R\(>\) is applied to the same node sets, as a and a/self select the same nodes. Again, because of Fact 4, the respective tail expressions are irrelevant. \(\square \)
4.10 Complexity
Considering the complexity of OXPath, we note that expressions containing Kleene-starred actions may require access to an unbounded number of pages. In particular, when we evaluate such an expression, we do not know in advance whether the evaluation terminates or how many pages are accessed during evaluation. Thus, when we discuss the complexity of evaluating OXPath, we only consider expressions whose evaluation terminates and consider all accessed pages as input. Furthermore, we assume that executing an action takes constant time, as most pages execute their actions quickly.
Theorem 4
(Complexity) OXPath evaluation without multiplication and string concatenation is in NLogSpace for data complexity. OXPath evaluation is PTime-complete for combined complexity.
Proof
We show the theorem statements separately, starting with data complexity: Of all extensions over XPath, only the Kleene star causes an increase in complexity: Actions are assumed to take constant time, extraction markers do not require additional memory as they are streamed out, and the additional axes do not introduce further complexity. XPath 1.0 without string concatenation and multiplication has LogSpace data complexity [13]. Each Kleene star expression can be realized as the transitive, reflexive closure of the Kleene-starred expression; therefore, we arrive at NLogSpace data complexity for OXPath without string concatenation and multiplication.
Combined complexity: PTime-hardness follows immediately from the PTime-hardness of XPath query evaluation [13]. To evaluate an OXPath query, we process the query left to right and decompose it recursively. Since, as we show next, evaluating each subexpression requires at most polynomial time, the overall evaluation runs in polynomial time as well.
For XPath subexpressions, we rely on one of the known polynomial time algorithms for XPath [26], which can be easily extended with style, intensional axes, and the other selection-only features added in OXPath. If the expression is an extraction marker, we stream out the extracted tuple, which also takes at most polynomial time, while actions are assumed to take constant time.
The only remaining case is the Kleene star: If the Kleene-starred expression contains a non-nested action, we know that each iteration of the repeated expression leads to a new page. Consequently, there are at most as many iterations as the input size. If the Kleene star does not contain non-nested actions, we evaluate it like an ordinary Regular XPath expression, leading to a polynomial time algorithm [35]: If a predicate within the expression contains an action, we can evaluate this predicate in polynomial time, and since we need to evaluate this predicate for at most a polynomial number of context tuples, the theorem statement follows. \(\square \)
5 Page-at-a-time evaluation
5.1 Algorithmic design goals
Starting with a standard XPath evaluation with memoization [26], only two of OXPath’s extensions demand significant additional treatment, leading to the following three design goals: (1) Actions visit different pages, and multiple actions on the same page yield branches in the page tree (see Sect. 5.4). Unfortunately, if the same page is fetched multiple times, we may obtain different results, for example, if the underlying data have changed or the page contains time-sensitive information. Thus, we need to buffer such pages, but at the same time, we need to minimize the number of necessary page buffers without reloading pages. (2) With extraction markers, we can return multiple, possibly related data items, requiring the evaluation to collect these items. To scale well with large-scale extraction tasks, we need to efficiently propagate matches on extraction markers and their relations. Besides these two goals, we also need to (3) maintain the polynomial evaluation of XPath, catering for the other extensions of OXPath efficiently.
To address (1), our page-at-a-time algorithm traverses the page tree in a depth-first manner without retaining information on pages not visited again. However, a naive depth-first traversal of the DOM nodes within individual pages would cause an exponential worst-case runtime in violation of (3), necessitating memoization of intermediate results. To address (2), our algorithms stream out extraction matches, requiring no buffering at all.
Memoization. As an essential design goal, we need to prevent multiple evaluations of the same expression with the same context tuple. For example, while evaluating //p//a[...] in our example, there are two a nodes (7 and 10) that are descendants of multiple p nodes. While a naive implementation processes such a nodes multiple times, we avoid this overhead by inserting memoization at two strategic positions: We (1) encapsulate the evaluation of simple expressions into eval_, extending the memoization-based XPath evaluation from [26], and (2) additionally memoize the outcome of non-simple predicates (in eval). Only recursion branches starting at these points can possibly process the same node and expression more than once. Thus, it suffices to memoize at these points (see Sect. 5.7 for details).
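The two memoization points can be sketched as follows; `axis_step` and the tiny successor function are stand-ins of our own for one real evaluation step, not Algorithm 1.

```python
# Sketch of the two memoization points: a result table for simple
# expressions and a boolean table for predicate outcomes.
lookup = {}          # (expr, context) -> node set, as for eval_
lookup_exists = {}   # (predicate, context) -> bool, as for eval
calls = 0

def axis_step(expr, ctx):
    """Stand-in for one real evaluation of a simple expression."""
    return frozenset({ctx + 1})

def eval_simple(expr, ctx):
    """Memoized evaluation: real work happens once per (expr, ctx)."""
    global calls
    key = (expr, ctx)
    if key not in lookup:
        calls += 1
        lookup[key] = axis_step(expr, ctx)
    return lookup[key]

def holds(pred, ctx):
    """Memoized predicate outcome: store only true or false."""
    key = (pred, ctx)
    if key not in lookup_exists:
        lookup_exists[key] = bool(eval_simple(pred, ctx))
    return lookup_exists[key]

# repeated contexts, e.g. a nodes that are descendants of multiple p nodes
for ctx in (1, 2, 1, 2, 1):
    eval_simple("child::a", ctx)
ok = holds("child::a", 1)
```

Despite five invocations over two distinct contexts, the stand-in evaluation runs only twice.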
Page Management. To keep its resource consumption low, PAAT minimizes the number of simultaneously retained pages and frees a page as soon as possible, either explicitly or implicitly by replacing it with a new page. To decide whether we can replace a page with a new one, eval recursively maintains a flag Free which is set to true if a page is not required anymore by the caller—and thus can be overridden or freed. In our example, we recursively visit a new page for each of the nodes 4, 7, 10, and 12. In the first three of these recursive calls, Free is set to false, since we still need the current page to follow the last link to another page. Only in case of the last link, Free is set to true. When a page is removed from memory, both the browser DOM and the corresponding entries in the lookup tables are freed.
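The Free-flag discipline for this example can be sketched with a hypothetical `Pages` buffer manager in place of the actual browser integration; page names and methods are ours.

```python
# Sketch of the Free-flag discipline: the buffer of the current page is
# reused only when following the last outgoing link.
class Pages:
    def __init__(self):
        self.loaded = []      # currently buffered pages
        self.max_loaded = 0   # high-water mark of simultaneous buffers

    def get_page(self, url, old, free):
        if free and old in self.loaded:
            self.loaded.remove(old)   # reuse the buffer of `old`
        self.loaded.append(url)
        self.max_loaded = max(self.max_loaded, len(self.loaded))
        return url

    def free_page(self, url):
        self.loaded.remove(url)

pages = Pages()
current = pages.get_page("p0", None, True)   # load the start page
for i, url in enumerate(["p4", "p7", "p10", "p12"]):
    free = i == 3                    # Free=true only for the last link
    child = pages.get_page(url, current, free)
    # ... depth-first descent into `child` happens here ...
    pages.free_page(child)           # a visited page is always freed
```

At most two pages are ever buffered simultaneously, and no page remains buffered at the end.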
5.2 Context tuples
eval and eval_ take as input a (full or simple) OXPath expression and two sets ICtx and Ctx of context tuples, both of the same type (out of two possibilities): (1) XPath context tuples \(c_x = \left<c_x.n_x, c_x.p_x \right>\) depend on their context set to evaluate position and size and consist of the context node and its parent context node, see [26]. (2) Extraction context tuples \(c_e =\left<c_e.c_x,c_e.p,c_e.l\right>\) consist of one XPath context tuple and the ids of the last parent and sibling extraction match, reminiscent of the context tuples in the semantics, allowing subsequent extraction matches to be nested according to the predicate nesting (Sect. 4.4).
As remarked above, XPath context tuples are always relative to some context set, which determines the position of the tuple within the set. Since we want to evaluate only part of Ctx, we maintain those tuples in \(\mathsf{{ICtx}}\subseteq \mathsf{{Ctx}}\) and use Ctx only for determining the position of a tuple. For efficiency, we maintain several context sets in the same program variable. Hence, we select all tuples with the same parent context node (and the same parent and sibling extraction matches) to obtain the restricted tuple set \(\mathsf{{Ctx}}_{c_x}\) (\(\mathsf{{Ctx}}_{c_e}\)) containing only tuples from the (proper) context of \(c_x\). Furthermore, we write \(\mathsf{Ctx }_{c_e,c_e^{\prime }}=\{\left<c_e^{\prime \prime }.c_x, c_e^{\prime }.p, c_e^{\prime }.l\right> \mid c_e^{\prime \prime } \in \mathsf{{Ctx}}_{c_e}\}\) to adapt parent and sibling match in a restricted context set.
5.3 PAAT simple evaluation
The first procedure, evalT_ (Algorithm 1), evaluates a simple expression Expr on a context tuple \(c_x\), belonging to a context set Ctx. As a simple expression, Expr is free of actions and extraction markers, but may use other OXPath features such as Kleene stars. For brevity, Algorithm 1 only deals with the most important expressions, that is, axis navigation, Kleene stars, and predicates. The omitted parts (mostly functions and operators) do not affect the algorithm design and can be added analogously to predicates. In our design of evalT_, we are inspired by the polynomial time XPath evaluation algorithms from [26]: evalT_ implements a dynamic programming approach, maintaining a memoization table Lookup, which maps subexpressions and context tuples to intermediate results.
(1) For axis navigation (Line 6), we obtain \(\mathsf{{ICtx}}=\mathsf{Ctx}^{\prime }\) via \(\mathsf{{axis}}\) and \(unary_{\mathsf{{nodes}}}\), adjusting the parent node to \(c_x.n_x \).
(2) In case of Kleene star expressions without actions or extraction markers (Line 9), each successive iteration might return nodes already reached through prior iterations. Thus, in each of the \(w\) applications of path, we avoid traversing paths starting at already analyzed context tuples. More specifically, we collect in OCtx the tuples reached by the compound Kleene star expression and maintain in ICtx the tuples still to be explored. Inside the loop, once we have reached the lower bound \(v\), we remove from ICtx all tuples already collected in OCtx (as they would be redundant) and take all new tuples into OCtx (Line 12). If there are no new tuples left, we break the loop (Line 14); otherwise, we evaluate path once (Line 15) to complete the iteration. Finally, we add OCtx to ICtx and set \(c_x.n_x \) as parent context in all resulting tuples (Line 16). The latter groups the tuples in the context ICtx, as ICtx might become part of a larger context set subsequently: This ensures, for example, that in an expression \(\psi \)/(\(\phi \)){2,3}[\(i\)], for each node \(n\) matching \(\psi \), the \(i\)th node reached via \(\phi \) from \(n\) is returned, instead of the single \(i\)th node among all nodes reached via \(\psi \)/(\(\phi \)){2,3}.
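The fixpoint loop of this case can be sketched on plain node sets; `step` is a stand-in for one application of path, and the real algorithm tracks full context tuples rather than bare nodes.

```python
# Sketch of the action-free Kleene star loop: collect reached nodes in
# octx, keep unexplored ones in ictx, and stop at the fixpoint.
def kleene(step, ctx, lower=0, upper=None):
    octx, ictx, w = set(), set(ctx), 0
    while upper is None or w <= upper:
        if w >= lower:
            ictx -= octx        # drop already-collected tuples
            octx |= ictx        # take the new ones into the result
            if not ictx:
                break           # fixpoint: nothing new reachable
        ictx = {n for c in ictx for n in step(c)}
        w += 1
    return octx

# a tiny cyclic graph: the loop must terminate despite the cycle
graph = {1: {2}, 2: {3}, 3: {1}}
reached = kleene(lambda n: graph[n], {1})          # full closure
bounded = kleene(lambda n: graph[n], {1}, 0, 1)    # at most one step
```

On the cyclic toy graph, the unbounded star terminates with all three nodes, while the bounded variant stops after one iteration.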
(3) We deal with predicate expressions \([q]\) (Line 18) by recursively evaluating \(q\) with evalT_, since \(q\) must be simple itself. If the evaluation returns a non-empty set, we keep \(c_x\) in ICtx, and likewise, determine \(\mathsf{{Ctx}}^{\prime }\) as the subset of Ctx whose nodes satisfy \(q\) (Line 20).
Algorithm 2 iterates over a set of context tuples ICtx and calls evalT_ for each tuple. It is split into two parts: The first part (Lines 33–66) covers the case that ICtx contains extraction context tuples, the second part the case of XPath context tuples. In the former case, we strip the extraction matches from the tuples to reduce the space for Lookup in evalT_, see Sect. 5.7. Either way, we call evalT_ for each tuple in ICtx, providing the restriction \(\mathsf{{Ctx}}_{c_x}\) (or \(\mathsf{{Ctx}}_{c_e}\)) of Ctx to the proper context set of \(c_x\) (or \(c_e\)). In case of extraction tuples, the algorithm reattaches the parent and sibling matches to the returned nodes in Line 6.
5.4 PAAT full evaluation
In Algorithm 3, we show eval for handling full OXPath. Building upon eval_, eval deals with actions and extraction markers, as well as expressions that contain actions and extraction markers as subexpressions, that is, Kleene stars and predicates. Next to exploiting the memoization of eval_, eval also memoizes the outcome of predicate evaluations in its own lookup table \(\mathtt{{Lookup}}_{\exists }\). As in evalT_, we omit functions and operators for clarity; they are handled analogously to predicates.
Performing the actions in the expression, eval traverses the page tree in a depth-first manner. For Kleene stars without actions on the main path, eval works similarly to evalT_, since all resulting nodes are on the same page, even if actions within the predicates of the Kleene star expressions may navigate to other pages. While the input context set Ctx only contains nodes from a single page, the result set Result may contain nodes from many pages.
(1) The first main case deals with actions (Line 7), covering both absolute actions \(\{\textit{action}\}\) and contextual actions \(\{\textit{action}/\}\). Roughly, we iterate over all \(c_e\) in ICtx, obtaining for each \(c_e\) a new context set ICtx\(^{\prime }\) with a single tuple. That tuple is either the root of the page returned by the action applied to \(c_e\) or the result of evaluating the action-free prefix on that root. Either way, the parent and sibling extraction matches are copied from \(c_e\). In getPage, we free the page to perform the action upon, if the input flag Free is set and \(c_e\) is the last in the iteration (Line 9). Upon freeing a page, all memoization information in Lookup and Lookup\(_{\exists }\) related to this page is freed, too. If the action is contextual (Line 11) and did not alter the page, we stay at \(c_e\) to avoid evaluating the action-free prefix afp\((\textit{action},c_e.c_x)\) (Line 12), as done otherwise. Either way, we evaluate \(t\) recursively on \(\mathsf{{ICtx}}^{\prime }\), descending one step further in the depth-first traversal of the page tree (Line 14). We set Free to true, since the page and all related memoized information are freed after the invocation in any case (Line 15).
(2) For extraction markers (Line 16), first the value to be extracted is evaluated with evalT_, as extraction values are always computed from simple expressions (Line 19), and the obtained result tuple is written to the output (Line 20). Finally, we add a tuple to \(\mathsf{{ICtx}}^{\prime }\) that is identical to \(c_e\) up to the sibling match, which is set to out\((c_e.c_x.n_x,m)\). The tail is then recursively evaluated with the new context set \(\mathsf{{ICtx}}^{\prime }\).
(3) Kleene stars containing actions are treated in two cases: (i) If a Kleene star contains an action on its main path, we expand the expression (Line 27) and recursively evaluate the expanded expression (Line 28). Once the expansion has reached the lower bound, that is, once \(v\) has become 0, we also collect results by evaluating the tail \(t\) (Line 26). The results for the final iteration are collected separately (Line 29). By evaluating the tail \(t\) at each individual recursion step, we avoid context sets with nodes from different pages and nevertheless evaluate the same expression only once for the same context, as each iteration yields nodes from different pages. (ii) Otherwise, without actions on the main path (Line 30), different iterations can never re-reach a node already processed, and hence, we use a similar strategy as in evalT_.
(4) It remains to address predicates containing actions or extraction markers (Line 40). Here, we need to evaluate the contained expression \(q\) for every \(c_e\) in ICtx to test whether it yields \(\emptyset \). Since \(q\) contains an action or extraction marker, we need to use eval. Doing so without memoization would lead to an exponential runtime, due to expressions such as //a[.//b[{click/}][.//c[{click/}][\(\ldots \)]]], requiring another lookup table \(\mathtt{{Lookup}}_{\exists }\). Here, we need to memoize an entry per extraction context tuple and expression; as result, however, we only store true or false. We construct the extraction context tuple \(c_e^{\prime }\) for evaluating the predicate by taking the sibling extraction match as new parent match (Line 44). Then \(q\) is evaluated over \(\{c_e^{\prime }\}\) with \(\mathsf{{Ctx}}_{c_e,c_e^{\prime }}\) (see Sect. 5.2) as new context set (Line 46). It remains to evaluate the tail \(t\) on the filtered context \(\mathsf{{ICtx}}^{\prime }\) (Line 50).
5.5 PAAT example
With the algorithms at hand, we now revisit the example shown in Figs. 9 and 10 to discuss its processing in detail.
1—Navigation. PAAT starts with \(\mathsf{{eval}}(\mathsf{Expr},\mathsf{Ctx},\mathsf{Ctx},\mathbf{true})\) where \(\mathsf{{Ctx}} = \{\left<\left<\bot ,\bot \right>,\mathsf{{results}},\mathsf{{results}}\right>\}\). Expr is split into \(h=\epsilon \), \(e=\)doc(uri), and \(t=\)descendant-or-self::node()..., see Line 4 of Algorithm 3. As \(h\) is empty, the call to eval_ in Line 3 returns the unchanged context tuples, continuing with \(e=\)doc(u), an absolute action to load the new page (Line 10). Thereby, we set Free\(^{\prime }=\)Free\(=\)true, since there is only a single tuple in the context set. On loading \(u\), getPage returns \(0\), the root of the page tree in Fig. 9. The context tuple \(\left<\left<0, \bot \right>, \mathsf{{results}}, \mathsf{{results}}\right>\) is used for processing the tail recursively (Line 14).
In the recursive invocation on \(0\) (yielding the second box in Fig. 10), the former tail \(t\) becomes Expr and is split into \(h =\)/descendant-or-self::node()/child::div, \(e =\):\(<\)R\(>\), and the rest \(t\). First, eval evaluates \(h\) with eval_, which calls evalT_ for each single context node. Since evalT_ has no memoized data available for \(h\) (Line 3), it splits Expr into descendant-or-self::node() and child::div (Line 7). Evaluating the first expression, the context is expanded in Line 7 to all nodes in the page, and child::div is evaluated on these nodes with eval_, which calls evalT_ once for each node. In Fig. 10, we summarize these calls with three boxes: Starting at 0, descendant-or-self::node() yields a summary box for the DOM nodes 1...12 (albeit there is one call for each node) and a box for 0. Finally, child::div leads from 0 to 1.
Thus, eval_ returns \(\left<\left<1,\bot \right>,\mathsf{{results}},\mathsf{{results}}\right>\) to eval to continue with the evaluation of :\(<\)R\(>\) (Line 16). For that context, \(\left<{out}(1,R),\mathsf{{results}},R,\mathsf{{null}}\right>\) is written to the output by the evaluation of :\(<\)R\(>\) (Line 20). In Fig. 9, the new tuple is shown as a (square) node R.
eval continues recursively with the rest of the expression using the context tuple \(\left<\left<1,\bot \right>, \mathsf{{results}},{out}(1,R)\right>\), carrying the fresh output node \({out}(1,R)\) as new sibling marker.
The rest of the expression is split again, this time into the simple expression \(h\) of four XPath steps between :\(<\)R\(>\) and the predicate, the predicate (as \(e\)), and an empty tail. The evaluation of the simple expression is again delegated to evalT_ (via eval_): 1 has all nodes except \(\bot \) as descendants, but the only nodes with p children are 1, 5, and 8; all others subsequently return empty results. The evaluation continues in a depth-first manner, evaluating the leftmost branches first. Figure 10 shows how we first find all a descendants of p descendants of 1, memoizing the results at every step. When we later search for such nodes for 5 and 8, we find that all matching nodes have already been evaluated and the corresponding results memoized.
2—Predicate Evaluation. The evaluation of the predicate in eval starts with the context set containing one context tuple each for 4, 7, 10, and 12. For example, with 4, we get the tuple \(\left<\left<4,2\right>,\mathsf{{results}},{out}(1,R)\right>\). To evaluate \(e=\) [{click/}/...], eval changes the tuple to \(\left<\left<4,2\right>, {out}(1,R), {out}(1,R)\right>\), such that the last sibling match \({out}(1,R)\) becomes the parent in the recursive call to eval (Line 46). Thus, all tuples extracted during the predicate evaluation (i.e., within this branch of the call tree) are descendants of \({out}(1,R)\).
The recursive calls to eval split Expr into \(e=\) [{click/}] and \(t=\) /desc-or-self::node()/title:\(<\)t=string(.)\(>\). In the first three of these four calls, Free is set to false since the page is still needed, but in the last invocation, Free is set to true such that the page buffer of the current page can be reused. Accordingly, the action in \(e\) opens the linked page with getPage (Line 10) into either a new or the current page buffer, overwriting the old page. In either case, the context now refers to the root node of the new page. eval evaluates \(t\) recursively (Line 14), setting Free to true: in any case, the newly loaded page will be freed after this invocation anyway (Line 15).
The subsequent recursive invocations on the new page navigate to the title node for value extraction. Hence, with \(i\) ranging over the title nodes, evalT_ outputs the tuple \(\left<{out}(i,t),{out}(1,R),t,\mathsf{{val}}_i\right>\) (diamond in Fig. 9) as a child of \({out}(1,R)\), associated with its textual content \(\mathsf{{val}}_i\) (Line 20).
5.6 Intensional axes
So far, we have not considered the evaluation of intensional axes. As we will show in Theorem 8, we could assume that they are precomputed on page load. In practice, however, it is usually preferable to delay that computation and to use memoization, requiring a small modification to the PAAT algorithm: First, we must explicitly manage the environment containing the variable bindings. While technically necessary already for plain XPath, we omitted the variable bindings in the algorithms for OXPath without intensional axes for clarity, as these bindings are handed through all recursive invocations unaltered. Second, in Line 7 of evalT_, we can no longer assume that the axis is precomputed, but rather need to determine all nodes related to the current context node by calling eval with the expression defining the intensional axis and an updated environment. If the result is not empty, the node is related to the context node and added to Ctx\(^{\prime }\). To avoid multiple evaluations of the same intensional axis, we guard this evaluation with a memoization table similar to Lookup\(_{\exists }\), but with an additional entry for the related node. In Sect. 6, we show that, implemented in this way, intensional axes have little impact on the practical performance of OXPath.
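The delayed, memoized strategy can be sketched as follows. This is a hypothetical simplification: `axis_expr` stands in for evaluating the axis-defining expression with \(\$lhs\) and \(\$rhs\) bound, a non-empty (truthy) result meaning the pair of nodes is related; the memo dictionary plays the role of the extended Lookup\(_{\exists}\)-style table:

```python
# Sketch of lazy, memoized intensional-axis evaluation (hypothetical names,
# not the actual OXPath API).
class IntensionalAxis:
    def __init__(self, axis_expr):
        self.axis_expr = axis_expr
        self.memo = {}   # (context node, candidate) -> bool
        self.evals = 0   # counts actual evaluations of the defining expression

    def related(self, context_node, candidates):
        """Filter candidates to those related to context_node by the axis."""
        out = []
        for c in candidates:
            key = (context_node, c)
            if key not in self.memo:       # evaluate only on first request
                self.evals += 1
                self.memo[key] = bool(self.axis_expr(context_node, c))
            if self.memo[key]:
                out.append(c)
        return out

# Hypothetical axis relating nodes of equal text length:
ax = IntensionalAxis(lambda lhs, rhs: len(lhs) == len(rhs))
```

For example, `ax.related("ab", ["cd", "x", "ef"])` returns `["cd", "ef"]`, and repeating the call performs no further evaluations of the defining expression.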
5.7 Analysis of PAAT
Complexity of the OXPath family (Extraction: normalized/any; Actions: absolute/contextual; Kleene: bounded/any)

| Norm. | Any | Abs. | Context. | Bounded | Any | Time | Space | Page buffers | Language |
|---|---|---|---|---|---|---|---|---|---|
| – | – | – | – | – | – | \(O(q^2 \cdot n^4)\) | \(O(q^2 \cdot n^3)\) | \(O(1)\) | XPath + style, ... |
| – | – | – | – | X | (X) | \(O(q^2 \cdot n^4)\) | \(O(q^2 \cdot n^3)\) | \(O(1)\) | Simple |
| X | – | – | – | – | – | \(O(q^2 \cdot n^4)\) | \(O(q^2 \cdot n^3)\) | \(O(1)\) | |
| X | X | – | – | – | – | \(O(q^2 \cdot n^4)\) | \(O(q^2 \cdot n^4)\) | \(O(1)\) | |
| – | – | X | (X) | – | – | \(O(q^2 \cdot (p \cdot n)^4)\) | \(O(q^2 \cdot (\min (q,d) \cdot n)^3)\) | \(O(\min (q,d))\) | |
| – | – | X | (X) | X | X | \(O(q^2 \cdot d \cdot p^2 \cdot n^4)\) | \(O(q^2 \cdot d^3 \cdot n^3)\) | \(O(\min (q,d))\) | Extraction-free |
| X | – | X | (X) | – | – | \(O(q^3 \cdot p^4 \cdot n^4)\) | \(O(q^6 \cdot n^3)\) | \(O(\min (q,d))\) | |
| X | X | X | (X) | – | – | \(O(q^3 \cdot p^4 \cdot n^4)\) | \(O(q^7 \cdot n^4)\) | \(O(\min (q,d))\) | Kleene-free |
| X | – | X | (X) | X | X | \(O(q^2 \cdot d \cdot p^3 \cdot n^4)\) | \(O(q^2 \cdot d^4 \cdot n^3)\) | \(O(d)\) | Normalized |
| X | X | X | (X) | X | X | \(O(q^3 \cdot d \cdot p^4 \cdot n^4)\) | \(O(q^3 \cdot d^5 \cdot n^4)\) | \(O(d)\) | Full |
Theorem 5
Evaluating a simple OXPath expression with evalT_ takes at most \(O(q^2 \cdot n^4)\) time and \(O(q^2 \cdot n^3)\) space where \(q\) is the size of the expression and \(n\) the number of nodes in all documents reached by the evaluation.
In the case of simple OXPath, \(n\) is the number of nodes in the start document.
Proof
The proof follows closely the proof of Theorem 6.6 in [26], since simple OXPath is roughly comparable to XPath, adding only Kleene stars, the style axis and a few operators. Due to the memoization, the Kleene star does not affect the complexity (it does, however, impact practical performance, as it generates on average far more Lookup entries than other expressions).
PAAT is a top-down, recursive implementation of the context-value table principle from [26].
We first show that evaluating a simple OXPath expression using evalT_ takes \(O(l \cdot T_\mathsf{{op}})\) time and \(O(l \cdot S_\mathsf{{val}})\) space, where \(l\) is the maximum number of entries and \(S_\mathsf{{val}}\) the maximum size of an entry in the lookup table Lookup in evalT_, and \(T_\mathsf{{op}}\) the maximum cost for evaluating a function or operator. This holds as the body of evalT_ runs at most once per entry in Lookup. For each entry, the time for executing evalT_ is bounded by the maximum time for evaluating a function or operator, as this time dominates the other cases in evalT_ (all bounded by \(O(n)\), whereas the function or operator evaluation is \(\ge n\)). For space, observe that other than the size of the lookup table, we only manage the three contexts Ctx, \(\mathsf{{Ctx}}^{\prime }\), and \(\mathsf{{Ctx}}^{\prime \prime }\) and the Result, the former three bounded by \(O(n^2)\), the latter by \(O(n)\), and thus all dominated by the size of Lookup.
Second, we show that (1) \(l \in O(q \cdot n^2)\): this follows immediately from the signature of Lookup. (2) \(S_\mathsf{{val}} \in O(q \cdot n)\): concat() and multiplication are the operations that yield the largest increase in value size, and the resulting values are bounded by \(O(q \cdot n)\), see [26]. (3) \(T_\mathsf{{op}} \in O(q \cdot n^2)\): again following Theorem 6.6 in [26], we observe that the most expensive operation is \(=\), which compares two node sets of up to \(O(n)\) size. Together with the bound on value size, we obtain a bound of \(O(q \cdot n^2)\) for \(T_\mathsf{{op}}\). The added axes and comparison operators in OXPath do not affect this result if we assume a precomputed table for \(\sim \) and \(\sim \)= as for =. We also deviate in how we compute position() and size(), projecting the context set to tuples with the same parent and sorting the result. However, this is done in \(O(n \log n)\) and thus dominated by the time for =. \(\square \)
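Combining these bounds with the first part of the proof gives the stated complexity, term by term:

```latex
\begin{align*}
\text{time} &\in O(l \cdot T_{\mathsf{op}})
  \subseteq O\bigl((q \cdot n^2) \cdot (q \cdot n^2)\bigr)
  = O(q^2 \cdot n^4),\\
\text{space} &\in O(l \cdot S_{\mathsf{val}})
  \subseteq O\bigl((q \cdot n^2) \cdot (q \cdot n)\bigr)
  = O(q^2 \cdot n^3).
\end{align*}
```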
Proposition 5
Evaluating an OXPath expression with eval takes \(O(N_{\mathsf{Expr}} \cdot N_\mathsf{{c_e}} + N_{\mathsf{Expr}} \cdot N_\mathsf{{c_x}} \cdot T_\mathsf{{op}})\) time and \(O(N_{\mathsf{Expr}} \cdot S_{\mathsf{Ctx}} + l_\exists + l \cdot S_\mathsf{{val}})\) space, where \(N_{\mathsf{Expr}}\) is the number of subexpressions evaluated by recursive calls in the evaluation, \(N_\mathsf{{c_e}}\) the number of extraction context tuples, \(N_\mathsf{{c_x}}\) the number of XPath context tuples from all reached pages, \(S_{\mathsf{Ctx}}\) the maximum size of a context set in eval, \(l_\exists \) the maximum size of Lookup\(_{\exists }\), \(l\) the maximum size of Lookup, and \(S_\mathsf{{val}}\), \(T_\mathsf{{op}}\) as in Theorem 5.
Proof
We first consider space complexity: eval uses \(\mathtt{{Lookup}}_{\exists }\) and (indirectly) the lookup table Lookup for evalT_. Additionally, we have to account for the various context sets and the result set Result. The latter can be streamed out rather than stored, as it is never processed further. The context sets are accounted for by \(S_{\mathsf{Ctx}} \cdot N_{\mathsf{Expr}}\), as each call to eval uses a constant number of such sets and the depth of the call graph (and thus the number of simultaneously stored context sets) is bounded by \(N_{\mathsf{Expr}}\). It suffices to consider \(l \cdot S_\mathsf{{val}}\) for the impact of evalT_, as Theorem 5 shows that this expression dominates its space complexity.
For time complexity, observe that eval is called at most \(N_{\mathsf{Expr}} \cdot N_\mathsf{{c_e}}\) times. This holds since eval is never called twice on the same expression for the same context tuple, other than in Line 47, where two calls may use the same context tuple \(c_e^{\prime }\) (originating from context tuples with different parent matches). However, in that case, the total number of calls is still bounded by the original ICtx set.
In each such call, eval may delegate the evaluation of the expression or some subexpression to evalT_. Due to the memoization in evalT_, the total time for all these calls is, however, bounded by \(N_{\mathsf{Expr}} \cdot N_\mathsf{{c_x}} \cdot T_\mathsf{{op}}\) as any repeated calls immediately return the memoized result and the memoization tables are kept until all nodes from the page are processed.
Other than those calls and recursive calls to itself on a proper subexpression, eval only requires constant time per node if we assume that action execution is constant. \(\square \)
Theorem 6
Evaluating a full OXPath expression requires \(O(q^3 \cdot d \cdot p^4 \cdot n^4)\) time and \(O(q^3 \cdot d^5 \cdot n^4)\) space.
Proof
For this proof, we first observe the following invariant on eval: each context set contains only context tuples with context nodes from one page.
For full OXPath, the following bounds hold, where \(q\) is the query size, \(d\) the depth of the page tree reached by the evaluation, \(n\) the maximum number of nodes on a page, and \(p\) the number of pages reached by the evaluation: (1) \(N_{\mathsf{Expr}}\) is bounded by \(O(q \cdot d)\), not \(O(q)\), as the expansion of Kleene stars in Line 27 introduces new expressions. However, there can be at most \(d+1\) such expansions on the path to the root from any leaf expression: each expansion includes at most one action and, after \(d\) expansions, any additional expansion yields an expression with \(d+1\) actions on a single path and thus an empty result, as the page tree is bounded by \(d\). In this case, the expansion is stopped and we obtain the bound of \(O(q \cdot d)\). For the evaluation of action-free prefixes, we at most double this number (if each step contains a contextual action). (2) \(N_\mathsf{{c_e}}\) is bounded by \(O(q^2 \cdot p^4 \cdot n^4)\), since there are at most \(p \cdot n\) (parent or actual) context nodes, and each such pair is combined with at most \(q \cdot p \cdot n\) different parent and sibling matches, since these must originate from an extraction marker on the path to the root, and such a path has at most \(q\) distinct such expressions. Kleene star expansion may cause the same extraction marker to occur in several positions, but matches for all occurrences are indistinct. (3) \(N_\mathsf{{c_x}}\) is similarly bounded by \(O((p \cdot n)^2)\). (4) \(T_\mathsf{{op}}\) and \(S_\mathsf{{val}}\) are as in Theorem 5, as we do not allow actions in operands of functions and operators, and thus an operand is limited to nodes from a single page. The value size is also not affected by Kleene star expansions. (5) \(l\) is bounded by \(O(N_{\mathsf{Expr}} \cdot (d \cdot n)^2)\), since only the lookup entries from at most \(d\) pages are active at a time.
(6) \(l_\exists \) is similarly bounded by \(O(N_{\mathsf{Expr}} \cdot q^2 \cdot d^4 \cdot n^4)\), since there are only tuples from at most \(d\) pages with at most \(n\) nodes stored in \(\mathtt{{Lookup}}_{\exists }\), and each of those may be combined with \(q \cdot d \cdot n\) parent and the same number of sibling markers. (7) \(S_{\mathsf{Ctx}}\) is bounded by \(O(n^2 \cdot q^2 \cdot d^2 \cdot n_m^2)\), as each context set contains only extraction context tuples for context nodes from one page (but extraction matches may originate from any page on the current branch of the page tree). With this, the complexity follows from Proposition 5. \(\square \)
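Plugging the bounds (1)–(6) into Proposition 5 yields the stated complexity; the space bound is dominated by the \(l_\exists \) term:

```latex
\begin{align*}
\text{time} &\in O(N_{\mathsf{Expr}} \cdot N_{\mathsf{c_e}}
   + N_{\mathsf{Expr}} \cdot N_{\mathsf{c_x}} \cdot T_{\mathsf{op}})\\
 &\subseteq O\bigl((q \cdot d)(q^2 \cdot p^4 \cdot n^4)\bigr)
   + O\bigl((q \cdot d)(p \cdot n)^2 (q \cdot n^2)\bigr)
  = O(q^3 \cdot d \cdot p^4 \cdot n^4),\\
\text{space} &\in O(l_\exists)
  \subseteq O\bigl((q \cdot d) \cdot q^2 \cdot d^4 \cdot n^4\bigr)
  = O(q^3 \cdot d^5 \cdot n^4).
\end{align*}
```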
Theorem 7
Evaluating a normalized OXPath expression takes \(O(q^2 \cdot d \cdot p^3 \cdot n^4)\) time and \(O(q^2 \cdot d^4 \cdot n^3)\) space.
Proof
For normalized OXPath, we can drop the sibling extraction markers from the extraction context tuples in eval and eval_.
The complexity bounds remain as for full OXPath, except that (1) \(N_\mathsf{{c_e}}\) is bounded by \(O(q \cdot p^3 \cdot n^3)\), since we do not need to maintain sibling extraction matches; (2) \(l_\exists \) is bounded by \(O(N_{\mathsf{Expr}} \cdot q \cdot d^3 \cdot n^3)\) for the same reason; and (3) \(S_{\mathsf{Ctx}}\) is bounded by \(O(q \cdot d \cdot n^3)\), as each context set contains only extraction context tuples for context nodes from one page (but parent extraction matches may originate from any page on the current branch of the page tree). With this, the complexity follows from Proposition 5. \(\square \)
Theorem 8
Intensional axes increase the complexity of OXPath and any sublanguage including XPath by at most a factor of \(O(n^2)\) where \(n\) is the total number of nodes in the page tree.
Proof
In general, intensional axes can be evaluated as follows: First, we materialize all intensional axes in an expression bottom-up. Then, the expression is evaluated over the materialized axes as usual. The actual evaluation complexity is not affected, as the two necessary operations, testing whether a pair of nodes is in an axis and iterating over all nodes that are related to a given context node by an axis, retain their (constant, resp. linear) complexity. For the materialization of the intensional axes, we first note that there are at most \(q\) such axes. For each axis, we need to store at most \(O(n^2)\) tuples. To compute the materialization, we need to evaluate the expression inside the intensional axis at most once for each pair of nodes, binding each pair successively to \(\$lhs\) and \(\$rhs\). \(\square \)
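The eager materialization step can be sketched in a few lines; this is an illustrative model (hypothetical names), with `axis_expr` again standing in for the axis-defining expression evaluated under the \(\$lhs\)/\(\$rhs\) bindings:

```python
# Sketch of eager axis materialization: evaluate the defining expression
# once per node pair and store the up to O(n^2) related pairs as a set.
def materialize_axis(nodes, axis_expr):
    return {(lhs, rhs)
            for lhs in nodes
            for rhs in nodes
            if axis_expr(lhs, rhs)}

# Hypothetical axis relating nodes whose ids have equal parity:
axis = materialize_axis([1, 2, 3, 4], lambda lhs, rhs: (lhs - rhs) % 2 == 0)
```

With the axis stored as a set of pairs, the membership test stays constant-time and iterating over the nodes related to a given context node stays linear, as used in the argument above.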
Theorem 9
(Memory minimality) Let \(L\) be an OXPath expression without actions in predicates. Then there exists a page tree for which every algorithm that computes \(\left[\![\, \cdot \,\right]\!]_{\mathsf{V}}\) without prior knowledge of the page tree requires at least as many page buffers as PAAT.
Proof
An expression from \(L\) has the shape \(e_d=\) doc(w)\(r_d\) with \(r_d=\) /\(\phi _d\)/{action}\(r_{d-1}\) for \(d>1\) and \(r_1= \epsilon \). Assume further that in the page tree of the expressions \(e_d\), each page has at least two nodes with an action that leads again to another page of this form. \(e_d\) executes {action} on all nodes of a page \(w\) that match the corresponding \(\phi _i\) and continues recursively from all pages thus reached. It returns the roots of the pages finally reached.
When we evaluate \(e_d\) with PAAT, we access the page tree up to a depth of \(d\) and use exactly \(d\) page buffers. This holds, since the accessed page tree has at least two branches at each page.
Any other algorithm \(A\) must load the leaves of the accessed page tree of PAAT, as these nodes are the result of evaluating \(\left[\![\, e_d\,\right]\!]_{\mathsf{V}}\). To visit such a leaf node \(l\) of the accessed page tree, we have to load its parent \(p\) first, because, without prior knowledge, all children of \(p\) are only accessible by performing {action} on the respective node in \(p\). Thus, \(A\) must have loaded all \(d-1\) ancestors of \(l\) to finally access \(l\). Assume that \(l\) is the first leaf reached by \(A\). Then, \(A\) must buffer all \(d-1\) ancestors in addition to \(l\), because for each ancestor of \(l\), there are further children to be visited. \(\square \)
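The buffering behavior underlying this argument can be illustrated with a depth-first traversal of a hypothetical binary page tree: only the pages on the current root-to-leaf path stay buffered, so the peak buffer count equals the length of such a path. All names here are illustrative, not the actual PAAT implementation:

```python
# Depth-first traversal of a hypothetical binary page tree: one buffer is
# held per page on the current root-to-leaf path and freed on backtracking.
# `depth` counts the remaining action steps, so a tree with d action steps
# below the root keeps at most d + 1 pages buffered at any time.
def visit(page, depth, buffers, peak):
    buffers.append(page)                       # load page into a fresh buffer
    peak[0] = max(peak[0], len(buffers))
    if depth > 0:
        for child in (page + "/L", page + "/R"):   # two action nodes per page
            visit(child, depth - 1, buffers, peak)
    buffers.pop()                              # free buffer once subtree done

peak = [0]
visit("root", 3, [], peak)   # 3 action steps below the root
```

Here the peak is 4 buffers for 3 action steps (the root plus its three ancestors on the path), matching the root-to-leaf path length in the proof; no algorithm without prior knowledge of the page tree can do better.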
6 Evaluation
Our experimental evaluation of OXPath yields three main results: (1) The theoretical complexity bounds from Sect. 5.7 are confirmed in several large-scale extraction experiments in diverse settings, in particular the constant memory use even when extracting millions of records from hundreds of thousands of web pages.
(2) We illustrate that OXPath’s evaluation is dominated by the browser rendering time, even for complex queries on small web pages. None of the extensions of OXPath (Sect. 1.1) significantly affects the scaling behavior or the dominance of page rendering.
(3) In an extensive comparison with commercial and academic web extraction tools, we show that OXPath outperforms previous approaches by at least an order of magnitude. Where OXPath stays within constant memory bounds independently of the number of accessed pages, most other tools require linear memory.
Profiling: Page Rendering is Dominant. We profile each stage of OXPath's evaluation, performing five sets of queries on the following sites: apple.com (D1), diadem-project.info (D2), bing.com (D3), www.vldb.org/2011 (D4), and the Seattle page on Wikipedia (D5). On each, we click all links and extract the html tag of the resulting pages. Figure 12a, b shows the total and page-wise averages, respectively. For bing.com, the page rendering time and number of links is very low, and thus so is the overall evaluation time. Wikipedia pages, on the other hand, are comparatively large and contain many links; thus, the overall evaluation time is high.
Actions (as well as extraction markers and Kleene star expressions) affect the evaluation notably, as expected. This contrasts sharply with the added selection capabilities: neither style, field(), nor the added selectors significantly impact evaluation performance. The same holds for intensional axes. Figure 13 summarizes this observation, showing the evaluation time for a series of expansions of a simple expression. It compares the case where the expression is pure XPath (/descendant-or-self::*[self::*] repeated \(1\) to \(5\) times) with the case where the expression uses a style axis (instead of self::*). The results are nearly identical for the intensional axis or any of the other added features, if expression size is properly accounted for. We show results for up to 5 repetitions, but observe that the roughly 10 % overhead also holds for much larger expressions. Note that these results are affected by our use of the XPath engine provided by the browser, which does not perform any optimization of XPath expressions.
Order-of-Magnitude Improvement. We benchmark OXPath against three commercial web extraction tools, namely Web Content Extractor [2] (WCE), Lixto [12], and Visual Web Ripper [3] (VWR), as well as the academic web automation and extraction system Chickenfoot [17] and the open-source extraction toolkit Web Harvest [4]. While the first three can express at least the same extraction tasks as OXPath (and, for example, Lixto goes considerably beyond), Chickenfoot and Web Harvest require scripted iteration and manual memory management for many extraction tasks, particularly for realizing multi-way navigation. We do not consider tools such as CoScripter and iMacros, as they focus on automation only and offer no iterative constructs as required for extraction tasks. We also disregard tools such as RoadRunner [22] or XWRAP [33], since they work on single pages and lack the ability to traverse to new web pages.
In contrast to OXPath, many of these tools cannot easily process scripted web sites. Thus, we choose an extraction task on Google Scholar as our benchmark, since it does not require scripted actions. On heavily scripted pages, the performance advantage of OXPath is even more pronounced. With each system, we navigate the citation graph to a depth of 3 for papers on “Seattle”.
An equivalent Web Harvest program takes 54 lines, whereas an equivalent Chickenfoot script takes 27, and the other tools use visual interfaces.
We record evaluation time and memory consumption for each system. We measure the normalized evaluation time, in which we discount the time for page loading, cleaning, and rendering. This allows for a more balanced comparison as the differences in the employed browser or web cleaning engines affect the overall runtime considerably. Figure 14a shows the results for each system up to 150 pages. Though Chickenfoot and Web Harvest do not render pages at all or do not manage page and browser state, OXPath still outperforms them. The systems that manage state similar to OXPath are between two and four times slower than OXPath even on this small number of pages.
Figure 14c illustrates the memory use of these systems. WCE and VWR are again excluded, but they show a clearly linear memory usage in those tests we were still able to run. Among the systems in Fig. 14c, only Web Harvest comes close to the memory usage of OXPath, which is not surprising as it does not render pages. Yet, even Web Harvest shows a clear linear trend. Chickenfoot exhibits a constant memory consumption just as OXPath, though it takes about ten times more memory in absolute terms. The constant memory is due to Chickenfoot’s lack of support for multiway navigation that we compensate by manually using the browser’s history whenever possible. This forces reloading when a page is no longer cached, but requires only a single active DOM instance at any time. We also tried to simulate multiway navigation in Chickenfoot, but the resulting program was too slow for the tests shown here.
7 Related work
The automatic extraction and aggregation of web information is not a new challenge. Almost all previous approaches require either (1) service providers to deliver their data in a structured fashion, as in the Semantic Web, or (2) clients to wrap unstructured information sources to extract and aggregate relevant data. The first case levies requirements that service providers have little incentive to adopt, rendering client-side wrapping the only realistic choice.
As recognized in [6], wrapping web sites has become even more involved with the advent of AJAX-enabled web applications, which reveal the relevant data only during user interactions. Previous approaches to web extraction [34, 46] do not adequately address web page scripting. Where scripting is addressed [12, 15, 42, 47], the simulation of user actions is neither declarative nor succinct, but rather relies on imperative action scripts and standalone, heavyweight extraction interfaces. Web automation tools such as [17, 39] are increasingly able to deal with scripted web applications, but are tailored to automating a single sequence of user actions. Hence, they are neither convenient nor efficient for large-scale data extraction with its inherent multi-way navigation, necessary to reach all relevant information pieces by following multiple links on the same page.
Thus, in the following, we particularly consider (1) filling and submitting (scripted) web forms, (2) multi-way navigation, and (3) memory management for large-scale extraction. We focus on supervised extraction tools, categorized into web crawlers, web extraction IDEs, extraction languages, and web automation tools. We exclude unsupervised web extraction tools (see [21] for a survey), as they focus on automated analysis rather than extraction, rendering them largely incomparable to OXPath.
Web Crawlers. Most commonly employed by search engines for indexing web pages, such crawlers store the relevant information on any found web pages and move on to new pages by traversing all present hyperlinks. Due to the commercial relevance of web crawlers, rather little published research exists, as compared to their prominence in industrial applications. Nevertheless, some work has been published, for example, most famously on the Google crawler [18]. Acting only on static page representations, such crawlers are unable to handle dynamic, scripted content.
Thus, these web crawlers, in their current state, are incapable of extracting information from content reachable only via some user interaction. Bergman [14] first recognized that such content by far exceeds the quantity of data accessible by hyperlink traversal, and it is assumed to have grown in importance ever since [28]. OXPath expressions can specify the crawling of scripted web sites by following web links, submitting forms, etc.
Web Extraction IDEs. Web extraction IDEs, such as [12], have a wider scope than OXPath, as they provide an entire development environment (e.g., extraction clusters on Amazon EC2, full support of XPath 2.0). But in terms of extraction speed and memory consumption, OXPath outclasses these systems by a wide margin (see Sect. 6). Lixto, Visual Web Ripper [3], and Web Content Extractor [2] are interactive wrapper generator frameworks, recording user actions in browsers to replay these actions for extracting data. As our experimental evaluation (Sect. 6) demonstrates, the memory footprint of these systems grows linearly with the number of accessed pages—in contrast to OXPath's constant memory requirements.
Deep web extraction tools such as [15] increasingly deal with scripted, highly visual web sites, but infer the extraction scripts automatically from user-provided examples. Though allowing for easy wrapper generation, such an approach lacks the precision necessary for many fully automated tasks. As another example, BODE [47] is a browser-based extraction tool whose imperative extraction language BODED is neither portable nor easy to optimize. Without minimizing memory requirements, BODE replicates complete browser instances for multi-way navigation, imposing a significant performance penalty and rendering it unsuitable for large-scale data extraction.
Web Extraction Languages. Most extraction languages follow a declarative approach [9, 34, 37, 44–46], much like OXPath. However, they do not adequately facilitate deep web interaction, such as form submissions, often owing to their age. They also provide no native constructs for page navigation apart from retrieving a page from a given URL. As an exception, the BODED extraction language [47] deals with modern web applications, but is unsuitable for large-scale extraction tasks, as discussed above.
In our evaluation, we compare with Web Harvest [4], a recent, open-source extraction language. Extraction tasks are specified as imperative scripts formulated in XML. Web Harvest does not deal with interactive web applications and does not give access to the rendered page, but rather to a cleaned XML view of HTML documents.
Another strand of research employed XML technologies, for example, XPath, for information extraction. As a notable example, ANDES [40] is capable of navigating modern web interfaces, but only by generating URLs from naively filled forms and feeding these URLs back to the underlying crawler. In contrast, OXPath embeds extraction and navigation into a single seamless process, handling more complicated web interfaces in a more intuitive manner.
Otherwise similar to ANDES, the approach in [7] is limited to generalizing tree traversal patterns. A third example is L-wrappers [10], albeit limited to scraping data from result pages returned on query submissions. In [20], the authors report on an XPath-based interface for web forms, but have not released their work so far. As a final example, WebProspector [38] processes the deep web within the science domain, but appears to be limited to this domain.
XLog [46] extends the ideas of Elog (the datalog-based extraction formalism underlying Lixto [12]) with embedded (procedural) extraction predicates. It is optimized for large-scale information extraction tasks, but does not address any kind of web interaction, such as form filling and page navigation. Earlier work also explores declarative languages for specifying extraction [9, 37, 44], but does not sufficiently support interaction or page scripting.
W4F [44] offers WYSIWYG support for wrapper specification, whereas extraction rules are specified using HEL (HTML Extraction Language), an SQL-like language for HTML element selection in the spirit of WebSQL [37] and WebOQL [9]. However, none of these languages adequately addresses web interaction.
Web Automation Tools. Web automation tools mainly focus on single navigation sequences to automate a single task, but do not consider large-scale web extraction with its need for low overhead and multi-way navigation. CoScripter [31] and iMacros [1] are examples of such tools, which do not support multi-way navigation due to their limited iterative and conditional programming constructs. Vegemite [32] is a CoScripter extension that introduces some extraction capabilities, such as querying some value for a number of inputs. However, as its authors note, such navigation patterns are expensive, since the same page might be reloaded many times. Furthermore, as the page state is not preserved, some web applications may not behave as expected.
The same applies to Chickenfoot [17], a language for web automation that runs its scripts in Firefox. Chickenfoot scripts are essentially imperative JavaScript programs that contain loops and iterations, enabling interaction with forms as well as loading and navigating pages. Multi-way navigation is possible, but only through explicit “back” instructions commanding the browser to return to previous pages. Thus, page buffering is unnecessary, but at a high price: page states are lost, and thus pages must be rendered anew for each branch leaving a page during multi-way navigation.
Some other systems rely on recorded user actions, for example, WebVCR [8] and WebMacros [43], or, more recently, [39]. All these tools suffer from limitations on modern web pages and consider only single action sequences rather than scalable multi-path data extraction tasks.
More recent work [44] addresses the issue of filling web forms automatically. This work, however, does not offer declarative scripting and makes several simplifying assumptions that we do not, for example, considering drop-down lists as the only dynamic content.
8 Conclusion and future work
To the best of our knowledge, OXPath is the first web extraction system with strict memory guarantees, which reflect strongly in our experimental evaluation. We believe that it can become an important part of the toolset of developers interacting with the web.
We are committed to building a strong set of tools around OXPath. We provide a visual generator for OXPath expressions and a Java API based on JAXP. Among the issues raised by OXPath that we plan to address in future work are the following: (1) OXPath is amenable to significant optimization and is a good target for the automated generation of web extraction programs. (2) OXPath is perfectly suited for highly parallel execution: different bindings for the same variable can be filled into forms in parallel. The effective parallel execution of actions on context sets with many nodes remains an open issue. (3) We plan to further investigate language features, such as more expressive visual features and multi-property axes.
However, classical results [41] on rewriting reverse axes such as ancestor in XPath do not extend to OXPath.
Thus, (path)*[qp] = \(\left(\bigcup _{i=0}^\infty \mathtt{\textit{path}}^i\right)\)[qp] always holds, but (path)*[qp] = \(\bigcup _{i=0}^\infty \)path\(^i\)[qp] does not necessarily hold, since in the latter [qp] is applied separately to each \(i\)-fold repetition of \(path\).
Simple OXPath is the restriction of OXPath to simple OXPath expressions, though we allow a doc() action at the start of the expression to set the document to be queried.
Acknowledgments
The research leading to these results has received funding from the European Research Council under the European Community’s 7th Framework Programme (FP7/2007–2013)/ERC grant agreement no. 246858 (DIADEM). This work was carried out in the wider context of the networking programme FoX—Foundations of XML, FETOpen grant agreement number FP7ICT233599. The views expressed in this article are solely those of the authors.