As the annotations are represented in XML, there is a variety of tools available to work with the annotations. Such tools include XSLT, XPath and XQuery, as well as a number of special purpose tools—some of which were developed in the course of the Lassy project. Most of these tools have in common that particular parts of the tree can be identified using the XPath query language. XPath (XML Path Language) is an official W3C standard which provides a language for addressing parts of an XML document. In this section we provide a number of simple examples of the use of XPath to search in the Lassy corpora. We then continue to argue against some perceived limitations of XPath.
3.1 Search with XPath
We start by providing a number of simple XPath queries that can be used to search in the Lassy treebanks. We do not give a full introduction to the XPath language—for this purpose there are various resources available on the web.
3.1.1 Some Examples
With XPath, we can refer to hierarchical information (encoded by the hierarchical embedding of node elements), grammatical categories and functions (encoded by the cat and rel attributes), and surface order (encoded by the attributes begin and end).
As a simple introductory example, the query //node[@cat="pp"] identifies all nodes anywhere in a given document for which the value of the cat attribute equals pp. In practice, if we run such a query against our Lassy Small corpus using the Dact tool (introduced below), we get all sentences which contain a prepositional phrase, with these prepositional phrases highlighted. In the query, the double slash notation indicates that this node can appear anywhere in the dependency structure. Conditions on this node are given between square brackets. Such conditions often refer to particular values of particular attributes. Conditions can be combined using the boolean operators and, or and not. For instance, we can extend the previous query by requiring that the PP node should start at the beginning of the sentence:
Brackets can be used to indicate the intended structure of the conditions, as in:
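As an illustration of these selections, the following Python sketch applies them to a small hand-made fragment in the style of the Lassy XML format. The sample sentence, the variable names, the choice of begin="0" for the sentence-initial position and the advp alternative in the bracketed condition are assumptions made for this example. Since the standard-library module xml.etree.ElementTree supports only a subset of XPath (it has no and, or or not), the conjunction is written as two stacked predicates and the disjunction is evaluated in Python.

```python
import xml.etree.ElementTree as ET

# Hand-made toy fragment in the style of the Lassy XML format (assumption).
SAMPLE = """<alpino_ds>
  <node begin="0" end="6" cat="top" rel="top">
    <node begin="0" end="6" cat="smain" rel="--">
      <node begin="0" end="2" cat="pp" rel="ld">
        <node begin="0" end="1" rel="hd" word="In"/>
        <node begin="1" end="2" rel="obj1" word="Amsterdam"/>
      </node>
      <node begin="2" end="3" rel="hd" word="woont"/>
      <node begin="3" end="4" rel="su" word="hij"/>
      <node begin="4" end="6" cat="pp" rel="mod">
        <node begin="4" end="5" rel="hd" word="met"/>
        <node begin="5" end="6" rel="obj1" word="plezier"/>
      </node>
    </node>
  </node>
</alpino_ds>"""

root = ET.fromstring(SAMPLE)


def span(node):
    """Return the (begin, end) positions of a node as integers."""
    return int(node.get("begin")), int(node.get("end"))


# Counterpart of //node[@cat="pp"]: every PP, anywhere in the tree.
all_pps = root.findall(".//node[@cat='pp']")

# Counterpart of //node[@cat="pp" and @begin="0"]:
# two stacked predicates behave like a conjunction.
initial_pps = root.findall(".//node[@cat='pp'][@begin='0']")

# Counterpart of //node[(@cat="pp" or @cat="advp") and @begin="0"]
# (illustrative), with the bracketed disjunction evaluated in Python.
initial_pp_or_advp = [
    n for n in root.iter("node")
    if n.get("cat") in ("pp", "advp") and n.get("begin") == "0"
]

print([span(n) for n in all_pps])
print([span(n) for n in initial_pps])
```

A full XPath engine would accept the and/or expressions directly; the detour through Python is only a consequence of the restricted XPath support in the standard library.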
Conditions can also refer to the context of the node. In the following query, we pose further restrictions on a daughter node of the PP category.
This query will find all sentences in which a PP occurs with a head node whose part-of-speech label is not of the form VZ(..). Such a query will return quite a few hits, in most cases for prepositional phrases which are headed by multi-word-units such as in tegenstelling tot (in contrast with), met betrekking tot (with respect to), and so on. If we want to exclude such multi-word-units, the query could be extended as follows, where we require that there is a word attribute, irrespective of its value.
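The two conditions on the head daughter can be sketched in Python as follows. The function name, the require_word flag and the toy data (including the PP headed by richting) are hypothetical and serve only to show the effect of the extra requirement; a multi-word-unit head is assumed to be a node with cat="mwu" that carries neither a postag nor a word attribute of its own.

```python
import xml.etree.ElementTree as ET

# Toy data (assumption): three PPs, headed by a multi-word unit,
# by an ordinary preposition, and by a word with a non-VZ tag.
SAMPLE = """<alpino_ds>
  <node cat="top" rel="top" begin="0" end="8">
    <node cat="pp" rel="mod" begin="0" end="4">
      <node cat="mwu" rel="hd" begin="0" end="3">
        <node rel="mwp" begin="0" end="1" word="in" postag="VZ(init)"/>
        <node rel="mwp" begin="1" end="2" word="tegenstelling" postag="N(soort,ev,basis,zijd,stan)"/>
        <node rel="mwp" begin="2" end="3" word="tot" postag="VZ(init)"/>
      </node>
      <node rel="obj1" begin="3" end="4" word="Jan" postag="N(eigen,ev,basis,zijd,stan)"/>
    </node>
    <node cat="pp" rel="mod" begin="4" end="6">
      <node rel="hd" begin="4" end="5" word="met" postag="VZ(init)"/>
      <node rel="obj1" begin="5" end="6" word="plezier" postag="N(soort,ev,basis,onz,stan)"/>
    </node>
    <node cat="pp" rel="mod" begin="6" end="8">
      <node rel="hd" begin="6" end="7" word="richting" postag="N(soort,ev,basis,zijd,stan)"/>
      <node rel="obj1" begin="7" end="8" word="Utrecht" postag="N(eigen,ev,basis,onz,stan)"/>
    </node>
  </node>
</alpino_ds>"""

root = ET.fromstring(SAMPLE)


def span(node):
    """Return the (begin, end) positions of a node as integers."""
    return int(node.get("begin")), int(node.get("end"))


def pps_with_non_vz_head(tree, require_word=False):
    """Collect PPs that have a head daughter whose postag does not start with VZ.

    With require_word=True the head must also carry a word attribute,
    which excludes heads that are multi-word units.
    """
    hits = []
    for pp in tree.iter("node"):
        if pp.get("cat") != "pp":
            continue
        for head in pp.findall("node[@rel='hd']"):
            # A missing postag counts as "not of the form VZ(..)".
            if (head.get("postag") or "").startswith("VZ"):
                continue
            if require_word and head.get("word") is None:
                continue
            hits.append(pp)
            break
    return hits


print([span(pp) for pp in pps_with_non_vz_head(root)])
print([span(pp) for pp in pps_with_non_vz_head(root, require_word=True)])
```

Without the extra requirement the PP headed by the multi-word unit is included; with it, only the PP whose head is a single word with a non-VZ tag remains.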
We can look further down inside a node using the single slash notation. For instance, the expression node[@rel="obj1"]/node[@rel="hd"] will refer to the head of the direct object. We can also access the value of an attribute of a sub-node, as in node[@rel="hd"]/@postag.
It is also possible to refer to the mother node of a given node, using the double dot notation. The following query identifies prepositional phrases which are a dependent in a main sentence:
Combining the two possibilities we can also refer to sister nodes. In this query, we find prepositional phrases as long as there is a sister which functions as a secondary object:
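Reference to the mother and to sisters can be sketched in Python with an explicit child-to-parent map, since xml.etree.ElementTree elements do not store a pointer to their parent. The function names, the toy sentence, and the use of cat="smain" for main sentences and rel="obj2" for secondary objects are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Toy sentence (assumption): "Hij geeft Marie een boek over taal in Amsterdam".
SAMPLE = """<alpino_ds>
  <node cat="top" rel="top" begin="0" end="9">
    <node cat="smain" rel="--" begin="0" end="9">
      <node rel="su" begin="0" end="1" word="Hij"/>
      <node rel="hd" begin="1" end="2" word="geeft"/>
      <node rel="obj2" begin="2" end="3" word="Marie"/>
      <node cat="np" rel="obj1" begin="3" end="7">
        <node rel="det" begin="3" end="4" word="een"/>
        <node rel="hd" begin="4" end="5" word="boek"/>
        <node cat="pp" rel="mod" begin="5" end="7">
          <node rel="hd" begin="5" end="6" word="over"/>
          <node rel="obj1" begin="6" end="7" word="taal"/>
        </node>
      </node>
      <node cat="pp" rel="mod" begin="7" end="9">
        <node rel="hd" begin="7" end="8" word="in"/>
        <node rel="obj1" begin="8" end="9" word="Amsterdam"/>
      </node>
    </node>
  </node>
</alpino_ds>"""

root = ET.fromstring(SAMPLE)

# Map every element to its mother; this plays the role of the ".." step.
parents = {child: parent for parent in root.iter() for child in parent}


def span(node):
    """Return the (begin, end) positions of a node as integers."""
    return int(node.get("begin")), int(node.get("end"))


def pps_in_main_sentence():
    """PPs whose mother is a main sentence (cat="smain")."""
    return [
        n for n in root.iter("node")
        if n.get("cat") == "pp"
        and parents.get(n) is not None
        and parents[n].get("cat") == "smain"
    ]


def pps_with_obj2_sister():
    """PPs whose mother has a daughter with rel="obj2"."""
    return [
        n for n in root.iter("node")
        if n.get("cat") == "pp"
        and parents.get(n) is not None
        and any(sister.get("rel") == "obj2" for sister in parents[n])
    ]


print([span(n) for n in pps_in_main_sentence()])
print([span(n) for n in pps_with_obj2_sister()])
```

In both cases the PP embedded in the noun phrase is skipped, because its mother is the NP rather than the main sentence.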
Finally, the special notation .// identifies any node which is embedded anywhere in the current node. The next query finds embedded sentences which include the word van anywhere.
3.1.2 Left to Right Ordering
Consider the following example, in which we identify prepositional phrases in which the preposition (the head) is preceded by the NP (which is assigned the obj1 function). Here we use the operator < to implement precedence.
Note that in these examples we use the number() function to map the string value explicitly to a number. This is required in some implementations of XPath.
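The role of the numeric conversion can be illustrated with a small Python sketch, in which int() plays the part of number(). The function name and the toy data are hypothetical; the positions 9 and 10 were chosen deliberately, because they are ordered differently as strings than as numbers.

```python
import xml.etree.ElementTree as ET

# Toy fragment (assumption): an ordinary PP and a PP in which the
# object precedes the head. Positions are not a complete sentence.
SAMPLE = """<alpino_ds>
  <node cat="top" rel="top" begin="0" end="11">
    <node cat="pp" rel="mod" begin="1" end="3">
      <node rel="hd" begin="1" end="2" word="in"/>
      <node rel="obj1" begin="2" end="3" word="Amsterdam"/>
    </node>
    <node cat="pp" rel="mod" begin="9" end="11">
      <node rel="obj1" begin="9" end="10" word="daar"/>
      <node rel="hd" begin="10" end="11" word="mee"/>
    </node>
  </node>
</alpino_ds>"""

root = ET.fromstring(SAMPLE)


def span(node):
    """Return the (begin, end) positions of a node as integers."""
    return int(node.get("begin")), int(node.get("end"))


def pps_with_object_before_head(tree):
    """PPs in which an obj1 daughter begins before a head daughter."""
    hits = []
    for pp in tree.iter("node"):
        if pp.get("cat") != "pp":
            continue
        heads = pp.findall("node[@rel='hd']")
        objects = pp.findall("node[@rel='obj1']")
        # Attribute values are strings, so convert them before comparing.
        if any(int(o.get("begin")) < int(h.get("begin"))
               for o in objects for h in heads):
            hits.append(pp)
    return hits


print([span(pp) for pp in pps_with_object_before_head(root)])
# Comparing the raw strings would miss this PP, since "9" sorts after "10".
print("9" < "10", 9 < 10)
```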
The operator = can be used to implement direct precedence. As another example, consider the problem of finding a prepositional phrase which follows a finite verb directly in a subordinate finite sentence. Initially, we arrive at the following query:
This does identify subordinate finite sentences in which the finite verb is directly followed by a PP. But note that the query also requires that this PP is a dependent of the same node. If we want to find a PP anywhere, then the query becomes:
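The difference between the two variants can be sketched in Python. Direct precedence is tested by comparing the end position of the verb with the begin position of the PP. The function name, the anywhere flag, the use of cat="ssub" for subordinate finite sentences and both toy sentences are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# "dat hij woont in Amsterdam": the PP is a daughter of the ssub node.
SAME_NODE = """<alpino_ds>
  <node cat="cp" rel="--" begin="0" end="5">
    <node rel="cmp" begin="0" end="1" word="dat"/>
    <node cat="ssub" rel="body" begin="1" end="5">
      <node rel="su" begin="1" end="2" word="hij"/>
      <node rel="hd" begin="2" end="3" word="woont"/>
      <node cat="pp" rel="ld" begin="3" end="5">
        <node rel="hd" begin="3" end="4" word="in"/>
        <node rel="obj1" begin="4" end="5" word="Amsterdam"/>
      </node>
    </node>
  </node>
</alpino_ds>"""

# "dat hij een boek leest over taal": the PP belongs to the object NP.
EXTRAPOSED = """<alpino_ds>
  <node cat="cp" rel="--" begin="0" end="7">
    <node rel="cmp" begin="0" end="1" word="dat"/>
    <node cat="ssub" rel="body" begin="1" end="7">
      <node rel="su" begin="1" end="2" word="hij"/>
      <node cat="np" rel="obj1" begin="2" end="7">
        <node rel="det" begin="2" end="3" word="een"/>
        <node rel="hd" begin="3" end="4" word="boek"/>
        <node cat="pp" rel="mod" begin="5" end="7">
          <node rel="hd" begin="5" end="6" word="over"/>
          <node rel="obj1" begin="6" end="7" word="taal"/>
        </node>
      </node>
      <node rel="hd" begin="4" end="5" word="leest"/>
    </node>
  </node>
</alpino_ds>"""


def ssubs_with_verb_then_pp(tree, anywhere=False):
    """Subordinate sentences whose head verb is directly followed by a PP.

    anywhere=False: the PP must be a daughter of the same ssub node.
    anywhere=True:  the PP may occur anywhere in the document.
    """
    all_pp_begins = {n.get("begin") for n in tree.iter("node")
                     if n.get("cat") == "pp"}
    hits = []
    for ssub in tree.iter("node"):
        if ssub.get("cat") != "ssub":
            continue
        verb_ends = {h.get("end") for h in ssub.findall("node[@rel='hd']")}
        if anywhere:
            pp_begins = all_pp_begins
        else:
            pp_begins = {d.get("begin") for d in ssub.findall("node[@cat='pp']")}
        # Direct precedence: some verb end equals some PP begin.
        if verb_ends & pp_begins:
            hits.append(ssub)
    return hits


same = ET.fromstring(SAME_NODE)
extraposed = ET.fromstring(EXTRAPOSED)
print(len(ssubs_with_verb_then_pp(same)))
print(len(ssubs_with_verb_then_pp(extraposed)))
print(len(ssubs_with_verb_then_pp(extraposed, anywhere=True)))
```

The second sentence is found only by the relaxed variant, because its PP is a daughter of the object NP rather than of the subordinate sentence.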
3.1.3 Pitfalls
The content and sub-structure of coindexed nodes (to represent secondary edges) is present in the XML structure only once. The index attribute is used to indicate equivalence of the nodes. This may have some unexpected effects. For instance, the following query will not match with the dependency structure given in Fig. 9.1.
The reason is that the subject of stoppen itself does not have a subject with lemma=hij. Instead, it does have a subject which is co-indexed with a node for which this requirement is true. In order to match this case as well, the query has to be made more complex, for instance as follows:
The example illustrates that the use of co-indexing is not problematic for XPath, but it does complicate the queries in some cases. Some tools (for instance the Dact tool described in Sect. 9.3.3) provide the capacity to define macro substitutions in queries, which simplifies matters considerably.
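The effect of co-indexing on a search can be sketched in Python. The toy structure below is a hand-made stand-in and not the structure of Fig. 9.1; the function names and the follow_index flag are hypothetical. An index-only node is assumed to be a node that has an index attribute but neither a word nor a cat attribute.

```python
import xml.etree.ElementTree as ET

# Toy structure (assumption) for "hij wil stoppen": the subject of
# "stoppen" is an empty node co-indexed with the subject of "wil".
SAMPLE = """<alpino_ds>
  <node cat="top" rel="top" begin="0" end="3">
    <node cat="smain" rel="--" begin="0" end="3">
      <node rel="su" index="1" begin="0" end="1" word="hij" lemma="hij"/>
      <node rel="hd" begin="1" end="2" word="wil" lemma="willen"/>
      <node cat="inf" rel="vc" begin="0" end="3">
        <node rel="su" index="1" begin="0" end="1"/>
        <node rel="hd" begin="2" end="3" word="stoppen" lemma="stoppen"/>
      </node>
    </node>
  </node>
</alpino_ds>"""

root = ET.fromstring(SAMPLE)


def antecedents(tree):
    """Map each index value to the node that carries the actual content."""
    return {
        n.get("index"): n
        for n in tree.iter("node")
        if n.get("index") is not None
        and (n.get("word") is not None or n.get("cat") is not None)
    }


def resolve(node, table):
    """Replace an index-only node by its antecedent, if one is known."""
    is_empty = node.get("word") is None and node.get("cat") is None
    if node.get("index") is not None and is_empty:
        return table.get(node.get("index"), node)
    return node


def nodes_with_head_and_subject(tree, head_lemma, subject_lemma,
                                follow_index=True):
    """Nodes with a head of one lemma and a subject of another lemma."""
    table = antecedents(tree) if follow_index else {}
    hits = []
    for n in tree.iter("node"):
        has_head = any(h.get("lemma") == head_lemma
                       for h in n.findall("node[@rel='hd']"))
        has_subject = any(resolve(s, table).get("lemma") == subject_lemma
                          for s in n.findall("node[@rel='su']"))
        if has_head and has_subject:
            hits.append(n)
    return hits


naive = nodes_with_head_and_subject(root, "stoppen", "hij", follow_index=False)
resolved = nodes_with_head_and_subject(root, "stoppen", "hij")
print(len(naive), [n.get("cat") for n in resolved])
```

The naive search finds nothing, while the search that follows the index finds the infinitival clause. A macro in a query tool could hide the same indirection from the user.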
3.2 Comparison with Lai and Bird 2004
In [6] a comparison of a number of existing query languages is presented, focussing on seven example queries. Here we show that each of the seven queries can be formulated in XPath for the Lassy treebank. In order to do this, we first adapted the queries in a non-essential way. For one thing, some queries refer to English words, which we mapped to Dutch words. Another difference is that there is no (finite) VP in the Lassy treebank. The adapted queries, with their implementation in XPath, are as follows:
1. Find sentences that include the word zag.
2. Find sentences that do not include the word zag.
3. Find noun phrases whose rightmost child is a noun.
4. Find root sentences that contain a verb immediately followed by a noun phrase that is immediately followed by a prepositional phrase.
5. Find the first common ancestor of sequences of a noun phrase followed by a prepositional phrase.
6. Find a noun phrase which dominates a word donker (dark) that is dominated by an intermediate phrase that is a prepositional phrase.
7. Find a noun phrase dominated by a root sentence. Return the subtree dominated by that noun phrase only.
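As an illustration, the first three tasks can be sketched in Python on a toy sentence. This is not the XPath implementation referred to above: the function names and the sample are hypothetical, a noun is assumed to be a node whose postag starts with N(, and the rightmost child is taken to be the daughter with the largest end position.

```python
import xml.etree.ElementTree as ET

# Toy sentence (assumption): "Jan zag de man met de hoed".
SAMPLE = """<alpino_ds>
  <node cat="top" rel="top" begin="0" end="7">
    <node cat="smain" rel="--" begin="0" end="7">
      <node rel="su" begin="0" end="1" word="Jan" postag="N(eigen,ev,basis,zijd,stan)"/>
      <node rel="hd" begin="1" end="2" word="zag" postag="WW(pv,verl,ev)"/>
      <node cat="np" rel="obj1" begin="2" end="7">
        <node rel="det" begin="2" end="3" word="de" postag="LID(bep,stan,rest)"/>
        <node rel="hd" begin="3" end="4" word="man" postag="N(soort,ev,basis,zijd,stan)"/>
        <node cat="pp" rel="mod" begin="4" end="7">
          <node rel="hd" begin="4" end="5" word="met" postag="VZ(init)"/>
          <node cat="np" rel="obj1" begin="5" end="7">
            <node rel="det" begin="5" end="6" word="de" postag="LID(bep,stan,rest)"/>
            <node rel="hd" begin="6" end="7" word="hoed" postag="N(soort,ev,basis,zijd,stan)"/>
          </node>
        </node>
      </node>
    </node>
  </node>
</alpino_ds>"""

root = ET.fromstring(SAMPLE)


def span(node):
    """Return the (begin, end) positions of a node as integers."""
    return int(node.get("begin")), int(node.get("end"))


def contains_word(tree, word):
    """Tasks 1 and 2: does any node carry the given word form?"""
    return any(n.get("word") == word for n in tree.iter("node"))


def nps_with_rightmost_noun(tree):
    """Task 3: noun phrases whose rightmost daughter is a noun."""
    hits = []
    for np in tree.iter("node"):
        daughters = list(np)
        if np.get("cat") != "np" or not daughters:
            continue
        rightmost = max(daughters, key=lambda d: int(d.get("end")))
        if (rightmost.get("postag") or "").startswith("N("):
            hits.append(np)
    return hits


print(contains_word(root, "zag"))          # task 1
print(not contains_word(root, "zag"))      # task 2
print([span(np) for np in nps_with_rightmost_noun(root)])  # task 3
```

The outer noun phrase is not returned for the third task, because its rightmost daughter is the PP.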
The ease with which the queries can be defined may be surprising to readers familiar with Lai and Bird [6]. In that paper, the authors conclude that XPath is not expressive enough for some queries. As an alternative, the special query language LPATH is introduced, which extends XPath in three ways:
However, we note here that these extensions are unnecessary. As long as the surface order of nodes is explicitly encoded by the XML attributes begin and end, as in the Lassy treebank, the additional power is redundant. An LPATH query which requires that a node x immediately follows a node y can be encoded in XPath by requiring that the begin attribute of x equals the end attribute of y. The examples which motivate the introduction of the other two extensions can likewise be encoded in XPath by means of the begin and end attributes. For instance, the LPATH query
where an SMAIN node is selected which contains a right-aligned NP can be defined in XPath as:
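The alignment condition can be sketched in Python: an NP is right-aligned with an SMAIN node when both have the same end position. The function name and both toy sentences are hypothetical, and the NP is assumed to be allowed at any depth below the SMAIN node.

```python
import xml.etree.ElementTree as ET

# "Jan zag de man": the NP ends where the main sentence ends.
ALIGNED = """<alpino_ds>
  <node cat="top" rel="top" begin="0" end="4">
    <node cat="smain" rel="--" begin="0" end="4">
      <node rel="su" begin="0" end="1" word="Jan"/>
      <node rel="hd" begin="1" end="2" word="zag"/>
      <node cat="np" rel="obj1" begin="2" end="4">
        <node rel="det" begin="2" end="3" word="de"/>
        <node rel="hd" begin="3" end="4" word="man"/>
      </node>
    </node>
  </node>
</alpino_ds>"""

# "Jan zag de man gisteren": an adverb follows the NP.
NOT_ALIGNED = """<alpino_ds>
  <node cat="top" rel="top" begin="0" end="5">
    <node cat="smain" rel="--" begin="0" end="5">
      <node rel="su" begin="0" end="1" word="Jan"/>
      <node rel="hd" begin="1" end="2" word="zag"/>
      <node cat="np" rel="obj1" begin="2" end="4">
        <node rel="det" begin="2" end="3" word="de"/>
        <node rel="hd" begin="3" end="4" word="man"/>
      </node>
      <node rel="mod" begin="4" end="5" word="gisteren"/>
    </node>
  </node>
</alpino_ds>"""


def smains_with_right_aligned_np(tree):
    """SMAIN nodes containing an NP that ends at the same position."""
    hits = []
    for smain in tree.iter("node"):
        if smain.get("cat") != "smain":
            continue
        # Element.iter() also yields the element itself; that is harmless
        # here, because the smain node does not have cat="np".
        if any(d.get("cat") == "np" and d.get("end") == smain.get("end")
               for d in smain.iter("node")):
            hits.append(smain)
    return hits


aligned = ET.fromstring(ALIGNED)
not_aligned = ET.fromstring(NOT_ALIGNED)
print(len(smains_with_right_aligned_np(aligned)))
print(len(smains_with_right_aligned_np(not_aligned)))
```

No special alignment operator is needed; a plain equality test on the end attributes suffices.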
Based on these examples we conclude that there is no motivation for an ad-hoc special purpose extension of XPath, but that instead we can safely continue to use the XPath standard.
3.3 A Graphical User Interface for Lassy
Dact is a recent easy-to-use open-source tool, available for multiple platforms, to browse and search through Lassy treebanks. It provides graphical tree visualizations of the dependency structures of the treebank, full XPath search to select relevant dependency structures in a given corpus and to highlight the selected nodes of dependency structures, simple statistical operations to generate frequency lists for any attributes of selected nodes, and sentence-based outputs in several formats to display selected nodes, e.g. by bracketing the selected nodes, or by a keyword-in-context presentation. Dact can be downloaded from http://rug-compling.github.com/dact/.
For the XML processing, Dact supports both the libxml2 (http://xmlsoft.org) and the Oracle Berkeley DB XML (http://www.oracle.com) libraries. In the latter case, database technology is used to preprocess the corpus for faster query evaluation. In addition, the use of XPath 2.0 is supported. Furthermore, Dact provides macro expansion in XPath queries.
The availability of XPath 2.0 is useful in order to specify quantified queries (argued for in the context of the Lassy treebanks in [1]). As an example, consider the query in which we want to identify an NP which contains a VC complement (infinitival VP complement), in such a way that there is a noun which is preceded by the head of that NP, and which precedes the VC complement. In other words, in such a case there is an (extraposed) VC complement of a noun for which there is another noun which appears in between the noun and the VC complement. The query can be formulated as:
The availability of a macro facility is useful to build up more complicated queries in a transparent way. The following example illustrates this point. Macros are defined using the format name = string. A macro is used by putting the name between % %. The following set of macros defines the solution to the fifth problem posed in [6] in a more transparent manner. In order to define the minimal node which dominates an NP PP sequence, we first define the notion dominates an NP PP sequence, and then use it to state that the first common ancestor of an NP PP sequence is a node which is an ancestor of an NP PP sequence, but which does not contain a node which is an ancestor of an NP PP sequence.
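The idea of macro substitution can be sketched in a few lines of Python. This is not Dact's implementation, and the exact syntax of Dact macro files may differ: the sketch assumes one definition per line in the format name = string, references written as %name%, and macros that may refer to other macros. The function names and the three example macros are hypothetical.

```python
import re

# A reference to a macro: a name between percent signs (assumed syntax).
MACRO_REF = re.compile(r"%([A-Za-z_][A-Za-z0-9_]*)%")


def parse_macros(text):
    """Read 'name = string' definitions, one per line, into a dictionary."""
    macros = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split at the first "=" only; the body may contain further "=" signs.
        name, _, body = line.partition("=")
        macros[name.strip()] = body.strip()
    return macros


def expand(query, macros, depth=0):
    """Replace every %name% in the query by its (expanded) definition."""
    if depth > 20:
        raise ValueError("macro expansion too deep; is there a cycle?")

    def substitute(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError("undefined macro: " + name)
        return expand(macros[name], macros, depth + 1)

    return MACRO_REF.sub(substitute, query)


DEFINITIONS = """
pp = @cat="pp"
initial = @begin="0"
initial_pp = %pp% and %initial%
"""

macros = parse_macros(DEFINITIONS)
print(expand("//node[%initial_pp%]", macros))
```

Because macros may refer to other macros, a complex query can be assembled from small named conditions, which is the style of definition described above for the fifth problem.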