Reasoning About XML Constraints Based on XML-to-Relational Mappings

Niewerth, Matthias; Schwentick, Thomas

doi:10.1007/s00224-018-9846-5

Published: 20 February 2018

Volume 62, pages 1826–1879, (2018)
Theory of Computing Systems

Matthias Niewerth¹ &
Thomas Schwentick²

Abstract

The article introduces a simple framework for the specification of constraints for XML documents in which constraints are specified by (1) a mapping that extracts a relation from every XML document and (2) a relational constraint on the resulting relation. The mapping has to be generic with respect to the actual data values and the relational constraints can be of any kind. Besides giving a general undecidability result for first-order definable mappings and a general decidability result for MSO definable mappings for restricted functional dependencies, the article studies the complexity of the implication problem for XML constraints that are specified by tree pattern queries and functional dependencies. Furthermore, it highlights how other specification languages for XML constraints can be formulated in the framework.
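
The two-step idea (extract a relation, then check a relational constraint on it) can be illustrated with a minimal Python sketch. It is not the article's formal definition: the document shape (`school/course/student`), the attribute names `code`, `id` and `grade`, and the helper names `extract_relation` and `satisfies_fd` are all hypothetical, and the mapping only considers full matches (no null values). The mapping is generic in the sense that it copies data values without inspecting them.

```python
import xml.etree.ElementTree as ET


def extract_relation(xml_text):
    """Hypothetical mapping: one row per (course, student) pair in the document."""
    root = ET.fromstring(xml_text)
    rows = []
    for course in root.findall("course"):
        for student in course.findall("student"):
            # Data values are copied, never compared or interpreted.
            rows.append({
                "course": course.get("code"),
                "student": student.get("id"),
                "grade": student.get("grade"),
            })
    return rows


def satisfies_fd(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs on a list of dict rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # The first row with this key fixes the value; later rows must agree.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True
```

Under these assumptions, an XML constraint is simply the pair (mapping, FD): a document satisfies it when `satisfies_fd(extract_relation(doc), ["course", "student"], "grade")` is true.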

Notes

The exact framework of our reasoning algorithms will be introduced later.
As usual, we could allow a set of attributes instead of the single attribute B but such FDs can always be rewritten as a set of FDs with singleton attributes.
This includes non-injective mappings.
The precise kinds of schemas that we consider will be defined later on.
The name stems from the fact that these FDs very closely correspond to XML key constraints.
We note that Kot and White define the mapping of a tree pattern using unfolding of nested relations. The definition is equivalent to our definition using embeddings.
Duplicate-free tree patterns have been considered in [15].
This constraint ensures that all embeddings are maximal, as required in the definition of tree pattern based mappings.
The restriction π₁(B) ≠ ⊥ is not strictly necessary, but it will simplify some proofs. Note that at least one of π₁(B) and π₂(B) has to be different from ⊥ in any case and we can exchange π₁ and π₂.
We will not introduce join and inclusion dependencies formally.
We note without proof that the resulting tree can be made independent of the order in which violated dependencies are corrected by introducing a total order on the data values and always replacing the larger data value with the smaller one. Analogously, a total order on the nodes of the tree needs to be introduced.
We choose a different label because the meanings of ∗ and # are slightly different. The label ∗ is a true wildcard that can match any symbol. The label # is chosen for a node, which has some unique label, but we do not know yet which label this should be. The label ∗ only occurs in patterns while the label # only occurs in trees.
We use the index τ to distinguish references to components of the target dependency from references to dependencies from Σ.
That is, ν maps χ₂(B) and its root path to χ₁(B) and its root path and is the identity on all other nodes.
And we will see soon that there is more than one initial tree.
In cases (c) and (d), [t]_S is the actual counter-example.
In other words: all paths missing in U consist of at least three edges.
Since the label of a path edge below a node u indicates the label of the child of u on that path, the extension does not need to add another child of u with that label.
In the final algorithm, U has to satisfy some additional conditions.
Stated otherwise, a cluster is a connected component of p after removing all descendant edges.
We note that an induced witness pair is not necessarily a witness pair but rather a sub-witness pair.
We note that this is a key constraint, but also the only kind of FD with a set of two attributes in Σ.
We note that it does not matter, whether the set of row numbers is disjoint from the set of column numbers as they never interact with each other.
We note that this constraint connects the last tile of a row with the second of the next row and the last tile but one of one row with the first tile of the next row.
Here we use the assumption that valid tilings have width ≥ 4.
We assume a more general notion of a witness pair here. It is clear that Lemma 6.4 can be generalized for witness pairs, where the mapping is specified by an FO formula.
Remember, that we assume w.l.o.g. that π₁(B) ≠ ⊥.
We use this algorithm to simplify the correctness-proof. It is possible to directly compute the correct node of p such that one invocation of remove-null suffices.
If remove-null would not have been called, the resulting tree would have been t₄ implying that τ already follows from σ₂.
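
The note above on FDs with a set of attributes on the right-hand side can be illustrated by a small sketch. The helper names `split_fd`, `holds` and `holds_split` are hypothetical, and relations are modelled as lists of dictionaries, which is an assumption of this example rather than the article's formalism.

```python
def split_fd(lhs, rhs_set):
    """Rewrite X -> {B1, ..., Bk} as the list of FDs X -> B1, ..., X -> Bk."""
    return [(tuple(lhs), b) for b in sorted(rhs_set)]


def holds(rows, lhs, rhs_attrs):
    """Check the FD lhs -> rhs_attrs on a list of dict rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[b] for b in rhs_attrs)
        # Rows that agree on lhs must agree on every attribute in rhs_attrs.
        if seen.setdefault(key, val) != val:
            return False
    return True


def holds_split(rows, lhs, rhs_set):
    """Check the same FD via its singleton rewriting."""
    return all(holds(rows, x, [b]) for x, b in split_fd(lhs, rhs_set))
```

On any relation, `holds(rows, X, sorted(Bs))` and `holds_split(rows, X, Bs)` return the same answer, which is why restricting the right-hand side to a single attribute B loses no generality.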

References

[1] Arenas, M., Fan, W., Libkin, L.: On the complexity of verifying consistency of XML specifications. SIAM Journal on Computing 38(3), 841–880 (2008)
[2] Arenas, M., Barcelo, P., Libkin, L., Murlak, F.: Relational and XML data exchange. Synthesis Lectures on Data Management 2(1), 1–112 (2010)
[3] Arenas, M., Libkin, L.: A normal form for XML documents. ACM Trans. Database Syst. 29, 195–232 (2004)
[4] Atzeni, P., Morfuni, N.M.: Functional dependencies and constraints on null values in database relations. Inf. Control. 70(1), 1–31 (1986)
[5] Buneman, P., Davidson, S.B., Fan, W., Hara, C.S., Tan, W.C.: Reasoning about keys for XML. Inf. Syst. 28(8), 1037–1063 (2003)
[6] Buneman, P., Davidson, S., Fan, W., Hara, C., Tan, W.-C.: Keys for XML. Comput. Netw. 39(5), 473–487 (2002)
[7] Doner, J.: Tree acceptors and some of their applications. J. Comput. Syst. Sci. 4(5), 406–451 (1970)
[8] Gao, S., Sperberg-McQueen, C.M., Thompson, H., Mendelsohn, N., Beech, D., Maloney, M.: W3C XML Schema definition language (XSD) 1.1 Part 1: Structures. http://www.w3.org/TR/2012/REC-xmlschema11-1-20120405/ (2012)
[9] Hartmann, S., Link, S.: More functional dependencies for XML. In: Kalinichenko, L., Manthey, R., Thalheim, B., Wloka, U. (eds.) Advances in Databases and Information Systems, volume 2798 of LNCS, pp. 355–369. Springer, Berlin/Heidelberg (2003)
[10] Hartmann, S., Link, S., Schewe, K.-D.: Functional dependencies over XML documents with DTDs. Acta Cybernetica 17(1), 153–171 (2005)
[11] Hartmann, S., Link, S., Trinh, T.: Solving the implication problem for XML functional dependencies with properties. In: WoLLIC, pp. 161–175 (2010)
[12] Kot, L., White, W.M.: Characterization of the interaction of XML functional dependencies with DTDs. In: International Conference Database Theory (ICDT), pp. 119–133 (2007)
[13] Lee, M., Ling, T., Low, W.: Designing functional dependencies for XML. In: International Conference on Extending Database Technology (EDBT), pp. 145–158 (2002)
[14] Martens, W., Neven, F., Schwentick, T., Bex, G.J.: Expressiveness and complexity of XML schema. ACM Trans. Database Syst. 31(3), 770–813 (2006)
[15] Miklau, G., Suciu, D.: Containment and equivalence for a fragment of XPath. J. ACM 51(1), 2–45 (2004)
[16] Stockmeyer, L.: The complexity of decision problems in automata and logic. Ph.D. Thesis, MIT (1974)
[17] Thatcher, J.W., Wright, J.B.: Generalized finite automata theory with an application to a decision problem of second-order logic. Mathematical Systems Theory 2(1), 57–81 (1968)
[18] van Emde Boas, P.: The convenience of tilings. In: Sorbi, A. (ed.) Complexity, Logic and Recursion Theory, volume 187 of Lecture Notes in Pure and Applied Mathematics, pp. 331–363. Marcel Dekker Inc. (1997)
[19] Vincent, M.W., Liu, J., Liu, C.: Strong functional dependencies and their application to normal forms in XML. ACM Trans. Database Syst. 29(3), 445–462 (2004)

Acknowledgements

We acknowledge the financial support of the Future and Emerging Technologies (FET) program within the Seventh Framework Programme for Research of the European Commission, under the FET-Open grant agreement FOX, number FP7-ICT-233599.

The first author has been supported by grant number MA 4938/21 of the Deutsche Forschungsgemeinschaft (Emmy Noether Nachwuchsgruppe).

We would like to thank Gaetano Geck for fruitful discussions and valuable feedback.

Author information

Authors and Affiliations

Institute for Computer Science, Bayreuth University, 95440, Bayreuth, Germany
Matthias Niewerth
Department of Computer Science, TU Dortmund University, 44221, Dortmund, Germany
Thomas Schwentick

Corresponding author

Correspondence to Matthias Niewerth.

Appendix: Chasing with Fictitious Functional Dependencies

We now extend the chase algorithm so that it works in the presence of fictitious functional dependencies. To do so, we have to deal with two separate issues. First, the dependencies in Σ can be fictitious. In this case, we have the problem that π₂(B) might be null in Line 4 of Algorithm 1.^{Footnote 27} We note that Algorithm 2 cannot handle null values. Second, the target dependency τ can be fictitious.

To address both issues, we change the definition of the initial tree and extend Algorithm 1, as can be seen in Algorithm 4. The red parts (Lines 6–11 and additional function parameters) are added to deal with a fictitious target dependency and the blue part (Lines 4–5) is added to deal with fictitious dependencies in Σ. We note that removing the red and blue parts in Algorithm 4 gives exactly Algorithm 1. Algorithm 5 is identical to Algorithm 3, except that it computes a witness pair for t_τ ⊮ τ and provides the additional parameters to the chase function.

We first describe how we deal with fictitious dependencies in Σ, in particular with the case that π₂(B) is null in Line 4. We address this problem by adding a function that removes null values from embeddings: it adds nodes to the tree such that the null value is replaced by a node or data value. To this end, we define the function remove-null that takes as input a tree t, a pattern p, a partial embedding π of p in t and a node x of p such that π(x) = ⊥.

The function remove-null computes and returns a tree t^′ and a (partial) embedding π^′ of p in t^′ as follows. Let y be the lowest ancestor of x with π(y) ≠ ⊥, z be the child of y on the path [y, x] and v be π(y). The tree t^′ is derived from t by adding a copy of the path [z, x] to t below v, where each added node gets a fresh node identifier from \({\mathcal {V}}\) and a fresh data value from \({\mathcal {V}}\) and wildcard symbols are replaced by #. The embedding π^′ is derived from π by mapping the path [z, x] to the newly inserted path in t^′.
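
The description of remove-null can be sketched in Python as follows. This is only an illustration under several assumptions that are not from the article: patterns are given by parent pointers (`PatternNode`), trees by child lists (`TreeNode`), the embedding is a dictionary from variable names to tree nodes with `None` standing for ⊥, fresh identifiers and data values come from a hypothetical counter `_fresh`, and the tree is extended in place instead of being returned as a separate t^′.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional


@dataclass
class PatternNode:
    name: str                                 # variable name, e.g. "x_c"
    label: str                                # element label, or "*" for a wildcard
    parent: Optional["PatternNode"] = None


@dataclass
class TreeNode:
    node_id: int
    label: str
    data: str
    children: List["TreeNode"] = field(default_factory=list)


_fresh = count(100)  # hypothetical supply of fresh node ids / data values


def remove_null(pi, x):
    """Return a new embedding in which x and its unmapped ancestors are mapped.

    pi maps pattern variable names to TreeNode objects or None (for ⊥).
    The root of the pattern is assumed to be mapped.
    """
    # Collect the path [z, x]: walk up until the lowest mapped ancestor y.
    path = []
    node = x
    while pi[node.name] is None:
        path.append(node)
        node = node.parent
    attach = pi[node.name]                    # v = pi(y)
    new_pi = dict(pi)
    for p in reversed(path):                  # from z (child of y) down to x
        n = next(_fresh)
        label = "#" if p.label == "*" else p.label   # wildcards become '#'
        fresh = TreeNode(n, label, f"d{n}")
        attach.children.append(fresh)         # extend the tree below v
        new_pi[p.name] = fresh
        attach = fresh
    return new_pi
```

For a pattern /a/∗/c whose second and third nodes are mapped to ⊥, calling `remove_null` on the last node adds a two-node path labelled `#` and `c` below the image of the root and maps both pattern nodes to it.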

The intuitive idea behind this function is that π₁(B) must be identified with π₂(B) to satisfy σ. Therefore, either π₁(B) must become ⊥ (meaning π₂(y) has to be removed from the tree) or π₂(B) must become equal to π₁(B), which in particular means that it must become different from ⊥. We just note that removing nodes from the tree (as opposed to merging them) is a bad idea, because the initial tree is, in some sense, a minimal counterexample. Therefore, we add nodes to remove the nulls. For correctness we refer to the formal proof given below.

Note that in the case that B = y, where y is a node, the chase will merge the newly added node with π₁(B) afterwards, i.e., the effect of remove-null followed by the merge is equivalent to merging π₁(y) with π₂(y), where y is defined as the lowest ancestor of B that is not mapped to ⊥.

We now describe how to deal with the case, where the target dependency τ is fictitious. To understand the underlying problem, we give an abstract example.

Example A.1

We consider the mapping induced by the pattern p = /a〈x_a〉/b〈x_b〉/c〈x_c〉 and the dependencies \(\tau =(p,{x_{a} \overset {\emptyset }{\to } x_{c}})\), \(\sigma _{1}=(p,{x_{a} \overset {x_{c}}{\rightarrow } x_{c}})\), and \(\sigma _{2}=(p,{x_{a} \overset {\emptyset }{\to } x_{b}})\). We note that σ₁ is not strictly fictitious, as it only applies to embeddings where no node is mapped to ⊥. It can be easily seen that σ₁ ⊮ τ and σ₂ ⊮ τ, as the tree t₁ in Fig. 5 is a counter-example to σ₁ ⊧ τ and t₃ is a counter-example to σ₂ ⊧ τ. Later, we will see that {σ₁, σ₂} ⊧ τ.

Starting with Σ = {σ₁} and τ, Algorithm 3 will incorrectly report that σ₁ ⊧ τ, as it starts with the initial tree t₂, merges nodes v₃ and v₅ to satisfy σ₁ and recursively merges nodes v₂ and v₄ to restore the tree structure. The resulting tree t₄ satisfies σ₁ and τ and therefore Algorithm 3 erroneously reports that σ₁ ⊧ τ.

The intuitive reason for the incorrect behavior is that Algorithm 3 does not consider trees, where the witness-pair for τ involves null values, i.e., τ is treated as a non-fictitious dependency.

It is easy to see that starting with t₁ as initial tree would allow Algorithm 1 to correctly decide that σ₁ ⊮ τ. However, simply adapting the initial tree will not work, as can be seen by the dependency σ₂. Starting with initial tree t₁ and Σ = {σ₂}, Algorithm 1 will merge nodes v₂ and v₄ to satisfy σ₂. The resulting tree t₄ again satisfies τ, which leads to the (again incorrect) result σ₂ ⊧ τ. The intuitive reason now is, that with t₁ as initial tree, Algorithm 1 no longer considers trees where the witness-pair for τ does not use null values.

A possible solution would be to run Algorithm 1 starting from both initial trees and output “yes” if and only if both runs output “yes”. We note without proof that this solution could be generalized to arbitrary FFDs, resulting in a number of initial trees that is at most linear in the depth of p, each with a different number of nodes from p mapped to null in the witness pair. Instead of this approach, we take a more elegant one: we use an initial tree in which the witness pair has as many null values as possible, and we extend Algorithm 1 such that it adds more nodes to the tree when it becomes apparent that the chosen initial tree will result in a final tree satisfying τ.

Therefore, we define the initial tree of a fictitious functional dependency \(\tau =(p,{Y \overset {Z}{\to } B})\) as follows: Let t₁ and t₂ be copies of p, where t₁ only contains the nodes referenced in Y, Z and B together with their ancestors and t₂ contains only the nodes referenced by Y and Z together with their ancestors. Let again π₁ and π₂ be the canonical embeddings of p in t₁ and t₂, respectively. As before, t₁ and t₂ contain node ids from \(\mathcal {V}\) instead of variables from X, all data values in t₁ and t₂ are distinct and wildcards ∗ are replaced by #. The tree t_τ again results by merging the roots of t₁ and t₂ and all pairs (π₁(z), π₂(z)), for which z occurs as a node term in Y and it identifies all pairs of data values (π₁(z).@, π₂(z).@), for which z.@ is a data term in Y. By applying the node merges the embeddings π₁ and π₂ yield two embeddings \({\pi }_{1}^{\prime }\) and \({\pi }_{2}^{\prime }\) such that \(({\pi }_{1}^{\prime }, {\pi }_{2}^{\prime })\) is a witness pair for t_τ and τ.

Furthermore, we add the red parts to Algorithm 4, which take care of adding additional nodes to the tree when necessary, i.e., when a merge occurs that would result in ρ₂ not being a maximal embedding. In this case x (as computed in Line 9) cannot be mapped to ⊥ any more and the function remove-null is invoked to add a new node for mapping x. We loop using the goto statement in Line 12 as it might be necessary to remove further nulls.^{Footnote 28}

Coming back to Example A.1, we want to give the chase sequence for Σ = {σ₁, σ₂} and τ. Algorithm 5 starts with computing the initial tree t₁ (in Fig. 5). As σ₂ is not satisfied, the nodes v₂ and v₄ need to be merged. Prior to this merge, remove-null is called in Line 11, resulting in tree t₂. The chase continues with merging v₂ and v₄ resulting in tree t₃.^{Footnote 29} Now σ₁ is violated resulting in a merge of v₃ and v₅ and the final tree t₄. As t₄ ⊧ τ, Algorithm 5 reports that {σ₁, σ₂}⊧τ. As we will see in Proposition A.2, this result is correct.

Proposition A.2

For every instance I = (Σ, τ) of XC-Imp(TP[/, ∗], FFD), Algorithm 5 terminates and answers “Yes” if and only if Σ ⊧ τ.

Proof

The proof follows an outline very similar to that of the proof of Proposition 7.2. The main difference is that we consider calls to remove-null in Lines 5 and 11 as separate chase steps in our induction.

Let I = (Σ, τ) be an instance of XC-Imp(TP[/, ∗], FFD) with \(\tau =(p_{\tau },{Y_{\tau } \overset {Z_{\tau }}{\longrightarrow } B_{\tau }})\). Clearly, if Algorithm 5 terminates and yields a tree t, no constraint from Σ is violated in t. Thus, if the output of Algorithm 5 is “No” (and thus t ⊮ τ), t is a counter-example for Σ ⊧ τ and thus, the answer “No” is always correct. The proof that “Yes”-answers are also correct again uses tree homomorphisms as defined in the proof of Proposition 7.2.

Let t_i be the tree after i chase steps, where a chase step is either a call to remove-null or a call to merge in Algorithm 4. This differs from the proof of Proposition 7.2, where only calls to merge were considered chase steps, as there were no calls to remove-null.

Claim A.3 is identical to Claim 7.3, except that it holds for fictitious dependencies and uses the updated definition of a chase step.

Claim A.3

If there is a counter-example tree t^′ for Σ ⊧ τ with witness pair \((\rho ^{1}_{t^{\prime }},\rho ^{2}_{t^{\prime }})\) for τ and t^′, then the tree chase on input (Σ, τ) does not fail and for every chase step i it holds that

(i) there exists a tree homomorphism \(t_{i} \preceq _{\theta _{i}} t^{\prime }\);
(ii) there exists a witness pair \(({\rho }_{i}^{1},{\rho }_{i}^{2})\) for t_i ⊮ τ; and
(iii) \(\theta _{i}({\rho }_{i}^{j}(x))=\rho _{t^{\prime }}^{j}(x)\) for j ∈ {1, 2} and all terms x of p_τ with \({\rho }_{i}^{j}(x)\neq \bot \).

Again, the claim immediately yields the correctness of the algorithm.

The proof of Claim A.3 is by induction on the number of chase steps. We distinguish 3 cases for the induction step depending on the type of the chase step: calls to remove-null in Line 5, calls to remove-null in Line 11 and calls to merge (in Line 15). We note that we can show these cases in any order.

We recall that we can safely assume that for each witness pair (π₁, π₂) considered in this proof, π₁ is a full embedding and the only nodes mapped to ⊥ in π₂ are on the path from the root to B, where B is the node or data term on the right-hand side of the corresponding dependency.

The induction base for t₀ = t_τ can be shown exactly as in the proof of Proposition 7.2. The same holds true for the induction step in the case of calls to merge. It should be noted that in calls to merge, ρ₂ is always a full embedding. This is ensured by the call to remove-null in Line 5 that precedes the call to merge in Algorithm 4. It remains to show the induction step in the case of calls to remove-null.

We first show the case where remove-null is called in Line 11 of Algorithm 4. Let v be the node added in remove-null and x_τ be the corresponding term in the pattern p_τ. We define θ_{i+1} to map v to \({\rho }_{t^{\prime }}^{2} (x_{\tau })\) and to be equal to θ_i on all other nodes. By definition of embeddings, \({\rho }_{t^{\prime }}(x_{\tau })\) has to be ⊥ or a child of \(\rho _{t^{\prime }} (\text {parent}(x_{\tau }))=\theta _{i}({\rho }_{i}^{2}(\text {parent}(x_{\tau })))\). However, \(\rho _{t^{\prime }}(x_{\tau })\) cannot be ⊥, as we can conclude from (7.1) from Claim 7.3 that \(\theta _{i}({\rho }_{i}^{1}(\text {parent}(x_{\tau }))) = \theta _{i}({\rho }_{i}^{2} (\text {parent}(x_{\tau })))\). Observe that B is a node term and \({\rho }_{i}^{1}(\text {parent}(x_{\tau }))\) and \({\rho }_{i}^{2}(\text {parent}(x_{\tau }))\) are ancestors (in the same level of the tree) of \({\chi _{i}^{1}}(B)\) and \({{\chi }_{i}^{2}}(B)\). This shows that θ_{i+1} is a valid tree homomorphism (and therefore (i) is satisfied). We define \({\rho }_{i + 1}^{1} = {\rho }_{i}^{1}\) and \({\rho }_{i + 1}^{2}\) to map x_τ to v and to be equal to \({\rho }_{i}^{2}\) on all other variables. It is straightforward to show that \(({\rho }_{i + 1}^{1},{\rho }_{i + 1}^{2})\) is a witness pair for t_{i+1} ⊮ τ, satisfying (ii). In particular, \({\rho }_{i + 1}^{2}\) is a maximal partial embedding, as \({\rho }_{i}^{2}\) is a maximal partial embedding and all nodes that are mapped to ⊥ in \({\rho }_{i + 1}^{2}\) are below x_τ, which is mapped to a leaf of the tree. From the definition of θ_{i+1} and \(({\rho }_{i + 1}^{1},{\rho }_{i + 1}^{2})\), it follows that (iii) still holds, as the only node added to the image of \(({\rho }_{i + 1}^{1},{\rho }_{i + 1}^{2})\) is mapped accordingly in θ_{i+1}.

Finally, we discuss the case where remove-null is called in Line 5 of Algorithm 4. Let [v, w] be the path added to t in remove-null and [x, y] be the corresponding path in the pattern p. Let y_τ be a variable in p_τ with the same ancestor string as y. Such a variable exists, as χ₁(y) is in the image of \({\rho }_{i}^{1}\). We define x_τ such that \([x_{\tau },y_{\tau }]\) has the same length and labels as [x, y]. Again, we can conclude from (7.1) that \(\rho _{t^{\prime }}(y_{\tau })\) cannot be ⊥, as \(\rho _{t^{\prime }}^{2}(\text {parent}(x_{\tau }))=\theta _{i}({\chi _{i}^{2}}(\text {parent}(x)))\). We define θ_{i+1} such that it is equal to θ_i for all nodes of t_i and that it maps the nodes [v, w] to \([\rho _{t^{\prime }}^{2}(x_{\tau }),\rho _{t^{\prime }}^{2}(y_{\tau })]\). Similarly to the last case, it is easy to verify that θ_{i+1} is a valid tree homomorphism and therefore we can conclude (i). We define \({\rho }_{i + 1}^{1}={\rho }_{i}^{1}\) and \({\rho }_{i + 1}^{2}\) to map the path [x, y] to the path [v, w] and to be equal to \({\rho }_{i}^{2}\) on all other nodes. Again, it is straightforward to show that \(({\rho }_{i + 1}^{1},{\rho }_{i + 1}^{2})\) is a witness pair (showing (ii)) and that θ_{i+1} is defined in a way such that (iii) holds. □

Using Proposition A.2, the results in Theorem 7.9 can be shown in the same way as the results in Corollary 7.4 and Propositions 7.7 and 7.8. In particular, the extension of the chase in the presence of sDTDs, which we have shown to work in Proposition 7.5, does not interfere with the addition of null values.

About this article

Cite this article

Niewerth, M., Schwentick, T. Reasoning About XML Constraints Based on XML-to-Relational Mappings. Theory Comput Syst 62, 1826–1879 (2018). https://doi.org/10.1007/s00224-018-9846-5

Issue Date: November 2018
DOI: https://doi.org/10.1007/s00224-018-9846-5
