In this section, we describe a reduction from the satisfiability problem of Mso-D to the emptiness problem of Sdtas, when the Mso-D formula \(\varphi \) is a sentence in the following form:
$$\begin{aligned} \varphi \,=\, \exists x_1, \ldots , x_n \,.\,\forall y_1, \ldots , y_m \,.\,\theta , \end{aligned}\tag{2}$$
where each data constraint of the formula \(\theta \), say \(r(f_1(t_1),\ldots ,f_k(t_k))\), satisfies one of the following:
- \(r\) is unary (i.e., \(k=1\)), or
- \(r\) depends only on variables \(x_1, \ldots , x_n\) and at most one of the variables \(y_1, \ldots , y_m\), i.e., \( Var (t_1, \ldots , t_k) \subseteq \{ x_1, \ldots , x_n, y_i\}\), for some \(i\in [m]\).
Notice that \(\theta \) may contain other quantifiers, but the additional quantified variables can occur only inside unary data constraints. Moreover, it is easy to see that this fragment is closed under positive Boolean combinations (i.e., conjunctions and disjunctions).
This fragment strictly includes the Mso logic with data defined in [11] for data words, which only allows unary data constraints. Below we show that the added expressivity can be used to define and verify properties of a variety of data structures, including those from Examples 4 and 5, and infinite-state games.
In our reduction, we first construct a standard finite-state tree automaton over a finite alphabet (Sect. 6.1), which we then convert to an Sdta (Sect. 6.2).
6.1 Building the Enumerated Tree Automaton
The first step in our reduction from Mso-D to Sdtas is to convert the Mso-D formula \(\varphi \) of type (2) into a formula \(\varphi '\) in standard Mso by abstracting away all data constraints. We distinguish two types of data constraints. Global constraints refer only to the data of the existentially quantified variables \(x_i\); on a given data tree, once the interpretation of those variables is chosen, each global constraint is either \( true \) or \( false \): it is a global property of the tree. Local constraints, instead, additionally refer to a variable, say z, that is not one of \(\{x_1 ,\ldots , x_n\}\); even if the interpretation of \(\{x_1 ,\ldots , x_n\}\) is fixed, the truth of such constraints depends on the interpretation of z. Accordingly, we replace each data constraint in \(\theta \), say \(r(f_1(t_1),\ldots ,f_k(t_k))\), as follows:
- Global Constraints. If \( Var (t_1, \ldots , t_k) \subseteq \{ x_1, \ldots , x_n \}\), we replace all occurrences of the data constraint with a new propositional variable \(p\). We denote by \(p_1, \ldots , p_h\) all such propositional variables.
- Local Constraints. Otherwise, there is a unique variable \(z \in Var (t_1, \ldots , t_k)\) that is not one of \(\{ x_1,\ldots ,x_n \}\). We then introduce a new free second-order variable \(C\), and replace each occurrence of the above data constraint with the clause \(z \in C\). We denote by \(C_1,\ldots ,C_l\) all the second-order variables introduced in this process.
Besides the above substitutions, in the resulting Mso formula we leave variables \(x_1,\ldots ,x_n\) free, so that the models of the formula will carry the interpretation of those variables as extra bits in the node labels (recall the discussion on extended models in Sect. 3). We thus obtain the following Mso formula:
Since \(\varphi '\) has no data constraints, we can take its data signature to be empty.
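The abstraction step can be illustrated with a minimal Python sketch. It assumes that every term in a data constraint is a plain variable, and it uses a hypothetical `DataConstraint` record and hypothetical naming schemes `p1, p2, ...` and `C1, C2, ...` that are not part of the original construction. The sketch only classifies constraints as global or local and produces the atom that replaces each of them; it does not build the whole Mso formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataConstraint:
    """A data constraint r(f_1(t_1), ..., f_k(t_k)) whose terms are variables."""
    relation: str
    args: tuple  # pairs (field_name, variable_name)


def abstract_constraints(constraints, existential_vars):
    """Map each distinct constraint to the Mso atom that replaces it."""
    existential = set(existential_vars)
    replacement = {}
    n_props = n_sets = 0
    for c in constraints:
        if c in replacement:  # all occurrences share the same replacement
            continue
        others = {var for _, var in c.args} - existential
        if not others:
            # Global constraint: fresh propositional variable.
            n_props += 1
            replacement[c] = f"p{n_props}"
        elif len(others) == 1:
            # Local constraint: fresh second-order variable C, atom "z in C".
            (z,) = others
            n_sets += 1
            replacement[c] = f"{z} in C{n_sets}"
        else:
            raise ValueError(f"constraint outside the fragment: {c}")
    return replacement


# Illustrative constraints: val(x1) < val(x2) and val(x1) < val(y).
glb = DataConstraint("<", (("val", "x1"), ("val", "x2")))
loc = DataConstraint("<", (("val", "x1"), ("val", "y")))
table = abstract_constraints([glb, loc, glb], ["x1", "x2"])
```

Here `table` maps the global constraint to a proposition and the local one to a membership atom on `y`. A constraint mentioning two variables outside \(\{x_1,\ldots ,x_n\}\) is rejected, since it falls outside the fragment (2).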
Example 7
Consider the formula \(\psi _{\mathrm {bst}}\) from Example 4 that defines Bsts using auxiliary data \( min \) and \( max \). Since it uses a single universal quantifier, it belongs to the syntactic fragment (2). For the sake of simplicity, consider a stronger formula \(\psi '_{\mathrm {bst}}\) forcing internal nodes to have two children (a.k.a. a full Bst):
Now, consider the following true property of full Bsts: the successor of an internal node is the left-most leaf in its right sub-tree. The following formula states the opposite of that property:
It is easy to see that \(\psi '_\mathrm {bst}\wedge \psi _\mathrm {succ}\) is equivalent to the following formula \(\psi \) in our fragment:
$$\begin{aligned} \exists x_1,x_2,x_3 \,.\,\forall y \,.\,\Big (&\big ( val (x_1)< val (x_2) < val (x_3) \big ) \,\wedge \lnot leaf (x_1) \wedge leaf (x_3)\\&\wedge left\_only\_path (x_1. right , x_3)\wedge full\_tree (y)\\&\wedge \big ( y\!\not =\!y. left \,\, ? \,\, min (y)\!=\! min (y. left ) \wedge max (y. left )\!<\! val (y)\,\, :\,\, min (y)\!=\! val (y) \big )\\&\wedge \big ( y\!\not =\!y. right \,\, ? \,\, max (y)\!=\! max (y. right )\wedge min (y. right )\!>\! val (y)\, :\, max (y)\!=\! val (y)\big )\Big ). \end{aligned}$$
The conversion outlined above turns \(\psi \) into the following Mso formula:
$$\begin{aligned} \forall y \,.\,&\Big (p_1 \wedge \lnot leaf (x_1) \wedge leaf (x_3) \wedge left\_only\_path (x_1. right , x_3) \wedge full\_tree (y) \\&\,\,\,\,\,\,\,\,\,\wedge \big ( y \not = y. left \,?\, y \in C_1 : y \in C_2 \big ) \wedge \big ( y \not = y. right \,?\, y \in C_3 : y \in C_4 \big )\Big ), \end{aligned}$$
where proposition \(p_1\) corresponds to the global constraint \( val (x_1)< val (x_2) < val (x_3)\), the second-order variable \(C_1\) corresponds to the local constraint \( min (y) = min (y. left ) \,\wedge \, max (y. left ) < val (y),\) and variables \(C_2\) – \(C_4\) correspond to the other data constraints in \(\psi _\mathrm {bst}\). \(\square \)
We now apply the standard Mso construction to \(\varphi '\), obtaining a bottom-up finite-state tree automaton \(A_{\varphi '}\) on the alphabet \(\varSigma = \{ 0, 1 \}^{n+h+l}\) that accepts all finite trees representing interpretations that satisfy \(\varphi '\). The exponent \(n+h+l\) is the total number of free variables in \(\varphi '\): n first-order variables \(x_i\), h propositional variables \(p_i\) (corresponding to global constraints), and l second-order variables \(C_i\) (corresponding to local constraints). We recall the formal statement of this construction below; for more details, see [40] and [8].
Theorem 3
For all Mso formulas \(\varphi '\) on the empty data signature, with free first-order variables \(x_1,\ldots ,x_n\), propositional variables \(p_1,\ldots ,p_h\), and second-order variables \(C_1,\ldots ,C_l\), there is a deterministic bottom-up tree automaton on the alphabet \(\{ 0, 1 \}^{n+h+l}\) whose language consists of all extended trees T such that \(T \models \varphi '\).
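To make the notion of a deterministic bottom-up tree automaton concrete, here is a minimal Python sketch. It is not the automaton produced by the Mso construction: the example automaton, its state names, the `"empty"` marker for a missing child, and the tuple encoding of trees are all assumptions made for illustration. The automaton works on the alphabet \(\{0,1\}^1\) and accepts exactly the trees in which a single node carries the bit 1, which is the well-formedness condition for the encoding of one first-order variable.

```python
# A tree is either None (missing child) or a triple (label, left, right).
STATES = ("zero", "one")   # number of marked nodes seen so far
FINAL = {"one"}


def build_delta():
    """Transitions (left_state, right_state, label) -> state."""
    marks = {"empty": 0, "zero": 0, "one": 1}
    delta = {}
    for s_l in marks:
        for s_r in marks:
            for bit in (0, 1):
                total = marks[s_l] + marks[s_r] + bit
                if total <= 1:  # more than one mark: no transition
                    delta[(s_l, s_r, (bit,))] = STATES[total]
    return delta


def run(tree, delta):
    """State reached at the root, or None if the run gets stuck."""
    if tree is None:
        return "empty"
    label, left, right = tree
    # A stuck child yields None, for which delta has no entry either.
    return delta.get((run(left, delta), run(right, delta), label))


def accepts(tree, delta):
    return run(tree, delta) in FINAL


DELTA = build_delta()
leaf0 = ((0,), None, None)
leaf1 = ((1,), None, None)
one_mark = ((0,), leaf1, leaf0)
two_marks = ((1,), leaf1, leaf0)
no_mark = ((0,), leaf0, leaf0)
```

Because `delta` is a dictionary keyed by the children's states and the node label, each node has at most one successor state, which is what makes the run deterministic.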
Simplifying Assumptions. To simplify the presentation of the following constructions, we make two assumptions. First, we assume that all terms appearing in data constraints are variables, and not composite terms like \(x. left . right \). Dropping this assumption is technically simple, and we omit the details due to space constraints. Second, we assume that all connecting functions f appearing in data constraints correspond to fields in \(\mathcal {S}\). Sentences that satisfy the second assumption have a unique interpretation \(\mathbb {I}\), because they have no free variables and the connecting functions must be interpreted as the functions extracting the corresponding field from each node. We discuss how to remove this restriction in Sect. 6.3.
We now establish a relation between \(\varSigma \)-trees accepted by \(A_{\varphi '}\), and data trees on the data signature \(\mathcal {S}\) defined by \(\varphi \). Denote by \((a_1,\ldots ,a_n, b_1,\ldots ,b_h, c_1,\ldots ,c_l)\) the generic element of \(\varSigma \). Given a \(\varSigma \)-tree \((T, \sigma )\) and a variable \(x_i\) in \(\varphi '\), we define \( node (\sigma , x_i)\) to be the unique node \(u \in T\) such that the \(a_i\) component of \(\sigma (u)\) is 1. In words, the function \( node \) picks the position in the tree where the \(\varSigma \)-tree activates the bit \(a_i\).
Definition 2
Consider an Mso-D sentence \(\varphi \) of the form (2) on the data signature \(\mathcal {S}\), and let \(\mathbb {I}\) be its unique interpretation. We say that a \(\varSigma \)-tree \((T, \sigma )\) and an \(\mathcal {S}\)-tree \((T, \lambda )\) are consistent iff for all nodes \(u \in T\) the following hold:
1. For all \(i \in [h]\), let \(r^\mathrm {glb}_i\big ( f_i^1(\alpha _i^1), \ldots , f_i^{j_i}(\alpha _i^{j_i})\big )\) be the global constraint from \(\varphi \) corresponding to the propositional variable \(p_i\) from \(\varphi '\). Recall that under the simplifying assumptions each \(\alpha _i^j\) is one of \(x_1,\ldots ,x_n\), and each \(f_i^j\) is one of the names of the fields of \(\mathcal {S}\). Then, \(\sigma (u)(b_i) = 1\) iff the following holds
$$\mathbb {D}(r^\mathrm {glb}_i)\big (\, \mathbb {I}(f_i^1)( node (\sigma , \alpha _i^1)), \,\ldots ,\, \mathbb {I}(f_i^{j_i})( node (\sigma , \alpha _i^{j_i})) \,\big ).$$
2. For all \(i \in [l]\), let \(r^\mathrm {loc}_i\big (g_i^1(\beta _i^1), \ldots , g_i^{k_i}(\beta _i^{k_i}), g_i(z_i)\big )\) be the local constraint from \(\varphi \) corresponding to the second-order variable \(C_i\) from \(\varphi '\). Recall that each \(\beta _i^j\) is one of \(x_1,\ldots ,x_n\), and each \(g_i^j\) (as well as \(g_i\)) is one of the names of the fields of \(\mathcal {S}\). Then, \(\sigma (u)(c_i) = 1\) iff the following holds
$$ \mathbb {D}(r^\mathrm {loc}_i)\big (\, \mathbb {I}(g_i^1)( node (\sigma , \beta _i^1)) \, , \,\ldots \, , \, \mathbb {I}(g_i^{k_i})( node (\sigma , \beta _i^{k_i}))\, , \, \mathbb {I}(g_i)(\lambda (u)) \,\big ). $$
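The consistency condition of Definition 2 can be checked mechanically, as in the following Python sketch. The encoding is an assumption made for illustration: both labelings are dictionaries indexed by the same node names, relations are Python predicates, variables \(x_i\) are referred to by their 0-based index, and, following the simplifying assumptions, each connecting function simply reads the field of the same name from the node's data.

```python
import operator


def node(sigma, i):
    """The unique node whose i-th bit (0-based) is 1."""
    marked = [u for u, bits in sigma.items() if bits[i] == 1]
    if len(marked) != 1:
        raise ValueError(f"bit {i} must be set in exactly one node")
    return marked[0]


def consistent(sigma, lam, n, global_cs, local_cs):
    """Check conditions 1 and 2 of the definition at every node.

    global_cs: list of (relation, [(field, x_index), ...])
    local_cs:  list of (relation, [(field, x_index), ...], field_of_z)
    """
    h = len(global_cs)
    for u, bits in sigma.items():
        for i, (rel, args) in enumerate(global_cs):
            holds = rel(*(lam[node(sigma, x)][f] for f, x in args))
            if (bits[n + i] == 1) != holds:
                return False
        for i, (rel, args, g) in enumerate(local_cs):
            values = [lam[node(sigma, x)][f] for f, x in args]
            holds = rel(*values, lam[u][g])
            if (bits[n + h + i] == 1) != holds:
                return False
    return True


# Illustrative data tree with three nodes and a single field "val".
lam = {"root": {"val": 5}, "l": {"val": 3}, "r": {"val": 8}}
global_cs = [(operator.lt, [("val", 0), ("val", 1)])]   # val(x1) < val(x2)
local_cs = [(operator.lt, [("val", 0)], "val")]         # val(x1) < val(z)
# Labels are (a_1, a_2, b_1, c_1): x1 is node "l", x2 is node "r".
sigma = {"root": (0, 0, 1, 1), "l": (1, 0, 1, 0), "r": (0, 1, 1, 1)}
sigma_bad = dict(sigma, r=(0, 1, 1, 0))
```

In this example the global constraint holds (3 < 8), so \(b_1 = 1\) everywhere, while \(c_1\) is 1 exactly at the nodes whose value exceeds 3. Flipping \(c_1\) at node `r`, as in `sigma_bad`, breaks consistency.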
The following result states the fundamental relationship between \(\varphi \) and \(A_{\varphi '}\).
Theorem 4
Let \(\varphi \) be an Mso-D sentence of the form (2) on the data signature \(\mathcal {S}\), and let \(A_{\varphi '}\) be the corresponding tree automaton described above. For all data trees \((T, \lambda )\) with data signature \(\mathcal {S}\), the following are equivalent:
1. it holds \(T^\lambda , \mathbb {I}\models \varphi \), where \(\mathbb {I}\) is the unique interpretation of \(\varphi \);
2. there exists a tree \((T,\sigma ) \in L(A_{\varphi '})\) s.t. \((T,\lambda )\) and \((T,\sigma )\) are consistent.
6.2 Building the Symbolic Data Tree Automaton
We now convert the tree automaton from the previous section into an Sdta that accepts all and only the data trees satisfying the original Mso-D formula \(\varphi \).
Intuitively, the Sdta mimics the behavior of the tree automaton, and in doing so, it enforces the data constraints contained in \(\varphi \). The information about which constraints should be true and which should be false at every node is encoded in the alphabet \(\varSigma = \{ 0, 1 \}^{n+h+l}\) of the tree automaton. In detail, if \((a_1,\ldots ,a_n, b_1,\ldots ,b_h, c_1,\ldots ,c_l)\) is a generic symbol from the alphabet, the \(b_i\)’s encode the truth value of the global constraints, and the \(c_i\)’s encode the truth value of the local constraints. However, the data on which to evaluate those constraints comes from different sources. The global constraints are evaluated only on the guessed data for the existentially quantified variables \(x_1,\ldots ,x_n\), whereas the local constraints also access the data of the current node.
Finally, the \(a_i\) component of the alphabet encodes the actual position of each \(x_i\) in the current tree (i.e., \(a_i\) is 1 only in the node that is the interpretation of \(x_i\)). So, when \(a_i=1\) the symbolic automaton checks that the guessed data evaluation for \(x_i\) corresponds to the data in the current node.
Let \(A_{\varphi '} = (\varSigma , Q, F, \varDelta )\) be the tree automaton from Sect. 6.1. We now define the Sdta \(\mathcal {A}_\varphi = (\mathcal {S}, \mathcal {S}^Q, \psi ^F, \varPsi ^\varDelta )\). First, notice that the alphabet data signature \(\mathcal {S}\) coincides with that of the original Mso-D formula. We then set the state data signature \(\mathcal {S}^Q\) to \(\{ state : Q \} \cup \{ id ^i: type \mid ( id : type ) \in \mathcal {S}, i=1,\ldots ,n \}\), i.e., \(\mathcal {S}^Q\) contains an enumerated data field representing the state of the tree automaton \(A_{\varphi '}\), and n copies of each data field in \(\mathcal {S}\). These copies are used to store the guessed data evaluations for the existentially quantified variables \(x_i\) from (2). For a symbolic state \(q \in L(\mathcal {S}^Q)\) and \(i\in [n]\), we denote by \(q[x_i]\) the i-th projection of q on \(\mathcal {S}\), i.e., the evaluation that assigns to each field \( id \) in \(\mathcal {S}\) the value \(q. id ^i\). The acceptance constraint \(\psi ^F(q)\) is simply defined as \(q. state \in F\).
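As an illustration of the state data signature, the following Python sketch builds \(\mathcal {S}^Q\) from a data signature and implements the projection \(q[x_i]\). The dictionary encoding of signatures and the naming scheme `"id^i"` for the copies are assumptions made for this example; the sketch also assumes that no field of \(\mathcal {S}\) is itself called `state`.

```python
def state_signature(signature, states, n):
    """Build S^Q: one enumerated field 'state' plus n copies of each field."""
    sq = {"state": tuple(states)}
    for i in range(1, n + 1):
        for name, typ in signature.items():
            sq[f"{name}^{i}"] = typ
    return sq


def project(q, signature, i):
    """The projection q[x_i]: an evaluation of S read from the i-th copy."""
    return {name: q[f"{name}^{i}"] for name in signature}


def accepting(q, final_states):
    """The acceptance constraint: q.state belongs to F."""
    return q["state"] in final_states


# Illustrative signature with the three fields used for Bsts.
S = {"val": "int", "min": "int", "max": "int"}
SQ = state_signature(S, ["s0", "s1"], n=2)

# A symbolic state storing guessed data for x_1 and x_2.
q = {"state": "s1",
     "val^1": 3, "min^1": 3, "max^1": 3,
     "val^2": 8, "min^2": 6, "max^2": 9}
```

With three fields and \(n = 2\), the resulting signature has seven fields: `state` and two copies of each of `val`, `min`, and `max`.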
Regarding the transition constraints \(\varPsi ^\varDelta \), we will focus only on the case of nodes with two children, since the other cases are similar. Let \((s_l, s_r, a, s)\) be a transition in \(A_{\varphi '}\), where \(a = (a_1,\ldots ,a_n, b_1,\ldots ,b_h, c_1,\ldots ,c_l) \in \varSigma .\) We add the following implicant to the transition constraint \(\psi _{ lr }\):
$$\begin{aligned}&\Big \{\;\;\;\;\;\;\;\;q_l. state = s_l \wedge q_r. state = s_r \wedge q. state = s \end{aligned}\tag{4a}$$
$$\begin{aligned}&\;\;\;\wedge \;\bigwedge _{i \in [n]} \big ( q[x_i] = q_l[x_i] \wedge q[x_i] = q_r[x_i] \big ) \quad \wedge \quad \bigwedge _{\{ i \,\mid \, a_i = 1\}} \big ( q[x_i] = \sigma \big ) \end{aligned}\tag{4b}$$
$$\begin{aligned}&\;\;\;\wedge \;\bigwedge _{i \in [h]} \Big [ (b_i = 1) \leftrightarrow \mathbb {D}(r^\mathrm {glb}_i)\big ( q[\alpha _i^1].f_i^1, \ldots , q[\alpha _i^{j_i}].f_i^{j_i} \big ) \Big ] \end{aligned}\tag{4c}$$
$$\begin{aligned}&\;\;\;\wedge \;\bigwedge _{i \in [l]} \Big [ (c_i = 1) \leftrightarrow \mathbb {D}(r^\mathrm {loc}_i)\big ( q[\beta _i^1].g_i^1,\ldots , q[\beta _i^{k_i}].g_i^{k_i}, \, \sigma .g_i \big ) \Big ]\;\;\;\Big \} \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \implies \;\;\psi _{ lr }(q_l, q_r, \sigma ,q) \,. \end{aligned}\tag{4d}$$
The above conjuncts can be explained as follows. Conjunct (4a) mimics the state change in the discrete transition. The first part of (4b) states that the n copies of the data fields held by the symbolic automaton are uniform over the whole tree; the second part additionally states that the i-th copy of the data fields coincides with the data \(\sigma \) in the unique node where the discrete automaton prescribes \(a_i = 1\). Conjunct (4c) requires the i-th global constraint \(r^\mathrm {glb}_i\) to hold in all nodes where the discrete automaton prescribes \(b_i = 1\), and to fail in all nodes where it prescribes \(b_i = 0\). Finally, (4d) does the same for the local constraints and the \(c_i\) components of the discrete alphabet.
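The premise of the implication (4a)-(4d) can be read as an executable check, sketched below in Python for a single discrete transition. The encoding is an assumption made for illustration: a symbolic state is a dictionary with a `"state"` entry and a list `"x"` holding the n guessed evaluations, the argument `data` plays the role of the node data \(\sigma \), relations are Python predicates, and variables are referred to by 0-based index. An actual Sdta would express the same condition as a logical constraint rather than evaluate it on concrete values.

```python
import operator


def premise_holds(transition, q_l, q_r, data, q, global_cs, local_cs):
    """Evaluate the conjunction (4a)-(4d) for one transition (s_l, s_r, a, s).

    global_cs: list of (relation, [(field, x_index), ...])
    local_cs:  list of (relation, [(field, x_index), ...], field_of_node)
    """
    s_l, s_r, symbol, s = transition
    n, h = len(q["x"]), len(global_cs)
    a, b, c = symbol[:n], symbol[n:n + h], symbol[n + h:]

    # (4a): follow the discrete transition.
    if (q_l["state"], q_r["state"], q["state"]) != (s_l, s_r, s):
        return False
    # (4b): guessed data is propagated unchanged, and matches the node
    # data wherever the symbol marks the position of x_i.
    for i in range(n):
        if not (q["x"][i] == q_l["x"][i] == q_r["x"][i]):
            return False
        if a[i] == 1 and q["x"][i] != data:
            return False
    # (4c): bit b_i agrees with the i-th global constraint.
    for bit, (rel, args) in zip(b, global_cs):
        if (bit == 1) != rel(*(q["x"][x][f] for f, x in args)):
            return False
    # (4d): bit c_i agrees with the i-th local constraint on the node data.
    for bit, (rel, args, g) in zip(c, local_cs):
        if (bit == 1) != rel(*(q["x"][x][f] for f, x in args), data[g]):
            return False
    return True


# Illustrative setting: n = 2, val(x1) < val(x2) global, val(x1) < val(z) local.
guess = [{"val": 3}, {"val": 8}]
q_l = {"state": "s0", "x": guess}
q_r = {"state": "s1", "x": guess}
q = {"state": "s2", "x": guess}
global_cs = [(operator.lt, [("val", 0), ("val", 1)])]
local_cs = [(operator.lt, [("val", 0)], "val")]
t = ("s0", "s1", (0, 0, 1, 1), "s2")
t_marks_x1 = ("s0", "s1", (1, 0, 1, 1), "s2")
```

For instance, with node data `{"val": 5}` the premise holds for `t`, since 3 < 8 and 3 < 5. It fails for `t_marks_x1` on the same data, because the node marked as \(x_1\) would have to carry the guessed value 3.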
Theorem 5
Let \(\varphi \) be an Mso-D sentence of the form (2) and let \(\mathcal {A}_\varphi \) be the corresponding Sdta described above. We have \(\mathcal {L}(\varphi ) = L(\mathcal {A}_\varphi )\).
6.3 Supporting Auxiliary Data
So far, we have assumed that all connecting function symbols f appearing in the data constraints correspond to fields in \(\mathcal {S}\). In other words, all data constraints refer to data fields that are present in the trees. However, our logic also supports connecting function symbols that do not correspond to fields in the data signature. In that case, the interpretation is free to assign any value to f(u), for each node u in the data tree. Thus, the Sdta \(\mathcal {A}_\varphi \) must accept a data tree if there exists an interpretation for those functions that satisfies the formula. To achieve this effect, let \(\{ f_i \}_{i=1 \ldots k}\) be the set of connecting function symbols occurring in \(\varphi \) and not corresponding to data fields in \(\mathcal {S}\), where \(f_i\) has type \(nodes \rightarrow data_i\). Define the extended data signature
$$ \mathcal {S}' = \mathcal {S}\cup \{ f_i : data _i \}_{i=1 \ldots k}. $$
We enrich the state data signature of \(\mathcal {A}_\varphi \) as follows:
$$ \mathcal {S}^Q = \{ state : Q \} \,\cup \{ name^i:type \mid (name:type) \in \mathcal {S}', i=0\ldots n \}. $$
Compared to the original definition from Sect. 6.2, we store an extra copy of the data fields, identified by index 0, representing the data in the current node. Moreover, all copies include the auxiliary data fields. It is straightforward to adapt the constraint \(\varPsi ^\varDelta \) from Sect. 6.2 to support such auxiliary data fields.