## Abstract

Data words are words with additional edges that connect pairs of positions carrying the same data value. We consider a natural model of automaton walking on data words, called Data Walking Automaton, and study its closure properties, expressiveness, and the complexity of some basic decision problems. Specifically, we show that the class of deterministic Data Walking Automata is closed under all Boolean operations, and that the class of non-deterministic Data Walking Automata has decidable emptiness, universality, and containment problems. We also prove that deterministic Data Walking Automata are strictly less expressive than non-deterministic Data Walking Automata, which in turn are captured by Class Memory Automata.

### Keywords

Data languages, Walking automata

## 1 Introduction

Data words generalize strings over finite alphabets, where the term ‘data’ denotes the presence of elements from an infinite domain. Formally, *data words* are modeled as finite sequences of elements chosen from a set of the form \({\Sigma }\times \mathbb {D}\), where Σ is a finite alphabet and \(\mathbb {D}\) is an infinite alphabet. Elements of Σ are called *letters*, while elements of \(\mathbb {D}\) are called *data values*. Sets of data words are called *data languages*.

It is natural to investigate reasonable mechanisms (e.g., automata, logics, algebras) for specifying languages of data words. Some desirable features of such mechanisms are the decidability of the paradigmatic problems (i.e., emptiness, universality, containment) and effective closure of the recognized languages under Boolean operations and projection. A natural idea is to enhance a finite state machine with data structures that provide some ability to handle data values. Examples of these structures include registers to store data values [8, 10], pebbles to mark positions in the data word [13], and hash tables to store partitions of the data domain [1]. In [4] Data Automata are introduced and shown to capture the class of data languages definable in two-variable first-order logic over data words. Class Memory Automata [1] provide an alternative view of Data Automata. For all models, except Pebble Automata and Two-way Register Automata, the non-emptiness problem is decidable; universality and, by extension, equivalence and inclusion are undecidable for all non-deterministic models.

In this work we consider data words as sequences of letters with additional edges that connect pairs of positions carrying the same data value. This idea is consistent with the fact that as far as a data word is concerned the actual data value at a position is not relevant, but only the relative equality and inequality of positions with respect to data values. It is also worth noting that none of the above automaton models makes any distinction between permutations of the data values inside data words. Our model of automaton, called Data Walking Automaton, is naturally two-way: it can roughly be seen as a finite state device whose head moves along successor and predecessor positions, as well as along the edges that connect any position to the closest one having the same data value, either to the right or to the left. Remarkably, emptiness, universality, and containment are decidable problems for Data Walking Automata. Our automata capture, up to functional renaming of letters, all data languages recognized by Data Automata. The deterministic subclass of Data Walking Automata is shown to be closed under all Boolean operations (closure under complementation is not immediate since the machines may loop). We can also deduce from previous results on Tree Walking Automata [2, 3] that deterministic Data Walking Automata are strictly less powerful than non-deterministic Data Walking Automata, which in turn are subsumed by Data Automata.

1. We adapt the model of walking automaton, originally introduced for trees, to data words.

2. We study closure properties of the classes of data languages recognized by deterministic and non-deterministic walking automata under the operations of union, intersection, complementation, and projection.

3. We analyze the relative expressive power of the deterministic and non-deterministic models of walking automata, comparing them with other classes of automata from the literature, most notably, Data Automata. We also show that deterministic walking automata recognize all data languages definable in the two-variable fragment of first-order logic with access to the global and class successor predicates.

4. We study the complexity of fundamental problems on data languages recognized by non-deterministic walking automata; in particular, we prove that the problems of word acceptance, emptiness, universality, and containment are decidable.

5. We prove that extending the model of walking automaton with alternation results in an undecidable emptiness problem.

**Organization.** In Section 2 we give some preliminary definitions concerning the standard models of Data Automata and Tiling Automata. In Section 3 we introduce the deterministic and non-deterministic models of walking automata on data words and we prove some basic closure properties. In Section 4 we analyze the expressive power of Data Walking Automata, in comparison with Data Automata, and we prove a series of separation results analogous to those for walking automata on trees. In Section 5 we identify a fragment of first-order logic, precisely, the two-variable fragment with access to the global and class successor predicates, that is captured by the class of deterministic Data Walking Automata. In Section 6 we study the complexity of some fundamental problems involving Data Walking Automata, most notably, word acceptance, emptiness, universality, and containment. In Section 7 we consider the alternating model of Data Walking Automaton and we show that the emptiness problem becomes undecidable in this case. Section 8 provides an assessment of the results and discusses future work.

## 2 Preliminaries

Throughout this paper we will tacitly assume that all data words are non-empty – this assumption will simplify some definitions, such as that of Tiling Automaton. Given a data word *w* = (*a*_{1},*d*_{1}) ⋯ (*a*_{n},*d*_{n}), a *class* of *w* is a maximal set of positions with identical data value. The set of classes of *w* forms a partition of the set of positions and is naturally defined by the equivalence relation *i*∼*j* iff *d*_{i}=*d*_{j}.

The *global successor* and *global predecessor* of a position *i* in a data word *w* are the positions *i*+1 and *i*−1 (if they exist). The *class successor* of a position *i* is the leftmost position after *i* in its class (if it exists) and is denoted by *i*⊕1. The *class predecessor* of a position *i* is the rightmost position before *i* in its class (if it exists) and is denoted by *i*⊖1. The global and class successors of a position are collectively called *successors*, and similarly for the *predecessors*. The successors and predecessors of a position are called its *neighbors*.

Using the above definitions we can identify any data word \(w\in ({\Sigma }\times \mathbb {D})^{*}\) with a directed graph whose vertices are the positions of *w*, each one labelled by a letter from Σ, and whose edges are given by the successor and predecessor functions +1, −1, ⊕1, ⊖1. This graph can be represented in space Θ(|*w*|), where |*w*| denotes the length of *w*. For 1≤*i*≤|*w*| we denote by \(w(i) \in {\Sigma } \times \mathbb {D}\) the label of the *i*th position of *w*.
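As an illustration of the graph view above, the following Python sketch computes the class predecessor and class successor of every position in a single left-to-right pass, which is in line with the linear-size representation of the graph. The function name `class_neighbors`, the encoding of a data word as a list of (letter, data value) pairs, and the use of 0-based positions are our own choices for this sketch and are not part of the formal development.

```python
from typing import List, Optional, Tuple

DataWord = List[Tuple[str, int]]  # sequence of (letter, data value) pairs


def class_neighbors(w: DataWord) -> Tuple[List[Optional[int]], List[Optional[int]]]:
    """Return (class_pred, class_succ), indexed by 0-based position.

    class_pred[i] is the rightmost position before i with the same data value,
    class_succ[i] is the leftmost position after i with the same data value;
    an entry is None when the corresponding neighbor does not exist.
    """
    class_pred: List[Optional[int]] = [None] * len(w)
    class_succ: List[Optional[int]] = [None] * len(w)
    last_seen = {}  # data value -> most recent position carrying it
    for i, (_, d) in enumerate(w):
        if d in last_seen:
            class_pred[i] = last_seen[d]
            class_succ[last_seen[d]] = i
        last_seen[d] = i
    return class_pred, class_succ


# A small data word over the letters a, b with data values 1 and 2.
w = [("a", 1), ("b", 2), ("a", 1), ("b", 1), ("a", 2)]
pred, succ = class_neighbors(w)
```

The global successor and predecessor need no precomputation, since they are simply the adjacent positions.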

[Figure: example of a data word *w* over the alphabet \(\{a,b\}\times \mathbb {N}\)]

### 2.1 Local Types

Given a data word *w* and a position *i* in it, we introduce the local type \(\overrightarrow {\textsf {type}}_{w}(i)\) (resp., \(\overleftarrow {\textsf {type}}_{w}(i)\)) to specify whether the global and class successors (resp., predecessors) of *i* exist and whether they coincide or not. Formally, when considering the successors of a position *i*, four scenarios are possible:

1. the position *i* is the rightmost one in *w* and hence no successors exist; we denote this by \(\overrightarrow {\textsf {type}}_{w}(i)=\textsf {max}\);

2. the position *i* is not the rightmost position of *w*, but it is the rightmost in its class, in which case the global successor exists but not the class successor; we denote this by \(\overrightarrow {\textsf {type}}_{w}(i)= \textsf {cmax}\);

3. both global and class successors of *i* are defined in *w* and they coincide, i.e. *i*+1=*i*⊕1; we denote this by \(\overrightarrow {\textsf {type}}_{w}(i)=\textsf {1succ}\);

4. both successors of *i* are defined in *w* and they are different, i.e. *i*+1≠*i*⊕1; we denote this by \(\overrightarrow {\textsf {type}}_{w}(i)=\textsf {2succ}\).

We define \(\overrightarrow {\textsf {Types}}=\{\textsf {max},\textsf {cmax},\textsf {1succ},\textsf {2succ}\}\) to be the set of possible right types of positions of data words. The symmetric cases for the predecessors of *i* are signified by the left type \(\overleftarrow {\textsf {type}}_{w}(i) \in \overleftarrow {\textsf {Types}} = \{\textsf {min},\textsf {cmin},\textsf {1pred},\textsf {2pred}\}\). Finally, we define \(\textsf {type}_{w}(i) ~=~ \big (\,\overleftarrow {\textsf {type}}_{w}(i),\,\overrightarrow {\textsf {type}}_{w}(i)\big ) ~\in \textsf {Types} ~=\, \overleftarrow {\textsf {Types}} \times \overrightarrow {\textsf {Types}}\).
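The case analysis behind local types can be sketched in a few lines of Python. The sketch below is only illustrative: the function name `local_types`, the string names of the types, and the 0-based positions are assumptions of ours; the left types are obtained by mirroring the four cases given for the successors.

```python
from typing import List, Optional, Tuple


def local_types(w: List[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Return the pair (left type, right type) of every position of w."""
    n = len(w)
    pred: List[Optional[int]] = [None] * n  # class predecessors
    succ: List[Optional[int]] = [None] * n  # class successors
    last_seen = {}
    for i, (_, d) in enumerate(w):
        if d in last_seen:
            pred[i] = last_seen[d]
            succ[last_seen[d]] = i
        last_seen[d] = i

    types = []
    for i in range(n):
        if i == 0:
            left = "min"        # leftmost position: no predecessors at all
        elif pred[i] is None:
            left = "cmin"       # leftmost in its class only
        elif pred[i] == i - 1:
            left = "1pred"      # global and class predecessors coincide
        else:
            left = "2pred"      # two distinct predecessors
        if i == n - 1:
            right = "max"       # rightmost position: no successors at all
        elif succ[i] is None:
            right = "cmax"      # rightmost in its class only
        elif succ[i] == i + 1:
            right = "1succ"     # global and class successors coincide
        else:
            right = "2succ"     # two distinct successors
        types.append((left, right))
    return types


types = local_types([("a", 1), ("b", 2), ("a", 1), ("b", 1), ("a", 2)])
```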

### 2.2 Class Memory Automata

We will rely on results on Data Automata [4] for our decidability results. However, for convenience we will use an equivalent model called Class Memory Automata [1]. We use [*n*] to denote the subset {1,...,*n*} of the natural numbers. Intuitively, a Class Memory Automaton is a finite state automaton enhanced with a hash function that assigns a *memory value* from a finite set [*k*] to each data value in \(\mathbb {D}\). On encountering a pair (*a*,*d*), a transition is non-deterministically chosen from a set that depends on the current state of the automaton, the memory value *f*(*d*), and the input letter *a*. When a transition on (*a*,*d*) is executed, the current state and the memory value of *d* are updated. Below we give a formal definition of a Class Memory Automaton. Later we will show that this model is also similar to that of Tiling Automata [17].

**Definition 1**

A *Class Memory Automaton* (*CMA* for short) is a tuple \(\mathcal {C}=(Q,{\Sigma },k,{\Delta },I,F,K)\), where:

- *Q* is the finite set of states,
- Σ is the finite alphabet,
- [*k*] is the set of memory values,
- \({\Delta } ~\subseteq ~ Q\times {\Sigma } \times [k] \times Q \times [k]\) is the transition relation,
- \(I\subseteq Q\) is the set of initial states,
- \(F\subseteq Q\) is the set of final states,
- \(K\subseteq [k]\) is the set of final memory values.

*Configurations* are pairs (*q*,*f*), with *q*∈*Q* and \(f\in [k]^{\mathbb {D}}\), i.e. pairs consisting of a control state and a function from \(\mathbb {D}\) to [*k*]. *Transitions* are of the form \((q,f) ~\overset {(a,d)}{\longrightarrow }~ (q^{\prime },f^{\prime })\), where \((q,a,f(d),q^{\prime },h)\in {\Delta }\), \(f^{\prime }(d)=h\), and \(f^{\prime }(e)=f(e)\) for all \(e\neq d\). Sequences of transitions are called *runs*. The *initial configurations* are the pairs (*q*_{0},*f*_{0}), with *q*_{0}∈*I* and *f*_{0}(*d*)=1 for all \(d\in \mathbb {D}\); the *final configurations* are the pairs (*q*,*f*), with *q*∈*F* and *f*(*d*)∈*K* for all \(d\in \mathbb {D}\). The *recognized language* \(\mathcal {L}(\mathcal {C})\) consists of the data words \(w \:=\: (a_{1}, d_{1}) \:\cdots \: (a_{n},d_{n}) ~\in ({\Sigma }\times \mathbb {D})^{*}\) that admit runs of the form \((q_{0},f_{0}) ~\overset {(a_{1},d_{1})}{\longrightarrow } ~ {\cdots } ~\overset {(a_{n},d_{n})}{\longrightarrow }~ (q_{n},f_{n})\), starting in an initial configuration and ending in a final configuration.

It is known that data languages recognized by CMA are effectively closed under union and intersection, but not under complementation. Their emptiness problem is decidable and reduces to reachability in vector addition systems, which is decidable but not known to be of elementary complexity. Inclusion and universality problems for CMA are undecidable.
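To make the run semantics of Definition 1 concrete, here is a minimal Python sketch that decides membership of a single data word by tracking the set of reachable configurations. It is an illustration under our own assumptions rather than a construction used later: the function name `cma_accepts` is hypothetical, memory is stored only for the data values occurring in the word, and the final condition additionally requires 1 ∈ *K*, because \(\mathbb {D}\) is infinite and some data value never occurs in the word and keeps its initial memory value 1.

```python
from typing import Iterable, List, Set, Tuple

Transition = Tuple[str, str, int, str, int]  # (q, a, h, q', h')


def cma_accepts(delta: Iterable[Transition], initial: Set[str], final: Set[str],
                final_mem: Set[int], w: List[Tuple[str, int]]) -> bool:
    """Decide whether the CMA accepts the non-empty data word w."""
    values = sorted({d for _, d in w})
    index = {d: j for j, d in enumerate(values)}
    delta = list(delta)
    # Every data value starts with memory value 1.
    configs = {(q, tuple(1 for _ in values)) for q in initial}
    for a, d in w:
        j = index[d]
        successors = set()
        for q, mem in configs:
            for (p, b, h, p2, h2) in delta:
                if p == q and b == a and h == mem[j]:
                    # Update the state and the memory value of d only.
                    successors.add((p2, mem[:j] + (h2,) + mem[j + 1:]))
        configs = successors
    # Unused data values still carry memory value 1, hence the test 1 in final_mem.
    return any(q in final and 1 in final_mem and all(h in final_mem for h in mem)
               for q, mem in configs)


# Example: memory value 1 means "fresh", 2 means "already seen".
# The automaton accepts exactly the words with pairwise distinct data values.
delta = {("s", a, 1, "s", 2) for a in "ab"}
distinct = cma_accepts(delta, {"s"}, {"s"}, {1, 2}, [("a", 7), ("b", 9)])
repeated = cma_accepts(delta, {"s"}, {"s"}, {1, 2}, [("a", 7), ("b", 7)])
```

The number of tracked configurations can grow exponentially with the number of distinct data values, so this sketch is meant for small examples only.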

The following result, paired with closure under intersection, allows us to assume that the information about local types of positions of a data word is available to CMA:

**Proposition 1** (Björklund and Schwentick [1])

*Let L be the set of all data words* \(w \in ({\Sigma } \times \textsf {Types}\times \mathbb {D})^{*}\) *such that, for all positions i, w(i)=(a,τ,d) implies* \(\tau =\textsf {type}_{w}(i)\)*. The language L is recognized by a CMA.*

### 2.3 Tiling Automata

Here we briefly recall the definitions of another class of automata, called *Tiling Automata* or *Graph Automata* [17]. Such automata receive acyclic directed graphs of bounded degree as input and they capture the expressiveness of the existential fragment of monadic second-order logic. In order to accept an input graph, a Tiling Automaton associates, in a non-deterministic way, a *color* (or state) to each node and then checks that the resulting colored spheres satisfy some specific constraints. Here, by colored sphere centered at a node *v* we mean precisely the subgraph induced by the set of nodes at distance at most *r* from *v*, for a fixed number *r* which is a parameter of the automaton (note that this set has bounded size because the input graph has bounded degree). Accordingly, the constraints of a Tiling Automaton are encoded by a finite set of graphs, hereafter called *tiles*, that describe the admitted spheres in an input graph marked with colors. Below, we give a definition of Tiling Automaton that is tailored for graphs representing data words; we refer to [17] for a more general definition and an account of the basic properties. In particular, we fix the radius of the spheres to be 1, as it is usually done with graphs that represent finite words, trees, or pictures over a finite alphabet. However, we remark that considering only spheres of radius 1 limits the expressiveness of Tiling Automata as acceptors of data words, since the restriction will capture only data languages definable in a strict fragment of existential monadic second-order logic over the relations +1 and ⊕1. Subsequently, we will show that CMA and Tiling Automata are equally expressive when operating on the subclass of graphs representing data words.

We define a *tile* as a tuple of the form \(t=(a,\tau ,\gamma _{0},\gamma _{-1},\gamma _{\ominus 1},\gamma _{+1},\gamma _{\oplus 1})\) that specifies the letter *a* and the possible type *τ* of a position *i*, as well as the possible colors *γ*_{0}, *γ*_{−1}, *γ*_{⊖1}, *γ*_{+1}, *γ*_{⊕1} that can be associated with *i* and its neighboring positions *i*−1, *i*⊖1, *i*+1, *i*⊕1. For the sake of brevity, an element *α* among 0,−1,⊖1,+1,⊕1 is called an *axis* and correspondingly the color *γ*_{α} is denoted by *t*[*α*]. Clearly, we assume that *t*[*α*] is undefined (denoted *t*[*α*]=⊥) for all and only those axes (i.e., successors or predecessors) that are missing, as indicated by the type *τ*. Similarly, we assume that *t*[−1]=*t*[⊖1] (resp., *t*[+1]=*t*[⊕1]) whenever \(\tau \in \{\textsf {1pred}\}\times \overrightarrow {\textsf {Types}}\) (resp., \(\tau \in \overleftarrow {\textsf {Types}}\times \{\textsf {1succ}\}\)).

For example, if *τ*=(cmin,1succ), then the tuple *t*=(*a*,*τ*,*γ*_{0},*γ*_{−1},*γ*_{⊖1}, *γ*_{+1},*γ*_{⊕1}) is a tile only if *γ*_{0},*γ*_{−1}≠⊥, *γ*_{⊖1}=⊥, and *γ*_{+1}=*γ*_{⊕1}≠⊥.

**Definition 2**

A *Tiling Automaton* is a triple \(\mathcal {T}=({\Sigma },{\Gamma },T)\) consisting of a finite alphabet Σ, a finite set Γ of colors, and a finite set *T* of tiles over Σ and Γ. A *tiling* by \(\mathcal {T}\) of a data word *w* = (*a*_{1},*d*_{1}) … (*a*_{n},*d*_{n}) is a function \(\tilde {w}: [n] \rightarrow {\Gamma }\) such that, for all positions *i* in *w*, the tile \(\big (a_{i},\,\textsf {type}_{w}(i),\,\tilde {w}(i),\,\tilde {w}(i-1),\,\tilde {w}(i\ominus 1),\,\tilde {w}(i+1),\,\tilde {w}(i\oplus 1)\big )\) belongs to *T*, where \(\tilde {w}(j)\) is read as ⊥ whenever the neighbor *j* does not exist. The language recognized by the Tiling Automaton \(\mathcal {T}\) consists of all data words that admit a valid tiling by \(\mathcal {T}\).
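The tiling condition can be checked mechanically once a coloring is given. The Python sketch below is a hedged illustration: it assumes the tile layout (*a*, *τ*, *γ*_{0}, *γ*_{−1}, *γ*_{⊖1}, *γ*_{+1}, *γ*_{⊕1}) used in the example above, represents ⊥ by `None`, and uses the hypothetical name `is_valid_tiling`. It only verifies a given coloring; it does not search for one.

```python
from typing import List, Optional, Set, Tuple

Color = str
Tile = tuple  # (letter, (left type, right type), c_0, c_-1, c_cpred, c_+1, c_csucc)


def is_valid_tiling(tiles: Set[Tile], w: List[Tuple[str, int]],
                    coloring: List[Color]) -> bool:
    """Check that every radius-1 sphere of the colored data word is a tile."""
    n = len(w)
    pred: List[Optional[int]] = [None] * n  # class predecessors
    succ: List[Optional[int]] = [None] * n  # class successors
    last_seen = {}
    for i, (_, d) in enumerate(w):
        if d in last_seen:
            pred[i] = last_seen[d]
            succ[last_seen[d]] = i
        last_seen[d] = i

    def color(j: Optional[int]) -> Optional[Color]:
        return None if j is None else coloring[j]  # None plays the role of the undefined color

    for i, (a, _) in enumerate(w):
        left = ("min" if i == 0 else "cmin" if pred[i] is None
                else "1pred" if pred[i] == i - 1 else "2pred")
        right = ("max" if i == n - 1 else "cmax" if succ[i] is None
                 else "1succ" if succ[i] == i + 1 else "2succ")
        sphere = (a, (left, right), coloring[i],
                  color(i - 1 if i > 0 else None), color(pred[i]),
                  color(i + 1 if i < n - 1 else None), color(succ[i]))
        if sphere not in tiles:
            return False
    return True


# Two tiles that together admit exactly one coloring of the word (a,1)(b,1).
tiles = {
    ("a", ("min", "1succ"), "x", None, None, "y", "y"),
    ("b", ("1pred", "max"), "y", "x", "x", None, None),
}
word = [("a", 1), ("b", 1)]
```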

The result below follows from simple translations of automata and depends on the fact that CMA can compute the types of the positions in a data word.

**Proposition 2**

*CMA and Tiling Automata on data words are equivalent. Moreover, there exist polynomial-time translations between the two models.*

*Proof*

We first translate a CMA into a Tiling Automaton. Let \(\mathcal {C}=(Q,{\Sigma },k,{\Delta },I,F,K)\) be a CMA. We take as set of colors Γ = *Q*×[*k*], where each color is meant to describe the state and the memory value of the data value that appear in a possible run of \(\mathcal {C}\) immediately after a given position. To construct an equivalent Tiling Automaton \(\mathcal {T}=({\Sigma },{\Gamma },T)\), it suffices to describe the set *T* of admitted tiles. Formally, this set consists of those tuples \(t=(a,\tau ,\gamma _{0},\gamma _{-1},\gamma _{\ominus 1},\gamma _{+1},\gamma _{\oplus 1})\) that satisfy the following conditions:

- if \(\tau \in \{\textsf {min}\}\times \overrightarrow {\textsf {Types}}\), then \(\gamma _{0}=(q^{\prime },h^{\prime })\) for some transition \((q_{0},a,1,q^{\prime },h^{\prime })\in {\Delta }\) and some initial state *q*_{0}∈*I*;
- if \(\tau \in \{\textsf {cmin}\}\times \overrightarrow {\textsf {Types}}\), then *γ*_{−1}=(*q*,*h*) for some memory value *h*∈[*k*] and \(\gamma _{0}=(q^{\prime },h^{\prime })\) for some transition \((q,a,1,q^{\prime },h^{\prime })\in {\Delta }\);
- if \(\tau \in \{\textsf {1pred}\}\times \overrightarrow {\textsf {Types}}\), then *γ*_{−1}=(*q*,*h*) and \(\gamma _{0}=(q^{\prime },h^{\prime })\) for some transition \((q,a,h,q^{\prime },h^{\prime })\in {\Delta }\);
- if \(\tau \in \{\textsf {2pred}\}\times \overrightarrow {\textsf {Types}}\), then *γ*_{−1}=(*q*,*h*) for some memory value *h*∈[*k*], \(\gamma _{\ominus 1}=(q^{\prime \prime },h^{\prime \prime })\) for some state \(q^{\prime \prime }\in Q\) and some memory value \(h^{\prime \prime }\in [k]\), and \(\gamma _{0}=(q^{\prime },h^{\prime })\) for some transition \((q,a,h^{\prime \prime },q^{\prime },h^{\prime })\in {\Delta }\);
- if \(\tau \in \overleftarrow {\textsf {Types}}\times \{\textsf {cmax}\}\), then *γ*_{0}=(*q*,*h*) for some state *q*∈*Q* and some final memory value *h*∈*K*;
- if \(\tau \in \overleftarrow {\textsf {Types}}\times \{\textsf {max}\}\), then *γ*_{0}=(*q*,*h*) for some final state *q*∈*F* and some final memory value *h*∈*K*.

(Note that there is no need to further constrain the tile *t* by considering the case where \(\tau \in \overleftarrow {\textsf {Types}}\times \{\textsf {1succ},\textsf {2succ}\}\), as the required constraints will be enforced when considering the possible tiles associated with the class successor, which have type \(\tau ^{\prime }\in \{\textsf {1pred},\textsf {2pred}\}\times \overrightarrow {\textsf {Types}}\).)

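Consider now a data word *w* = (*a*_{1},*d*_{1}) (*a*_{2},*d*_{2}) … (*a*_{n},*d*_{n}) and a run of the CMA \(\mathcal {C}\) on *w* of the form \((q_{0},f_{0}) ~\overset {(a_{1},d_{1})}{\longrightarrow }~ (q_{1},f_{1}) ~\overset {(a_{2},d_{2})}{\longrightarrow }~ {\cdots } ~\overset {(a_{n},d_{n})}{\longrightarrow }~ (q_{n},f_{n})\). We let \(h_{i}=f_{i}(d_{i})\) for all *i*=1,...,*n* and we observe that the following is a valid tiling of *w* by \(\mathcal {T}\) (we succinctly represent it by a string over *Q*×[*k*]): \((q_{1},h_{1})\,(q_{2},h_{2})\,\ldots \,(q_{n},h_{n})\). Conversely, given a valid tiling \((q_{1},h_{1})\,(q_{2},h_{2})\,\ldots \,(q_{n},h_{n})\) of *w* by \(\mathcal {T}\), we obtain a run of \(\mathcal {C}\) on *w* by choosing an initial state *q*_{0}∈*I* that witnesses the tile at the first position and by defining the functions *f*_{i} inductively for all *i*=0,...,*n*, as follows: \(f_{0}(d)=1\) for all \(d\in \mathbb {D}\) and, for *i*≥1, \(f_{i}(d)=h_{i}\) if \(d=d_{i}\) and \(f_{i}(d)=f_{i-1}(d)\) otherwise.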

It follows that the Tiling Automaton \(\mathcal {T}\) defines the same data language recognized by the CMA \(\mathcal {C}\).

We now translate a Tiling Automaton \(\mathcal {T}=({\Sigma },{\Gamma },T)\) into a CMA \(\mathcal {C}\). By Proposition 1 and closure of CMA under intersection, we can assume that the input of \(\mathcal {C}\) is a data word *w*∈*L* augmented with the information about the local types. The states of \(\mathcal {C}\) are the tiles in *T*, plus a distinguished initial state *q*_{0}, i.e. *Q*=*T*⊎{*q*_{0}}. Moreover, we let *k*=|*Q*| and we identify the memory values in [*k*] with the states in *Q*; in particular, we identify the memory value 1 with the initial state *q*_{0}. We now turn towards defining the transition relation Δ of \(\mathcal {C}\). Since memory values are identified with states, a generic transition rule of \(\mathcal {C}\) is a tuple of the form \(\left (t,(a,\tau ),t^{\prime },t^{\prime \prime },t^{\prime \prime }\right )\) (current state, input symbol, memory value, new state, new memory value, as in Definition 1) that, on the basis of the input symbol (*a*,*τ*) and the states *t* and \(t^{\prime }\) associated, respectively, with the global predecessor and the class predecessor, specifies a possible state \(t^{\prime \prime }\) that could be associated with the current position, as succinctly described by the diagram [diagram omitted] (as usual, assume \(t^{\prime }=q_{0}\) when there is no class predecessor and *t*=*q*_{0} when there is no global predecessor). Hence it suffices to define Δ as the set of tuples of the form \(\left (t,(a,\tau ),t^{\prime },t^{\prime \prime },t^{\prime \prime }\right )\), with (*a*,*τ*)∈Σ×Types and \(t,t^{\prime },t^{\prime \prime }\in Q\), such that

1. if \(\tau \in \{\textsf {min}\}\times \overrightarrow {\textsf {Types}}\), then \(t=t^{\prime }=q_{0}\);

2. if \(\tau \in \{\textsf {cmin}\}\times \overrightarrow {\textsf {Types}}\), then \(t^{\prime }=q_{0}\), \(t[+1]=t^{\prime \prime }[0]\), and \(t^{\prime \prime }[-1]=t[0]\);

3. if \(\tau \in \{\textsf {1pred}\}\times \overrightarrow {\textsf {Types}}\), then \(t=t^{\prime }\), \(t[+1]=t[\oplus 1]=t^{\prime \prime }[0]\), and \(t^{\prime \prime }[-1]=t[0]\);

4. if \(\tau \in \{\textsf {2pred}\}\times \overrightarrow {\textsf {Types}}\), then \(t^{\prime }[\oplus 1]=t[+1]=t^{\prime \prime }[0]\), \(t^{\prime \prime }[-1]=t[0]\), and \(t^{\prime \prime }[\ominus 1]=t^{\prime }[0]\).

Finally, the sets *F* and *K* of final states and final memory values contain those tiles *t*=(*a*,*τ*,*γ*_{0},…,*γ*_{⊕1}) whose type *τ* belongs to \(\overleftarrow {\textsf {Types}}\times \{\textsf {max},\textsf {cmax}\}\).

Let us now consider a data word *w* = (*a*_{1},*d*_{1}) (*a*_{2},*d*_{2}) … (*a*_{n},*d*_{n}) and define \(w^{\prime } \:=\: (a_{1},\tau _{1},d_{1}) \: (a_{2},\tau _{2},d_{2}) \:\ldots \: (a_{n},\tau _{n},d_{n})\), where *τ*_{i}=type_{w}(*i*) for all positions *i*∈[*n*]. Any tiling of *w* by \(\mathcal {T}\) can be turned into a valid run of \(\mathcal {C}\) on \(w^{\prime }\) by simply prepending the initial configuration (*q*_{0},*f*_{0}) to the sequence of tiles. Conversely, any run of \(\mathcal {C}\) on \(w^{\prime }\) devoid of the initial configuration can be seen as a tiling of *w* by \(\mathcal {T}\). □

## 3 Automata Walking on Data Words

An automaton walking on data words is a finite state acceptor that processes a data word by moving its head along the successors and predecessors of positions. We let Axis={0,+1,⊕1,−1,⊖1} be the set of the five possible directions of navigation in a data word (0 stands for ‘stay in the current position’).

**Definition 3**

A *Data Walking Automaton* (*DWA* for short) is defined as a tuple \(\mathcal {A} = (Q,{\Sigma },{\Delta },I,F)\), where *Q* is the finite set of states, Σ is the finite alphabet, \({\Delta } ~\subseteq ~ Q\times {\Sigma } \times \textsf {Types} \times Q\times \textsf {Axis}\) is the transition relation, \(I\subseteq Q\) is the set of initial states, \(F\subseteq Q\) is the set of final states.

Let \(w ~=~ (a_{1}, d_{1}) ~ {\cdots } ~ (a_{n},d_{n}) \in ({\Sigma }\times \mathbb {D})^{*}\) be a data word. Given *i*∈[*n*] and *α*∈Axis, we denote by *α*(*i*) the position that is reached from *i* by following the axis *α* (for instance, if *α*=0 then *α*(*i*)=*i*, if *α*=⊕1 then *α*(*i*)=*i*⊕1, provided that *i* is not the last element in its class). A configuration of \(\mathcal {A}\) is a pair consisting of a state *q*∈*Q* and a position *i*∈[*n*]. A transition is a tuple of the form \((p,i) ~\overset {w}{\longrightarrow }~ (q,j)\) such that (*p*,*a*_{i},*τ*,*q*,*α*)∈Δ, with *τ*=type_{w}(*i*) and *j*=*α*(*i*). The initial configurations are the pairs (*q*_{0},*i*_{0}), with *q*_{0}∈*I* and *i*_{0}=1. The halting configurations are those pairs (*q*,*i*) on which no transition is enabled; such configurations are said to be *final* if *q*∈*F*. The language \(\mathcal {L}(\mathcal {A})\) recognized by \(\mathcal {A}\) is the set of all data words \(w\in ({\Sigma }\times \mathbb {D})^{*}\) that admit a run of \(\mathcal {A}\) that starts in an initial configuration and halts in a final configuration.

We will also consider *deterministic* versions of DWA, in which the set *I* of initial states is a singleton and the transition relation Δ is a partial function from *Q*×Σ×Types to *Q*×Axis.

*Example 1*

Let *L*_{1} be the set of all data words that contain at most one occurrence of each data value (this language is equally defined by the formula \(\forall {x}\forall {y} ~ x \sim y \rightarrow x=y\)). A deterministic DWA can recognize *L*_{1} by reading the input data word from left to right (along axis +1) and by checking that every position except the last one has type (min,cmax) or (cmin,cmax), the former being possible only at the first position. When a position with type (cmin,max) or (min,max) is reached, the machine halts in an accepting state.
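The run semantics of Definition 3 and the automaton of Example 1 can be sketched in Python as follows. This is an illustration under assumptions of ours: positions are 0-based, the axes ⊕1 and ⊖1 are written `"c+1"` and `"c-1"`, a transition whose axis is missing at the current position is treated as not enabled, and a run that revisits a configuration is rejected because a deterministic run that repeats a configuration never halts. The names `analyse`, `run_deterministic_dwa`, and `example1_automaton` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Type = Tuple[str, str]
Delta = Dict[Tuple[str, str, Type], Tuple[str, str]]  # (q, a, type) -> (q', axis)


def analyse(w: List[Tuple[str, int]]):
    """Local types and, for each position, the target of every axis (None if missing)."""
    n = len(w)
    pred, succ, last_seen = [None] * n, [None] * n, {}
    for i, (_, d) in enumerate(w):
        if d in last_seen:
            pred[i] = last_seen[d]
            succ[last_seen[d]] = i
        last_seen[d] = i
    types, moves = [], []
    for i in range(n):
        left = ("min" if i == 0 else "cmin" if pred[i] is None
                else "1pred" if pred[i] == i - 1 else "2pred")
        right = ("max" if i == n - 1 else "cmax" if succ[i] is None
                 else "1succ" if succ[i] == i + 1 else "2succ")
        types.append((left, right))
        moves.append({"0": i, "+1": i + 1 if i < n - 1 else None,
                      "-1": i - 1 if i > 0 else None, "c+1": succ[i], "c-1": pred[i]})
    return types, moves


def run_deterministic_dwa(delta: Delta, q0: str, final: Set[str],
                          w: List[Tuple[str, int]]) -> bool:
    """Run a deterministic DWA on the non-empty data word w."""
    types, moves = analyse(w)
    q, i = q0, 0
    visited = set()
    while (q, i) not in visited:
        visited.add((q, i))
        step = delta.get((q, w[i][0], types[i]))
        target = moves[i][step[1]] if step else None
        if target is None:          # no enabled transition: halting configuration
            return q in final
        q, i = step[0], target
    return False                    # a configuration repeats: the run loops forever


def example1_automaton(alphabet: str):
    """Deterministic DWA for L1: every data value occurs at most once."""
    delta: Delta = {}
    for a in alphabet:
        for left in ("min", "cmin"):
            delta[("scan", a, (left, "cmax"))] = ("scan", "+1")   # keep scanning
            delta[("scan", a, (left, "max"))] = ("accept", "0")   # last position reached
    return delta, "scan", {"accept"}


delta1, q0, final = example1_automaton("ab")
```

Any position whose class contains a second element has a type outside the four listed ones, so the automaton halts there in the non-final state `scan`.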

*Example 2*

Let *L*_{2} be the set of all data words in which every occurrence of *a* is followed by an occurrence of *b* in the same class (this is expressed by the formula \(\forall {x} ~ a(x) \rightarrow \exists {y} ~ b(y) \wedge x<y \wedge x \sim y\)). A deterministic DWA can recognize *L*_{2} by scanning the input data word along the axis +1. On each position *i* with left type min or cmin, the machine starts a subcomputation that scans the entire class of *i* along the axis ⊕1, and verifies that every *a* is followed by a *b*. The subcomputation terminates when a position with right type cmax or max is reached, after which the machine traverses the class backwards, up to the position *i* with left type min or cmin, and then resumes the main computation from the successor *i*+1. Intuitively, the automaton traverses the data word from left to right in a ‘class-first’ manner.
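The class-first traversal of Example 2 can be mimicked by a Python function that keeps a single head position and one bit of control information, and moves only along +1, ⊕1, and ⊖1. The sketch is ours and is not a transition table of an actual DWA: the names `walks_class_first` and `in_L2_by_definition` are hypothetical, positions are 0-based, and the second function is a direct reading of the defining formula, included only for comparison on small inputs.

```python
from itertools import product
from typing import List, Tuple


def walks_class_first(w: List[Tuple[str, int]]) -> bool:
    """Simulate the class-first walk of Example 2 on a non-empty data word."""
    n = len(w)
    pred, succ, last_seen = [None] * n, [None] * n, {}
    for i, (_, d) in enumerate(w):
        if d in last_seen:
            pred[i] = last_seen[d]
            succ[last_seen[d]] = i
        last_seen[d] = i

    pos = 0
    while True:
        if pred[pos] is None:                 # left type min or cmin: a class starts here
            pending = False                   # True iff the last a seen has no later b yet
            while True:
                if w[pos][0] == "a":
                    pending = True
                elif w[pos][0] == "b":
                    pending = False
                if succ[pos] is None:         # right type cmax or max: end of the class
                    break
                pos = succ[pos]               # move along the class successor
            if pending:
                return False                  # some a is not followed by a b in its class
            while pred[pos] is not None:      # walk back to the first position of the class
                pos = pred[pos]
        if pos == n - 1:
            return True                       # right type max: the whole word was scanned
        pos += 1                              # resume the main scan along +1


def in_L2_by_definition(w: List[Tuple[str, int]]) -> bool:
    """Every a is followed by a b carrying the same data value."""
    return all(
        any(w[y] == ("b", d) for y in range(x + 1, len(w)))
        for x, (a, d) in enumerate(w) if a == "a"
    )
```

The walk visits each class twice (forward and backward), so the number of head moves is linear in the length of the word.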

*Example 3*

Let *L*_{3} be the set of all data words in which every occurrence of *a* is followed by an occurrence of *b* that is not in the same class (this is expressed by the formula \(\forall x.~ a(x) ~\rightarrow ~ \exists y.~ b(y) ~\wedge ~ x<y ~\wedge ~ x \nsim y\)). This language is recognized by a deterministic DWA, although not in an obvious way. Fix a data word *w*. It is easy to see that *w*∈*L*_{3} iff one of the following cases holds:

1. there is no occurrence of *a* in *w*,

2. *w* contains a rightmost occurrence of *b*, say in position *ℓ*_{b}, and all occurrences of *a* are before *ℓ*_{b}; in addition, we require that either the class of *ℓ*_{b} does not contain an *a*, or the class of *ℓ*_{b} contains a rightmost occurrence of *a*, say in position *ℓ*_{a}, and another *b* appears after *ℓ*_{a} but outside the class of *ℓ*_{b}.

We show how to verify the second case by a deterministic DWA. For this, the automaton reaches the rightmost position of *w* and searches backward, following the axis −1, for the first occurrence of *b*: this puts the head of the automaton in position *ℓ*_{b}. From position *ℓ*_{b} the automaton searches along the axis ⊖1 for an occurrence of *a*. If no occurrence of *a* is found before seeing the left type cmin, then the automaton halts and accepts. Otherwise, as soon as an *a* is seen (necessarily at position *ℓ*_{a}), a second phase starts that tries to find another occurrence of *b* after *ℓ*_{a} and outside the class of *ℓ*_{b} (we call such an occurrence a *b*-witness). To do this, the automaton moves along the axis +1 until it sees a *b*, say at position *i*. After that, it scans the class of *i* along the axis ⊕1. If the right type cmax is seen before seeing a *b*, this means that *i* was the position of the last *b* in the class of *i*: in this case, the automaton goes back to position *i* (which is now the first position along axis ⊖1 that contains a *b*) and accepts iff another *b* is seen along the axis +1 (thanks to the previous test, that occurrence of *b* must be outside the class of *ℓ*_{b} and hence a *b*-witness). Otherwise, if a *b* is seen in position *j*>*i* the automaton backtracks to position *i* and resumes the search for another occurrence of *b* along the axis +1 (note that if *i* is a *b*-witness, then *j* is also a *b*-witness, which will be processed by the automaton eventually).
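The case distinction used in Example 3 can be sanity-checked by brute force on small data words. The Python sketch below compares a direct reading of the defining formula of *L*_{3} with the two cases listed above; it does not implement the walking automaton itself. The function names and the 0-based positions are our own.

```python
from itertools import product
from typing import List, Tuple


def in_L3_by_definition(w: List[Tuple[str, int]]) -> bool:
    """Every a is followed by a b carrying a different data value."""
    return all(
        any(w[y][0] == "b" and w[y][1] != d for y in range(x + 1, len(w)))
        for x, (a, d) in enumerate(w) if a == "a"
    )


def in_L3_by_cases(w: List[Tuple[str, int]]) -> bool:
    """Membership in L3 according to the two cases of Example 3."""
    a_positions = [i for i, (a, _) in enumerate(w) if a == "a"]
    if not a_positions:
        return True                                   # case 1: no occurrence of a
    b_positions = [i for i, (a, _) in enumerate(w) if a == "b"]
    if not b_positions:
        return False                                  # an a exists but no b at all
    l_b = b_positions[-1]                             # rightmost occurrence of b
    if a_positions[-1] > l_b:
        return False                                  # some a occurs after the last b
    d_b = w[l_b][1]
    a_in_class = [i for i in a_positions if w[i][1] == d_b]
    if not a_in_class:
        return True                                   # the class of l_b contains no a
    l_a = a_in_class[-1]                              # rightmost a in the class of l_b
    # A b-witness: another b after l_a and outside the class of l_b.
    return any(w[y][0] == "b" and w[y][1] != d_b for y in range(l_a + 1, len(w)))
```

Comparing the two functions over all words of length at most 4 with letters *a*, *b* and three data values exercises both cases of the characterization.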

### 3.1 Closure Properties

We show here some basic closure properties for the class of non-deterministic DWA and the class of deterministic DWA under the set-theoretic operations of union, intersection, and complementation. We defer to Section 6 a study of (non)closure properties of DWA under projection; there we will be able to build on a number of results involving the classes of deterministic DWA, non-deterministic DWA, and CMA.

**Proposition 3**

*The class of non-deterministic DWA is effectively closed under union and intersection.*

*Proof*

Closure under union for the class of non-deterministic DWA is easily shown by taking a disjoint union of the state space of the two automata. Closure under intersection is shown by assuming without loss of generality that one of the two automata accepts only by halting in the leftmost position and by coupling its final states with the initial states of the other automaton. □

Analogous closure properties hold for the class of deterministic DWA, but now rely on the fact that one can remove loops from deterministic computations.

**Proposition 4**

*Given a deterministic DWA*\(\mathcal {A}\)*, one can construct in linear time a deterministic DWA*\(\mathcal {A}^{\prime }\)*equivalent to*\(\mathcal {A}\)*that always halts.*

*Proof*

This proof is an adaptation of Sipser’s construction for eliminating loops from deterministic space-bounded Turing machines [16]. We fix for the rest of the proof a deterministic DWA \(\mathcal {A}=(Q, {\Sigma }, {\Delta }, q_{0}, F)\) and an input data word \(w ~=~ (a_{1}, d_{1}) ~ {\cdots } ~ (a_{n},d_{n}) \in ({\Sigma }\times \mathbb {D})^{*}\) of length *n*. We define the *configuration graph* of \(\mathcal {A}\) on *w* as the directed graph \(\mathcal {G}(\mathcal {A}, w)\) with vertices *Q*×[*n*] and edges of the form \((p,i) ~\overset {w}{\longrightarrow }~ (q,j)\). The *reverse configuration graph*\(\mathcal {G}^{\textsf {rev}}(\mathcal {A}, w)\) is the graph obtained from \(\mathcal {G}(\mathcal {A},w)\) by reversing the edges. The basic argument behind the construction is that the reverse configuration graph \(\mathcal {G}^{\textsf {rev}}(\mathcal {A}, w)\) of a *deterministic* DWA is a forest. If not, there would be two distinct paths from a vertex (*p*,*i*) to a vertex (*q*,*j*) in \(\mathcal {G}^{\textsf {rev}}(\mathcal {A}, w)\) contradicting the fact that \(\mathcal {A}\) is deterministic.

As in the case of Turing machines, without loss of generality we can assume that \(\mathcal {A}\) has a unique final state *q*_{f} and in the case of a successful run the automaton \(\mathcal {A}\) finishes at the last position in state *q*_{f}. The data word *w* is in \(\mathcal {L}(\mathcal {A})\) if and only if there is a path in \(\mathcal {G}^{\textsf {rev}}(\mathcal {A}, w)\) from the configuration (*q*_{f},*n*) to the configuration (*q*_{0},1). We construct a deterministic DWA \(\mathcal {A}^{\prime }\) that searches for such a path by performing a depth-first traversal of the tree rooted at (*q*_{f},*n*). The idea is implemented in the following way. We fix an arbitrary order on the transitions in Δ. In particular, this allows us to identify the first, second, ... edge leaving a certain node (*q*,*i*) in the graph \(\mathcal {G}^{\textsf {rev}}(\mathcal {A}, w)\). The states of the automaton \(\mathcal {A}^{\prime }\) are the transitions of \(\mathcal {A}\), and \(\mathcal {A}^{\prime }\) starts at the last position in the state corresponding to the first transition with target *q*_{f}. Traversing the edge from a vertex (*q*,*j*) to a child (*p*,*i*) in the graph \(\mathcal {G}^{\textsf {rev}}(\mathcal {A}, w)\) is simulated by applying the transition contained in the state of \(\mathcal {A}^{\prime }\) backwards. When a node has no children or all its children have been traversed, the automaton goes to the parent by taking the unique possible transition at that node and computes the next transition for the parent node. At any point during the simulation, if the node (*q*_{0},1) is visited, then the automaton halts and accepts. Otherwise the simulation terminates eventually at the root (*q*_{f},*n*) and the input is rejected. □
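The idea of searching the reverse configuration graph can be illustrated offline, outside the automaton model. The Python sketch below builds the reverse edges explicitly and performs a depth-first search from the final halting configurations; it is not the automaton \(\mathcal {A}^{\prime }\) of the proof, which performs the same traversal with a single head and no auxiliary storage. Several choices are assumptions of ours: all halting configurations with a final state are taken as roots (instead of the single normalized root of the proof), positions are 0-based, ⊕1 and ⊖1 are written `"c+1"` and `"c-1"`, and the names `analyse` and `accepts_by_backward_search` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Config = Tuple[str, int]  # (state, position)


def analyse(w: List[Tuple[str, int]]):
    """Local types and, for each position, the target of every axis (None if missing)."""
    n = len(w)
    pred, succ, last_seen = [None] * n, [None] * n, {}
    for i, (_, d) in enumerate(w):
        if d in last_seen:
            pred[i] = last_seen[d]
            succ[last_seen[d]] = i
        last_seen[d] = i
    types, moves = [], []
    for i in range(n):
        left = ("min" if i == 0 else "cmin" if pred[i] is None
                else "1pred" if pred[i] == i - 1 else "2pred")
        right = ("max" if i == n - 1 else "cmax" if succ[i] is None
                 else "1succ" if succ[i] == i + 1 else "2succ")
        types.append((left, right))
        moves.append({"0": i, "+1": i + 1 if i < n - 1 else None,
                      "-1": i - 1 if i > 0 else None, "c+1": succ[i], "c-1": pred[i]})
    return types, moves


def accepts_by_backward_search(delta, q0: str, final: Set[str],
                               w: List[Tuple[str, int]]) -> bool:
    """Decide acceptance of a deterministic DWA by searching the reverse configuration graph."""
    types, moves = analyse(w)
    states = {q0} | set(final) | {k[0] for k in delta} | {v[0] for v in delta.values()}
    children: Dict[Config, List[Config]] = {}   # reverse edges of the configuration graph
    roots: List[Config] = []                    # final halting configurations
    for q in states:
        for i in range(len(w)):
            step = delta.get((q, w[i][0], types[i]))
            target = moves[i][step[1]] if step else None
            if target is None:
                if q in final:
                    roots.append((q, i))
            else:
                children.setdefault((step[0], target), []).append((q, i))
    # Determinism gives every configuration at most one forward edge, so the
    # part of the reverse graph below a halting configuration is a tree and
    # the search terminates without a visited set.
    stack = list(roots)
    while stack:
        config = stack.pop()
        if config == (q0, 0):
            return True
        stack.extend(children.get(config, []))
    return False


# The automaton of Example 1 (every data value occurs at most once).
delta1 = {}
for a in "ab":
    for left in ("min", "cmin"):
        delta1[("scan", a, (left, "cmax"))] = ("scan", "+1")
        delta1[("scan", a, (left, "max"))] = ("accept", "0")
```

Unlike a forward simulation, the backward search never enters a loop of the original automaton, since configurations on a cycle cannot reach a halting configuration.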

**Proposition 5**

*The class of deterministic DWA is effectively closed under union, intersection, and complementation.*

*Proof*

Thanks to Proposition 4 we can assume, without loss of generality, that deterministic DWA never loop, and always halt in the first position of the input word. Under this assumption, closure under complementation simply amounts to swapping the final and the non-final states. Similarly, unions and intersections of deterministic DWA are computed by chaining the automata, that is, by coupling the halting states of one automaton to the initial states of the other. □

## 4 Deterministic vs Non-deterministic DWA

This section is devoted to proving the following separation results:

**Theorem 1**

*There exist data languages recognized by non-deterministic DWA that cannot be recognized by deterministic DWA. There also exist data languages recognized by CMA that cannot be recognized by non-deterministic DWA.*

Intuitively, the proof of the theorem exploits the fact that one can encode binary trees by suitable data words and think of deterministic DWA (resp. non-deterministic DWA, CMA) as deterministic Tree Walking Automata (resp. non-deterministic Tree Walking Automata, classical bottom-up tree automata). One can then use the results from [2, 3] that show that (i) Tree Walking Automata cannot be determinised and (ii) Tree Walking Automata, even non-deterministic ones, cannot recognize all regular tree languages. We develop these ideas in the following subsections.

### 4.1 Encodings of Trees

Hereafter we use the term ‘tree’ (resp. ‘forest’) to denote a generic finite tree (resp. forest) where each node is labelled with a symbol from a finite alphabet Σ and has either 0 or 2 children. To encode trees/forests by data words, we will represent the node-to-left-child and the node-to-right-child relationships by means of the successor functions +1 and ⊕1, respectively. In particular, a leaf will correspond to a position of the data word with no class successor, an internal node will correspond to a position where both class and global successors are defined (and are distinct), and a root will be represented either by the leftmost position in the word or by a position with no class predecessor that is immediately preceded by a position with no class successor.

For example, using the data values *d*, *e*, *f*, *g*, the complete binary tree of height 2 can be encoded by the following data word (to ease the understanding, only the instances of the successor functions +1 and ⊕1 that represent left and right edges in the encoded tree are drawn):

[Figure: data word encoding of the complete binary tree of height 2]

A formal definition of encoding of a tree or forest follows.

**Definition 4**

We say that a data word \(w\in ({\Sigma }\times \mathbb {D})^{+}\) is a *forest encoding* if there is no position *i* such that \(\overrightarrow {\textsf {type}}_{w}(i)=\textsf {1succ}\) and no pair of consecutive positions *i* and *i*+1 such that \(\overrightarrow {\textsf {type}}_{w}(i)=\textsf {2succ}\) and \(\overleftarrow {\textsf {type}}_{w}(i+1)=\textsf {2pred}\).
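The condition of Definition 4 can be checked mechanically. The sketch below is only an illustration: it assumes that a data word is given as a list of (letter, data value) pairs with 0-based positions, and that the forward types max, cmax, 1succ, 2succ (and the symmetric backward types min, cmin, 1pred, 2pred) have the meaning suggested by their use in this section. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

DataWord = List[Tuple[str, int]]  # sequence of (letter, data value) pairs


def class_links(w: DataWord) -> Tuple[List[Optional[int]], List[Optional[int]]]:
    """Return (csucc, cpred): nearest position to the right/left with the same data value."""
    csucc: List[Optional[int]] = [None] * len(w)
    cpred: List[Optional[int]] = [None] * len(w)
    last_seen: Dict[int, int] = {}
    for i, (_, value) in enumerate(w):
        if value in last_seen:
            csucc[last_seen[value]] = i
            cpred[i] = last_seen[value]
        last_seen[value] = i
    return csucc, cpred


def forward_type(i: int, n: int, csucc: List[Optional[int]]) -> str:
    """Assumed reading: max = last position, cmax = no class successor,
    1succ = class successor is i+1, 2succ = class successor differs from i+1."""
    if i == n - 1:
        return "max"
    if csucc[i] is None:
        return "cmax"
    return "1succ" if csucc[i] == i + 1 else "2succ"


def backward_type(i: int, cpred: List[Optional[int]]) -> str:
    """Symmetric to forward_type, looking to the left."""
    if i == 0:
        return "min"
    if cpred[i] is None:
        return "cmin"
    return "1pred" if cpred[i] == i - 1 else "2pred"


def is_forest_encoding(w: DataWord) -> bool:
    """Check the two conditions of Definition 4 in one left-to-right pass."""
    n = len(w)
    if n == 0:
        return False  # data words are non-empty
    csucc, cpred = class_links(w)
    fwd = [forward_type(i, n, csucc) for i in range(n)]
    bwd = [backward_type(i, cpred) for i in range(n)]
    if "1succ" in fwd:
        return False
    return not any(fwd[i] == "2succ" and bwd[i + 1] == "2pred" for i in range(n - 1))
```

For instance, the data values 1, 2, 3, 2, 1, 4, 1 (with arbitrary letters) pass the check, while 1, 2, 1, 2 fail because position 1 has type 2succ and is immediately followed by a position of type 2pred.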

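Given a forest encoding *w*, we denote by forest(*w*) the directed graph that has for nodes the positions of *w*, labelled over Σ, and for edges the pairs (*i*,*j*) such that \(\overrightarrow {\textsf {type}}_{w}(i)=\textsf {2succ}\) and *j*∈{*i*+1, *i*⊕1}.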

The fact that forest(*w*) is indeed a forest, for every data word *w* satisfying the above definition, follows from two basic observations: (i) the edges of forest(*w*) follow the ordering on the positions of *w*, and hence forest(*w*) is a directed acyclic graph, and (ii) for all pairs of distinct positions *i*,*j* in *w*, if \(\overrightarrow {\textsf {type}}(i)=\overrightarrow {\textsf {type}}(j)=\textsf {2succ}\), then *i*+1≠*j*⊕1 (otherwise, we would have *j*<*i*, \(\overrightarrow {\textsf {type}}(i)=\textsf {2succ}\), and \(\overleftarrow {\textsf {type}}(i+1)=\textsf {2pred}\), contradicting Definition 4), and hence nodes in forest(*w*) have in-degree at most 1.

We can identify children, leaves, and roots in forest(*w*), based on the following case distinction:

- if \(\overrightarrow {\textsf {type}}_{w}(i)=\textsf {2succ}\), then *i*+1 and *i*⊕1 are the targets of two edges departing from *i*; we say that *i*+1 and *i*⊕1 are the *left* and *right children* of *i*, respectively;
- if \(\overrightarrow {\textsf {type}}_{w}(i)\in \{\textsf {max},\textsf {cmax}\}\), then *i* has no edge departing from it, in which case *i* is a *leaf*;
- if \(\overleftarrow {\textsf {type}}_{w}(i)=\textsf {min}\), or \(\overleftarrow {\textsf {type}}_{w}(i)=\textsf {cmin}\) and \(\overrightarrow {\textsf {type}}_{w}(i-1)=\textsf {cmax}\), then *i* has no edge entering it, and hence we call it a *root*.

Moreover, if forest(*w*) contains a single root, then it is a tree and we accordingly define tree(*w*)=forest(*w*); otherwise, we simply let tree(*w*) be undefined. Note that every tree of the form tree(*w*) is a *full binary tree*, namely, the internal nodes have always two children.
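As an illustration of the construction of forest(*w*), the following sketch extracts children, roots, and leaves from a data word. It assumes that the input is already a forest encoding, given as a list of (letter, data value) pairs with 0-based positions; the function name `forest` and the returned triple are choices made for this example only.

```python
from typing import Dict, List, Optional, Tuple

DataWord = List[Tuple[str, int]]  # sequence of (letter, data value) pairs


def forest(w: DataWord) -> Tuple[Dict[int, Tuple[int, int]], List[int], List[int]]:
    """Return (children, roots, leaves) of forest(w), assuming w is a forest encoding.

    children maps each internal position i to the pair (left child, right child),
    that is, to (i + 1, class successor of i).
    """
    n = len(w)
    csucc: List[Optional[int]] = [None] * n
    last_seen: Dict[int, int] = {}
    for i, (_, value) in enumerate(w):
        if value in last_seen:
            csucc[last_seen[value]] = i
        last_seen[value] = i

    children: Dict[int, Tuple[int, int]] = {}
    has_parent = [False] * n
    for i in range(n):
        j = csucc[i]
        if j is not None and j != i + 1:  # forward type 2succ
            children[i] = (i + 1, j)
            has_parent[i + 1] = True
            has_parent[j] = True

    roots = [i for i in range(n) if not has_parent[i]]
    leaves = [i for i in range(n) if i not in children]
    return children, roots, leaves
```

On the data values 1, 2, 3, 2, 1, 4, 1 this yields a single root at position 0 with children 1 and 4, so tree(*w*) is defined. On the data values 1, 2, 3, 1, 2 it yields the same shape as on 1, 2, 3, 2, 1, which illustrates that a tree can have several encodings.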

Among the encodings of a tree we distinguish the *canonical encodings*, in which the nodes are listed following the pre-order visit. For example, the above data word *w* corresponds to a canonical encoding, while *w*′ does not. Clearly, each tree *t* has a unique canonical encoding, up to permutations of the data values, which we denote by enc(*t*).
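A canonical encoding can be produced by a pre-order traversal that gives a fresh data value to every left child and lets every right child inherit the data value of its parent. The sketch below is one way to do this, not necessarily the authors' construction; the `Node` class and the use of integers as data values are assumptions of the example, and the input is assumed to be a full binary tree.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def enc(t: Node) -> List[Tuple[str, int]]:
    """Canonical encoding of a full binary tree, up to renaming of data values."""
    word: List[Tuple[str, int]] = []
    fresh = 0
    stack = [(t, 0)]  # pairs (node, data value), processed in pre-order
    while stack:
        node, value = stack.pop()
        word.append((node.label, value))
        if node.left is not None and node.right is not None:
            fresh += 1
            # The right child shares the class of its parent, so that it becomes
            # the class successor of the parent once the left subtree is listed.
            stack.append((node.right, value))
            # The left child gets a fresh value and is listed next (position i + 1).
            stack.append((node.left, fresh))
    return word
```

For the complete binary tree of height 2 this produces a word of length 7 that uses four data values, in the pattern 0, 1, 2, 1, 0, 3, 0.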

*Remark 1*

We conclude this part by observing that the data language consisting of all forest encodings is recognized by a deterministic DWA: for this it suffices to scan the input data word once from left to right and check that the local types satisfy Definition 4. If in addition the DWA checks that there are no occurrences of the local type (cmin,cmax), then the recognized language consists of the valid encodings of full binary trees, namely, those data words *w* such that tree(*w*) is defined. On the other hand, the language of the canonical encodings of forests/trees is not recognizable by any DWA (even non-deterministic ones).

### 4.2 Separations of Tree Automata

We will work in this section with full binary trees, hereafter called simply *trees*. We briefly recall the definition of a tree walking automaton and the separation results from [2, 3]. In a way similar to DWA, we first introduce local types of nodes inside trees. These can be seen as pairs of labels from the finite sets Types^{↓}={leaf,internal} and Types^{↑}={root,leftchild,rightchild}, and they allow us to distinguish between a leaf and an internal node as well as between a root, a left child, and a right child. We use a set TAxis={0,*↑*,↙,↘} of four navigational directions inside a tree: 0 is for staying in the current node, *↑* is for moving to the parent, ↙ is for moving to the left child, and ↘ is for moving to the right child.

**Definition 5**

A *non-deterministic Tree Walking Automaton* (*TWA*) is a tuple \(\mathcal {A}=(Q,{\Sigma },{\Delta },I,F)\), where:

- Σ is the finite alphabet,
- *Q* is the finite set of states,
- \({\Delta } ~\subseteq ~ Q\times {\Sigma }\times \textsf {Types}^{\downarrow }\times \textsf {Types}^{\uparrow } \times Q\times \textsf {TAxis}\) is the transition relation,
- \(I\subseteq Q\) is the set of initial states,
- \(F\subseteq Q\) is the set of final states.

*Runs* of TWA are defined in a way similar to the runs of DWA and begin with the initial state marking the root. The subclass of *deterministic* TWA is obtained by replacing the transition relation Δ with a partial function from *Q*×Σ×Types^{↓}×Types^{↑} to *Q*×TAxis and by letting *I* consist of a single initial state *q*_{0}.

### 4.3 Translations Between TWA and DWA

Given a tree language *L*, we denote by *L*^{enc} the language of all data words that encode (possibly in a non-canonical way) the trees in *L*: \(L^{\textsf {enc}}=\{w ~\mid ~ \textsf {tree}(w)\in L\}\).

**Lemma 1**

*Given a deterministic (resp. non-deterministic) TWA*\(\mathcal {A}\)*recognizing L, one can construct a deterministic (resp. non-deterministic) DWA*\(\mathcal {A}^{\textsf {enc}}\)*recognizing L*^{enc}*. Conversely, given a deterministic (resp. non-deterministic) DWA*\(\mathcal {A}\)*, one can construct a deterministic (resp. non-deterministic) TWA*\(\mathcal {A}^{\textsf {tree}}\)*such that, for any tree t,*\(\mathcal {A}^{\textsf {tree}}\)*accepts t iff*\(\mathcal {A}\)*accepts the canonical encoding enc(t).*

*Proof*

We begin with the first claim. Given a TWA \(\mathcal {A}\), the goal is a DWA \(\mathcal {A}^{\prime }\) such that, for any tree *t* and any encoding *w* of *t*, we have \(t\in \mathcal {L}(\mathcal {A})\) iff \(w\in \mathcal {L}(\mathcal {A}^{\prime })\) (note that we do not specify the behaviour of \(\mathcal {A}^{\prime }\) on the inputs that are not valid encodings of trees). Formally, given the TWA \(\mathcal {A}=(Q,{\Sigma },{\Delta },q_{0},F)\), we define \(\mathcal {A}^{\prime }=(Q,{\Sigma },{\Delta }^{\prime },q_{0},F)\), where Δ′ simulates Δ on the encoding and the tree types *τ*^{↓} and *τ*^{↑} are obtained from the local types \(\overrightarrow {\tau }\) and \(\overleftarrow {\tau }\) of the data word as follows: either *τ*^{↓}=internal or *τ*^{↓}=leaf, depending on whether \(\overrightarrow {\tau } = \textsf {2succ}\) or \(\overrightarrow {\tau } \in \{\textsf {max},\textsf {cmax}\}\), and either *τ*^{↑}=root, or *τ*^{↑}=leftchild, or *τ*^{↑}=rightchild, depending on whether \(\overleftarrow {\tau } = \min \), or \(\overleftarrow {\tau } = \textsf {cmin}\), or \(\overleftarrow {\tau } = \textsf {2pred}\).

We let the reader check that, for all trees *t* and all data word encodings *w* of *t*, \(\mathcal {A}\) accepts *t* iff \(\mathcal {A}^{\prime }\) accepts *w*. We conclude the proof by exploiting the closure properties of DWA and by defining \(\mathcal {A}^{\textsf {enc}}\) as the intersection of \(\mathcal {U}^{\textsf {enc}}\) and \(\mathcal {A}^{\prime }\).

We turn now to the second claim. We fix a deterministic DWA \(\mathcal {A}\) (again, the case of a non-deterministic DWA is similar) and we show how to construct a deterministic TWA \(\mathcal {A}^{\textsf {tree}}\) whose behaviour is the same as the behaviour of \(\mathcal {A}\) when restricted to canonical encodings of trees. For the latter property to make sense, we need to make sure that the behaviour of \(\mathcal {A}\) is invariant on the possible different canonical encodings of trees: this is however easy to see, since canonical encodings are unique up to permutation of the data values, and, similarly, computations of DWA are invariant under permutation of the data values.

We recall that the standard definition of a TWA envisages three possible directions of navigation in a tree: *↑*, ↙, and ↘. For the sake of presentation, we introduce two new axes, denoted \(\leftarrow \) and →, that allow the automaton to move from a certain node *i* respectively to the predecessor \(\leftarrow (i)\) and to the successor \(\rightarrow \!(i)\) of *i*, according to the total ordering induced by the pre-order visit of the tree. We will use these new directions of navigation to mimic the moves of \(\mathcal {A}\) between consecutive positions of a canonical encoding. For instance, a move of \(\mathcal {A}\) from position *i* to position *i*−1 in a canonical encoding *w* of *t* will be simulated by a corresponding move of \(\mathcal {A}^{\textsf {tree}}\) from node *i* to the node that immediately precedes *i* in the pre-order visit of *t*, even in the case when *i* is not a left child (so \(\leftarrow \!(i)\neq \,\uparrow (i)\)). We also observe that allowing moves along the axes \(\leftarrow \) and → does not increase the expressive power of TWA. Indeed, when a transition is executed that takes the automaton from node *i* to node \(j=\,\leftarrow \!(i)\), then two cases can happen depending on the local type of node *i* in *t*: either *i* is a left child, in which case *j* is simply the parent of *i*, or *i* is a right child, in which case *j* is the rightmost leaf in the left subtree of the parent of *i*, i.e. *j*= ↘ ^{n}(↙ (*↑* (*i*))) for *n* sufficiently large, and thus the transition can be simulated by a finite sequence of moves along the axes *↑*, ↙, ↘, ..., ↘. Analogous arguments hold for the transitions that take the automaton from node *i* to node \(j=\,\rightarrow (i)\).
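The simulation of the two pre-order axes by the basic tree moves can be made concrete as follows. This is a small sketch on an explicit tree data structure; the `Node` class with parent pointers and the two function names are assumptions of the example, and the trees are assumed to be full binary trees.

```python
from typing import Optional


class Node:
    """Node of a full binary tree with parent pointers (illustrative only)."""

    def __init__(self, label: str, left: Optional["Node"] = None,
                 right: Optional["Node"] = None) -> None:
        self.label = label
        self.parent: Optional["Node"] = None
        self.left = left
        self.right = right
        for child in (left, right):
            if child is not None:
                child.parent = self


def preorder_predecessor(i: Node) -> Optional[Node]:
    """Predecessor of i in pre-order, using only up, down-left, down-right moves."""
    parent = i.parent
    if parent is None:
        return None          # the root has no predecessor
    if parent.left is i:
        return parent        # left child: the predecessor is the parent
    j = parent.left          # right child: up, then down-left once ...
    while j.right is not None:
        j = j.right          # ... then down-right until a leaf is reached
    return j


def preorder_successor(i: Node) -> Optional[Node]:
    """Successor of i in pre-order, using only up, down-left, down-right moves."""
    if i.left is not None:
        return i.left        # internal node: the successor is the left child
    while i.parent is not None and i.parent.right is i:
        i = i.parent         # climb while we arrive from a right child
    return i.parent.right if i.parent is not None else None
```

Both functions perform a number of basic moves bounded by the height of the tree, which matches the finite sequences of moves described above.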

We also modify our TWA model in order to be able to check simple node properties at each transition – again, this modification does not affect the expressive power. Specifically, we assume that the guards of the transitions of a TWA can distinguish, using refined local types, the last (rightmost) leaf in the pre-order visit of the entire tree from all the other leaves (this feature can be easily implemented via deterministic subcomputations that start in a leaf and look for the deepest ancestor that is not a right child). We thus refine the local type leaf∈Types^{↓} into two new local types rightmostleaf and otherleaf.

In the definition of \(\mathcal {A}^{\textsf {tree}}\), the local types \(\overleftarrow {\tau }\) and \(\overrightarrow {\tau }\) are obtained from *τ*^{↑} and *τ*^{↓} as follows: \(\overleftarrow {\tau } ~=~ \min ~/~ \textsf {cmin} ~/~ \textsf {2pred}\) depending on *τ*^{↑} = root / leftchild / rightchild, and \(\overrightarrow {\tau } ~=~ \textsf {max} ~/~ \textsf {cmax} ~/~ \textsf {2succ}\) depending on *τ*^{↓} = rightmostleaf / otherleaf / internal. By a slight abuse of notation, we can identify the nodes in a tree *t* with the corresponding positions in the canonical encoding enc(*t*). Under this assumption, it becomes easy to verify that every transition \((p,i) \:\overset {t}{\longrightarrow }\: (q,j)\) of the TWA \(\mathcal {A}^{\textsf {tree}}\) on a tree *t* can be seen as a transition \((p,i) \:\overset {\textsf {enc}(t)}{\longrightarrow }\: (q,j)\) of the DWA \(\mathcal {A}\) on the canonical encoding enc(*t*) and, symmetrically, every transition of \(\mathcal {A}\) on the canonical encoding enc(*t*) can be seen as a transition of \(\mathcal {A}^{\textsf {tree}}\) on the tree *t*. Analogous properties for the runs of \(\mathcal {A}\) and \(\mathcal {A}^{\textsf {tree}}\) follow by a simple inductive argument. This shows that \(\mathcal {A}^{\textsf {tree}}\) accepts precisely those trees whose canonical encodings are accepted by \(\mathcal {A}\). □

**Lemma 2**

*Given a tree automaton*\(\mathcal {A}\)*recognizing a regular language L, one can construct a CMA*\(\mathcal {A}^{\textsf {enc}}\)*recognizing L*^{enc}.

*Proof*

The tree automaton \(\mathcal {A}\) can be seen as a Tiling Automaton on trees or, equivalently, as a Tiling Automaton on encodings of trees (recall that, by Definition 4, nodes that are neighbors inside a tree *t* correspond, in any data word that encodes *t*, to positions that are also neighbors). The data language of all valid encodings of trees is also recognized by a Tiling Automaton. The claim follows from Proposition 2 and the fact that CMA are closed under intersection. □

We are now ready to transfer the separation results to data languages:

*Proof of Theorem 1*

Let *L*_{1} be a language recognized by a non-deterministic TWA \(\mathcal {A}_{1}\) that cannot be recognized by deterministic TWA (recall that such a language exists thanks to the first claim of Theorem 2). Using the first claim of Lemma 1, we construct a non-deterministic DWA \(\mathcal {A}_{1}^{\textsf {enc}}\) such that \(\mathcal {L}(\mathcal {A}_{1}^{\textsf {enc}})=L_{1}^{\textsf {enc}}\). Suppose by way of contradiction that there is a deterministic DWA \(\mathcal {B}_{1}\) that also recognizes \(L_{1}^{\textsf {enc}}\). We apply the second claim of Lemma 1 and we obtain a deterministic TWA \(\mathcal {B}_{1}^{\textsf {tree}}\) that accepts all and only the trees whose canonical encodings are accepted by \(\mathcal {B}_{1}\). Since \(L_{1}^{\textsf {enc}}=\{w ~\mid ~ \textsf {tree}(w)\in L_{1}\}\) is invariant under equivalent encodings of trees (that is, \(w\in L_{1}^{\textsf {enc}}\) iff \(w^{\prime }\in L_{1}^{\textsf {enc}}\) whenever \(\textsf {tree}(w)=\textsf {tree}(w^{\prime })\)), we have that *t*∈*L*_{1} iff \(\textsf {enc}(t)\in L_{1}^{\textsf {enc}}\), iff \(t\in \mathcal {L}(\mathcal {B}_{1}^{\textsf {tree}})\). We have just shown that the deterministic TWA \(\mathcal {B}_{1}^{\textsf {tree}}\) recognizes the language *L*_{1}, which contradicts the assumption on *L*_{1}.

By applying similar arguments to a regular tree language *L*_{2} that is not recognizable by non-deterministic TWA (cf. second claim of Theorem 2), one can separate CMA from non-deterministic DWA. □

Finally, we observe that the language *L*_{bridges} is definable in the two-variable fragment of first-order logic with access to the linear order < on positions and either the class successor predicate ⊕1 or the data equality predicate ∼. It is also definable in Basic Data LTL, a linear temporal logic with 2-sorted operators, working over the string projection and the data classes, see [9].

## 5 A Fragment of First-order Logic Captured by DWA

Two-variable fragments of first-order logics have been extensively studied in the literature, especially in connection with data languages. For example, in [4] the logic FO^{2}[Σ,+1,≤,∼], which uses the global successor, the total ordering relation, and a third predicate ∼ for comparing data values, has been considered and proved decidable by reduction to emptiness of Data Automata. Other examples of logical formalisms that use at most two variables and some binary predicates were studied in [5, 11, 15].

In this section we consider the two-variable fragment of first-order logic with global successor and class successor predicates, and we prove that sentences in this logic can be translated to equivalent deterministic DWA. More precisely, the logic under consideration is denoted FO^{2}[Σ,+1,⊕1] and consists of first-order formulas that use at most two variable names, unary predicates corresponding to the letters in the finite alphabet Σ, and the global and class successors predicates +1 and ⊕1. Data words can be naturally seen as models of FO^{2}[Σ,+1,⊕1] sentences.

The translation of FO^{2}[Σ,+1,⊕1] sentences into deterministic DWA follows from two basic observations:

1. every FO^{2}[Σ,+1,⊕1] sentence can be rewritten into a Boolean combination of locally threshold testable conditions of the form *“local property α(x) is satisfied on k distinct positions”*, where “local property” roughly means a formula that can be evaluated over a small neighborhood of the position;
2. deterministic DWA can easily count, up to some given bound, the number of positions *x* in a data word where a certain local property *α*(*x*) holds; since they are closed under unions and intersections, deterministic DWA can thus evaluate Boolean combinations of locally threshold testable conditions.

We give a direct translation from FO^{2}[Σ,+1,⊕1] formulas to Boolean combinations of locally threshold testable conditions. This choice is also motivated by the fact that two-variable formulas cannot describe precisely the isomorphism types of subgraphs induced by neighboring positions, as is the case for instance with full first-order formulas.

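We associate with each data word *w* a graph *G*_{w} that consists of a labelled set of vertices *V* and three edge relations, where:

- *V*=(*V*_{a})_{a∈Σ} is the partition of the domain of *w* into sets *V*_{a}={*i* ∣ *w*(*i*)=*a*},
- the first edge relation contains the pairs (*i*,*j*) such that *j*=*i*+1 and *j*=*i*⊕1 (i.e. *j* is both a successor and a class successor of *i*),
- the second edge relation contains the pairs (*i*,*j*) such that *j*=*i*+1 and either *i*⊕1 is undefined or *j*≠*i*⊕1,
- the third edge relation contains the pairs (*i*,*j*) such that *j*=*i*⊕1 and either *i*+1 is undefined or *j*≠*i*+1.

We denote by dist_{w}(*i*,*j*) the length of the shortest path between *i* and *j* in the undirected graph obtained from *G*_{w}.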

We use shorthands for the edge relations of *G*_{w}, one for each of the formulas *y*=*x*+1∧*y*=*x*⊕1, *y*=*x*+1∧*y*≠*x*⊕1, and *y*≠*x*+1∧*y*=*x*⊕1, and we write dist(*x*,*y*)>1 for *y*≠*x*+1∧*y*≠*x*⊕1∧*x*≠*y*+1∧*x*≠*y*⊕1. Moreover, we will assume that all existential quantifications in FO^{2}[Σ,+1,⊕1] are of the form

\(\exists y~ \left (\varphi (y) \:\wedge \: \tau (x,y)\right ) \qquad (\ddagger )\)

where *φ*(*y*) does not contain any free occurrence of the variable *x* and *τ*(*x*,*y*) is a formula among the edge relations above and dist(*x*,*y*)>1. We can do so, without loss of generality, because every atomic relation between *x* and *y* (e.g. *y*=*x*⊕1) can be seen as a disjunction of formulas *τ*(*x*,*y*) of the previous form and because existential quantification commutes with disjunction.

We now consider FO^{2}[Σ,+1,⊕1] formulas with one free variable *x* that can be evaluated over small neighborhoods of *G*_{w}. Formally, we define *ℓ-local formulas* by induction on \(\ell \in \mathbb {N}\), as follows:

- *a*(*x*) is 0-local for all letters *a*∈Σ,
- *α*(*x*) ∧ *β*(*x*) is *ℓ*-local if both *α*(*x*) and *β*(*x*) are *ℓ*-local,
- ¬*α*(*x*) is *ℓ*-local if *α*(*x*) is *ℓ*-local,
- ∃*y* (*α*(*y*) ∧ *τ*(*x*,*y*)) is (*ℓ*+1)-local if *α*(*y*) is *ℓ*-local and *τ*(*x*,*y*) entails dist(*x*,*y*)=1, namely, *τ*(*x*,*y*) is one of the edge relations of *G*_{w}.

Deterministic DWA can evaluate *ℓ*-local formulas on data words via computations of bounded length that start and end in the same position:

**Lemma 3**

*Given an ℓ-local formula α(x), one can construct a deterministic DWA*\(\mathcal {A}\)*such that, for all data words w and all positions i, if*\(\mathcal {A}\)*starts in i, then it halts again in i and it accepts or rejects depending on whether*\((w,i)\vDash \alpha (x)\).

*Proof*

The construction of the automaton \(\mathcal {A}\) follows the syntactic structure of the local formula and exploits basic closure properties of the considered subclass of deterministic DWA.

The base case consists of translating a predicate *a*(*x*) into a single-transition automaton that moves from the initial state to a halting state that is either accepting or rejecting depending on the label of the current position.

The inductive cases *α*(*x*) ∧ *β*(*x*) and ¬*α*(*x*) are dealt with by using the standard constructions of concatenation of runs and complementation of automata. Finally, the translation of an (*ℓ*+1)-local formula \(\exists y~ \left (\alpha (y) \:\wedge \: \tau (x,y)\right )\) is done by a simple case distinction based on the form of *τ*(*x*,*y*). For instance, if *τ*(*x*,*y*) is the relation *y*=*x*+1∧*y*=*x*⊕1, then the automaton tests that the current position has type 1succ (if not, it halts and rejects), then moves to the successor of the current position, simulates the automaton for the *ℓ*-local formula *α*(*y*), moves back to the original position, and halts in an accepting or rejecting state depending on the result of the subcomputation for *α*(*y*). Analogous constructions can be given for the remaining cases. □

Thanks to the above lemma and to the fact that, up to logical equivalence, there exist only finitely many *ℓ*-local formulas, we can treat *ℓ*-local formulas in the same way as we treat letters from the finite alphabet Σ.

To show that FO^{2}[Σ,+1,⊕1] definable data languages are recognized by deterministic DWA, we will first give a translation of FO^{2}[Σ,+1,⊕1] sentences towards Boolean combinations of constraints that count, up to a certain threshold, the number of positions satisfying some local formulas. For this we introduce an intermediate logical language, denoted \(\textsf {FO}^{2}_{\textsf {count}}[{\Sigma },+1,\oplus 1]\), that extends FO^{2}[Σ,+1,⊕1] by adding sentences with counting quantifiers of the form ∃^{≥k}*y* *α*(*y*), where \(k\in \mathbb {N}\) and *α*(*y*) is an *ℓ*-local formula for some \(\ell \in \mathbb {N}\). These new sentences are interpreted on data words in the following natural way:

\(w \vDash \exists ^{\geq k}{y}\,\alpha (y) \quad \text {iff}\quad |\{ i ~\mid ~ (w,i)\vDash \alpha (y)\}| \geq k\).

We will translate any FO^{2}[Σ,+1,⊕1] formula *φ*(*x*) into an equivalent Boolean combination of *ℓ*-local formulas *α*(*x*), for a suitable \(\ell \in \mathbb {N}\), and global constraints of the form ∃^{≥k}*y* *α*(*y*). The following lemma provides the inductive translation in the most interesting case (i.e. quantification over points that are far from the instance of the free variable):

**Lemma 4**

*Let* \(\varphi (x)=\exists y~ \left (\alpha (y) ~\wedge ~ \textsf {dist}(x,y)>1\right )\) *be an FO*^{2}*[Σ,+1,⊕1] formula, where α(y) is ℓ-local. Let E be the set of all edge relations witnessing distance 1. We have that φ(x) is logically equivalent to the following Boolean combination of (ℓ+1)-local formulas and global constraints:*

\(\bigvee _{I\subseteq E\cup \{0\}} \Big ( \bigwedge _{e\in I} \alpha ^{e}(x) ~\wedge ~ \bigwedge _{e\in (E\cup \{0\})\setminus I} \neg \alpha ^{e}(x) ~\wedge ~ \exists ^{\geq |I|+1}{y}\,\alpha (y) \Big )\)

*where α*^{0}*(x)=α(x) and α*^{e}*(x)=∃y. (α(y) ∧ xey) for all e∈E (note that α*^{e}*(x) is an (ℓ+1)-local formula).*

*Proof*

The proof of this lemma is a case distinction based on which positions *y* at distance at most 1 from *x* satisfy the local formula *α*(*y*). Precisely, we consider some data word *w* and a position *i* in it. For each *e*∈*E*, we denote by *j*_{e} the unique position in *G*_{w} such that *i*e*j*_{e} (note that dist(*i*,*j*_{e})=1). For convenience, we also let *j*_{0}=*i*. We then define *I* to be the set of all indices \(e\in E\cup \{0\}\) such that \((w,j_{e})\vDash \alpha (y)\). By construction, *w* contains exactly |*I*| positions at distance at most 1 from *i* that satisfy *α*(*y*). We conclude that \((w,i)\vDash \varphi (x)\) iff there is a position *y* at distance more than 1 from *x* that satisfies *α*(*y*), iff *w* contains at least |*I*|+1 positions that satisfy *α*(*y*). □
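The counting argument in this proof can be sanity-checked by brute force on small data words. The sketch below is not part of the proof; it assumes that a data word is a sequence of (letter, data value) pairs with 0-based positions, that the positions at distance 1 from *i* are its global and class successors and predecessors, and that *α* is given as an arbitrary Python predicate on positions. All function names are hypothetical.

```python
from typing import Callable, Sequence, Set, Tuple

DataWord = Sequence[Tuple[str, int]]
Predicate = Callable[[DataWord, int], bool]


def neighbours(w: DataWord, i: int) -> Set[int]:
    """Positions at distance exactly 1 from i in the undirected version of G_w."""
    n = len(w)
    result: Set[int] = set()
    if i + 1 < n:
        result.add(i + 1)            # global successor
    if i - 1 >= 0:
        result.add(i - 1)            # global predecessor
    for j in range(i + 1, n):        # class successor
        if w[j][1] == w[i][1]:
            result.add(j)
            break
    for j in range(i - 1, -1, -1):   # class predecessor
        if w[j][1] == w[i][1]:
            result.add(j)
            break
    return result


def far_witness(w: DataWord, i: int, alpha: Predicate) -> bool:
    """Direct semantics: some position at distance > 1 from i satisfies alpha."""
    near = neighbours(w, i) | {i}
    return any(alpha(w, j) for j in range(len(w)) if j not in near)


def via_counting(w: DataWord, i: int, alpha: Predicate) -> bool:
    """Counting semantics: at least |I| + 1 positions of w satisfy alpha,
    where |I| is the number of positions at distance at most 1 that satisfy alpha."""
    near = neighbours(w, i) | {i}
    size_of_I = sum(1 for j in near if alpha(w, j))
    total = sum(1 for j in range(len(w)) if alpha(w, j))
    return total >= size_of_I + 1
```

On every data word and every position the two functions agree, which is exactly the equivalence used at the end of the proof.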

We can now show how to turn an FO^{2}[Σ,+1,⊕1] sentence into a Boolean combination of constraints of the form ∃^{≥k}*y**α*(*y*), with *α*(*y*) local:

**Theorem 3**

*Every FO*^{2}*[Σ,+1,⊕1] sentence is logically equivalent to a Boolean combination of global constraints of the form ∃*^{≥k}*yα(y), where*\(k\in \mathbb {N}\)*and α(y) is ℓ-local for some*\(\ell \in \mathbb {N}\).

*Proof*

We show that every FO^{2}[Σ,+1,⊕1] formula (or sentence) Ψ can be transformed into a *normal form* that consists of a Boolean combination of *ℓ*-local formulas, for some \(\ell \in \mathbb {N}\), and global constraints ∃^{≥k}*y* *α*(*y*). To prove this claim we use an induction on the number *N* of subformulas of Ψ that have a single free variable and are not yet normalized. The base case *N*=0 is vacuously true. As for the inductive step, we consider an innermost subformula *ϕ*(*x*) of Ψ that is not yet normalized and we show how to normalize it. Since all proper subformulas of *ϕ*(*x*) are normalized, we know that *ϕ*(*x*) cannot be local, nor can it start with a Boolean connective (otherwise *ϕ*(*x*) would have been already in normal form). Moreover, recall that every universally quantified formula ∀*y* *φ*(*y*) can be seen as a shorthand for ¬∃*y* ¬*φ*(*y*), and that existentially quantified formulas are assumed to be in the form defined by Eq. (‡). Based on these arguments, we know that *ϕ*(*x*) is of the form \(\exists y~ \left (\varphi (y) \:\wedge \: \tau (x,y)\right )\), where *φ*(*y*) is normalized and contains no free occurrence of the variable *x*. We then consider the global constraints that occur as maximal subformulas of *φ*(*y*): since these are sentences with no free variable, they commute with the existential quantification on *y*. In particular, *ϕ*(*x*) is logically equivalent to a Boolean combination of global constraints and formulas of the form \(\exists y~ \left (\alpha (y) \:\wedge \: \tau (x,y)\right )\), where *α*(*y*) is *ℓ*-local. If *τ*(*x*,*y*) entails dist(*x*,*y*)=1, then such a formula is (*ℓ*+1)-local; otherwise *τ*(*x*,*y*) is dist(*x*,*y*)>1 and Lemma 4 applies. In both cases we obtain a normal form for *ϕ*(*x*). □

**Corollary 1**

*Deterministic DWA recognize all data languages that are definable in FO*^{2}*[Σ,+1,⊕1] (or even in*\(\textsf {FO}^{2}_{\textsf {count}}[{\Sigma },+1,\oplus 1]\)*).*

*Proof*

By Theorem 3, every FO^{2}[Σ,+1,⊕1] sentence Ψ is equivalent to a Boolean combination \({\Psi }^{\prime }\) of global constraints \(\gamma _{1},...,\gamma _{n}\), where *γ*_{j} = ∃^{≥k_j}*y* *α*_{j}(*y*) for all 1≤*j*≤*n*, with \(k_{1},...,k_{n}\in \mathbb {N}\) and *α*_{1}(*y*),...,*α*_{n}(*y*) local formulas. We can use Lemma 3 to turn each local formula *α*_{j}(*y*) into an equivalent deterministic DWA \(\mathcal {A}_{j}\). Moreover, we can introduce a new alphabet \({\Gamma }=\mathcal {P}(\{c_{1},...,c_{n}\})\) and construct a deterministic finite state automaton \(\mathcal {B}\) that scans any word \(\hat {w}\in {\Gamma }^{*}\), storing in its control state the number *h*_{j} of occurrences of each predicate *c*_{j}, up to threshold *k*_{j}, and accepting iff the formula \({\Psi }^{\prime }\) is satisfied when we substitute each constraint \(\gamma _{j} ~=~ \exists ^{\geq k_{j}}{y}\alpha _{j}(y)\) with the condition *h*_{j}≥*k*_{j}. Now, we let *L* be the data language defined by Ψ. By construction, \(\mathcal {B}\) recognizes the language of all words \(\hat {w}\in {\Gamma }^{*}\) whose counts *h*_{j} satisfy \({\Psi }^{\prime }\) under this substitution. Replacing in \(\mathcal {B}\) each test of a predicate *c*_{j} with the subautomaton \(\mathcal {A}_{j}\) results in a deterministic DWA that recognizes the language *L*. □
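The behaviour of the finite state automaton \(\mathcal {B}\) can be sketched as a single scan with capped counters. In the sketch below, a letter of Γ is modelled as a Python set of indices *j* (standing for the predicates *c*_{j}), and the Boolean combination \({\Psi }^{\prime }\) is modelled as a function on the truth values of the constraints; both representations and the function names are assumptions of the example.

```python
from typing import Callable, List, Sequence, Set


def threshold_counts(word_hat: Sequence[Set[int]], thresholds: List[int]) -> List[int]:
    """Count the occurrences of each predicate index j, capped at thresholds[j].

    The capped vector takes finitely many values, so it fits in a finite control state.
    """
    h = [0] * len(thresholds)
    for symbol in word_hat:
        for j in symbol:
            h[j] = min(h[j] + 1, thresholds[j])
    return h


def accepts(word_hat: Sequence[Set[int]], thresholds: List[int],
            psi: Callable[[List[bool]], bool]) -> bool:
    """Evaluate the Boolean combination psi on the conditions h_j >= k_j."""
    h = threshold_counts(word_hat, thresholds)
    return psi([h[j] >= k for j, k in enumerate(thresholds)])
```

For example, with thresholds 2 and 1 and the combination "first constraint and not second constraint", the word {0}, {0}, {} is accepted, while {0}, {0, 1} is rejected.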

## 6 Decision Problems on DWA

We analyze in detail the complexity of the decision problems on DWA. We start by considering the simpler membership problem, which consists of deciding whether \(w\in \mathcal {L}(\mathcal {A})\) for a DWA \(\mathcal {A}\) and a data word *w*, both given as input. Subsequently, we move to the emptiness and universality problems, which consist of deciding, respectively, whether a given DWA accepts at least one data word and whether a given DWA accepts all data words. We will show that these problems are decidable, as well as the more general problems of containment and equivalence.

### 6.1 Membership

Compared to other classes of automata on data words (e.g. CMA, Register Automata), deterministic DWA have a membership problem of very low time/space complexity. Moreover, the complexity of the membership problem does not get much worse if we consider non-deterministic DWA. We assume the reader to be familiar with circuit complexity and, in particular, with constant-depth (e.g. AC^{0}) reductions [18].

**Proposition 6**

*The membership problem for a deterministic DWA*\(\mathcal {A}\)*and a data word w is decidable in time*\(\mathcal {O}(|w|\cdot |\mathcal {A}|)\)*and is*LogSpace-*complete under*AC^{0}*reductions. Similarly, the membership problem for non-deterministic DWA is*NLogSpace-*complete.*

*Proof*

To decide in deterministic linear time whether a given deterministic DWA \(\mathcal {A}\) accepts a given data word *w*, it is not sufficient to simply simulate the run of \(\mathcal {A}\) on *w*, since \(\mathcal {A}\) may reject *w* by entering an infinite loop. We use instead Proposition 4 to compute a non-looping deterministic DWA \(\mathcal {A}^{\prime }\) equivalent to \(\mathcal {A}\). Recall that \(\mathcal {A}^{\prime }\) can be computed from \(\mathcal {A}\) in linear time and hence \(|\mathcal {A}^{\prime }|=\mathcal {O}(|\mathcal {A}|)\). Then we simulate the run of \(\mathcal {A}^{\prime }\) on *w*. Overall, this requires time \(\mathcal {O}(|\mathcal {A}|+|\mathcal {A}^{\prime }|\cdot |w|)=\mathcal {O}(|\mathcal {A}|\cdot |w|)\) and space \(\mathcal {O}(\log |\mathcal {A}|+\log |w|)\). For hardness, we note that the membership problem is LogSpace-hard under AC^{0} reductions already for deterministic finite state automata (see, for example, [7]).

As for non-deterministic DWA, it suffices to observe that a non-deterministic logarithmic-space Turing machine can easily guess and simulate a run of a given DWA \(\mathcal {A}\) on a given data word *w*. This shows that the membership problem for non-deterministic DWA is in NLogSpace. Moreover, the membership problem is known to be NLogSpace-hard already for non-deterministic finite state automata. □
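For experimentation, membership for a deterministic DWA can also be tested by a direct simulation that remembers the visited configurations. Note that this is a different technique from the one used in the proof: it detects loops by storing up to |*Q*|·|*w*| configurations and therefore does not run in logarithmic space. The sketch assumes a simplified DWA interface that is not taken from the paper: the transition function is a Python callable on (state, letter, local type), the axis labels are "0", "+1", "-1", "c+1", "c-1", the run starts at the first position, the automaton halts when no transition applies, and an undefined move is treated as rejection.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

DataWord = List[Tuple[str, int]]
LocalType = Tuple[str, str]              # (backward type, forward type)
Step = Optional[Tuple[str, str]]         # (next state, axis), or None to halt
Delta = Callable[[str, str, LocalType], Step]


def accepts(w: DataWord, delta: Delta, q0: str, finals: Set[str]) -> bool:
    n = len(w)
    csucc: List[Optional[int]] = [None] * n
    cpred: List[Optional[int]] = [None] * n
    last_seen: Dict[int, int] = {}
    for i, (_, value) in enumerate(w):
        if value in last_seen:
            csucc[last_seen[value]] = i
            cpred[i] = last_seen[value]
        last_seen[value] = i

    def local_type(i: int) -> LocalType:
        if i == n - 1:
            fwd = "max"
        elif csucc[i] is None:
            fwd = "cmax"
        else:
            fwd = "1succ" if csucc[i] == i + 1 else "2succ"
        if i == 0:
            bwd = "min"
        elif cpred[i] is None:
            bwd = "cmin"
        else:
            bwd = "1pred" if cpred[i] == i - 1 else "2pred"
        return (bwd, fwd)

    def move(i: int, axis: str) -> Optional[int]:
        target = {"0": i, "+1": i + 1, "-1": i - 1,
                  "c+1": csucc[i], "c-1": cpred[i]}[axis]
        return target if target is not None and 0 <= target < n else None

    state, pos = q0, 0
    seen: Set[Tuple[str, int]] = set()
    while (state, pos) not in seen:
        seen.add((state, pos))
        step = delta(state, w[pos][0], local_type(pos))
        if step is None:
            return state in finals       # the automaton halts here
        state, axis = step
        target = move(pos, axis)
        if target is None:
            return False                 # undefined move: rejection (assumption)
        pos = target
    return False                         # repeated configuration: infinite loop
```

Since a deterministic run that repeats a configuration never halts, the loop executes at most |*Q*|·|*w*| iterations before answering.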

### 6.2 Emptiness

We start by reducing the emptiness of CMA to the emptiness of deterministic DWA (or, equivalently, to universality of deterministic DWA). For this purpose, it is convenient to first translate the input CMA \(\mathcal {A}\) into an equivalent Tiling Automaton \(\mathcal {T}=({\Sigma },{\Gamma },T)\), using Proposition 2. We denote by \(\textsf {Tilings}(\mathcal {T})\) the set of data words over \({\Sigma }\times \mathbb {D}\) expanded by valid tilings on them – we think of the latter set as a data language over the alphabet \({\Gamma }\times {\Sigma }\times \mathbb {D}\). Now, given a data word \(\tilde {w} \in ({\Gamma }\times {\Sigma }\times \mathbb {D})^{*}\), checking whether \(\tilde {w}\) belongs to \(\textsf {Tilings}(\mathcal {T})\) reduces to checking constraints on neighborhoods of positions. Since this can be done by a deterministic DWA, we get the following result:

**Proposition 7**

*Given a Tiling Automaton*\(\mathcal {T}\)*, one can construct in polynomial time a deterministic DWA*\(\mathcal {T}^{\textsf {tiling}}\)*that recognizes the data language*\(\textsf {Tilings}(\mathcal {T})\).

Three important corollaries follow from the above proposition. The corollaries concern the operation of *functional projection*, formally specified by a function \(f:{\Sigma }\rightarrow {\Sigma }^{\prime }\) and mapping any data word *w* = (*a*_{1},*d*_{1}) … (*a*_{n},*d*_{n}) over \({\Sigma }\times \mathbb {D}\) to the data word *f*(*w*) = (*f*(*a*_{1}),*d*_{1}) … (*f*(*a*_{n}),*d*_{n}) over \({\Sigma }^{\prime }\times \mathbb {D}\).
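As a small illustration of functional projection, the sketch below applies a letter-to-letter function to a data word while leaving the data values untouched. The representation of data words as lists of (letter, data value) pairs and the function name are assumptions of the example.

```python
from typing import Callable, Hashable, List, Tuple


def functional_projection(f: Callable[[Hashable], Hashable],
                          w: List[Tuple[Hashable, int]]) -> List[Tuple[Hashable, int]]:
    """Apply f to the letter of every position; data values are preserved."""
    return [(f(a), d) for a, d in w]
```

A typical use is to erase an annotation: if letters are pairs (tile, letter) from Γ×Σ, projecting each pair to its second component maps a tiled data word back to the underlying data word over \({\Sigma }\times \mathbb {D}\).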

**Corollary 2**

*Data languages recognized by CMA are functional projections of data languages recognized by deterministic DWA.*

**Corollary 3**

*The class of non-deterministic DWA and that of deterministic DWA are not closed under functional projections.*

*Proof*

If non-deterministic DWA captured functional projections of deterministic DWA, then, by the previous result, they would also capture the languages recognized by CMA, which would contradict Theorem 1. □

**Corollary 4**

*Emptiness and universality of deterministic DWA are at least as hard as emptiness of CMA, which in turn is at least as hard as reachability in Petri nets* [4].

We now turn to showing that languages recognized by non-deterministic DWA are also recognized by CMA, and hence emptiness of DWA is reducible to emptiness of CMA. Let \(\mathcal {A}=(Q,{\Sigma },{\Delta },I,F)\) be a non-deterministic DWA. Without loss of generality, we can assume that \(\mathcal {A}\) has a single initial state *q*_{0} and a single final state *q*_{f}. We can also assume that whenever \(\mathcal {A}\) accepts a data word *w*, it does so by halting in the rightmost position of *w*. For the sake of brevity, given a transition *δ*=(*p*,*a*,*τ*,*q*,*α*)∈Δ, we define source(*δ*)=*p*, target(*δ*)=*q*, letter(*δ*)=*a*, type(*δ*)=*τ*, and reach(*δ*)=*α*. Below, we introduce the concept of min-flow, which can be thought of as a special form of tiling that witnesses acceptance of a data word *w* by \(\mathcal {A}\). Min-flows are similar to crossing sequences, which were used by Rabin and Scott in [14] to transform two-way finite state automata to equivalent one-way automata – a difference here is that we cannot avoid, or easily detect, the presence of disconnected components in a min-flow.

**Definition 6**

Let *w*=(*a*_{1},*d*_{1}) … (*a*_{n},*d*_{n}) be a data word of length *n*. A *min-flow on w* is any map \(\mu : [n] \rightarrow 2^{\Delta }\) that satisfies the following conditions:

1. There is a transition *δ*∈*μ*(1) such that source(*δ*)=*q*_{0};
2. There is a transition *δ*∈*μ*(*n*) such that target(*δ*)=*q*_{f};
3. For all positions *i*∈[*n*], if *δ*∈*μ*(*i*), then letter(*δ*)=*a*_{i} and type(*δ*)=type_{w}(*i*);
4. For each *i*∈[*n*] and each *q*∈*Q*, there is at most one transition *δ*∈*μ*(*i*) such that source(*δ*)=*q*;
5. For each *i*∈[*n*] and each *q*∈*Q*, there is at most one position *j*∈[*n*] for which there is *δ*∈*μ*(*j*) such that target(*δ*)=*q* and *i*=reach(*δ*)(*j*);
6. For each *i*∈[*n*], let exiting(*i*) be the set of all states of the form source(*δ*) for some *δ*∈*μ*(*i*); similarly, let entering(*i*) be the set of all states of the form target(*δ*) for some *δ*∈*μ*(*j*) and some *j*∈[*n*] such that *i*=reach(*δ*)(*j*); our last condition states that for all positions *i*∈[*n*],
   - (a) if *i*=1, then entering(*i*)=exiting(*i*)∖{*q*_{0}},
   - (b) if *i*=*n*, then exiting(*i*)=entering(*i*)∖{*q*_{f}},
   - (c) otherwise, exiting(*i*)=entering(*i*).
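The six conditions of a min-flow are all local checks, which the following Python sketch makes explicit. It is only an illustration under stated assumptions: the `Transition` record, the callables `type_w` and `reach` (standing for type_{w} and for the interpretation of reach(*δ*) on the word), the treatment of an undefined reach as a violation, and the choice to apply case (a) when *n*=1 are all assumptions of this sketch. The axes `"+1"` and `"0"` in the usage example are hypothetical stand-ins for the axes of the model. Words are assumed non-empty.

```python
from typing import Callable, Dict, Hashable, List, NamedTuple, Optional, Set, Tuple


class Transition(NamedTuple):
    source: str
    letter: str
    type: Hashable
    target: str
    reach: Hashable  # symbolic axis, interpreted by the `reach` callable below


def is_min_flow(
    mu: Dict[int, Set[Transition]],
    letters: List[str],
    type_w: Callable[[int], Hashable],
    reach: Callable[[Hashable, int], Optional[int]],
    q0: str,
    qf: str,
) -> bool:
    """Check conditions 1-6 of a min-flow; positions are numbered 1..n (n >= 1)."""
    n = len(letters)
    positions = range(1, n + 1)
    # Conditions 1 and 2: q0 leaves position 1 and some transition in mu(n) targets qf.
    if not any(d.source == q0 for d in mu[1]):
        return False
    if not any(d.target == qf for d in mu[n]):
        return False
    exiting: Dict[int, Set[str]] = {i: set() for i in positions}
    entering: Dict[int, Set[str]] = {i: set() for i in positions}
    origins: Dict[Tuple[int, str], Set[int]] = {}
    for j in positions:
        for d in mu[j]:
            # Condition 3: letter and type must match position j.
            if d.letter != letters[j - 1] or d.type != type_w(j):
                return False
            # Condition 4: at most one transition per source state at position j.
            if d.source in exiting[j]:
                return False
            exiting[j].add(d.source)
            i = reach(d.reach, j)
            if i is None:  # assumption: undefined reach is treated as a violation
                return False
            entering[i].add(d.target)
            origins.setdefault((i, d.target), set()).add(j)
    # Condition 5: each state enters a given position from at most one position j.
    if any(len(js) > 1 for js in origins.values()):
        return False
    # Condition 6: flow conservation, with q0 and qf as the only imbalances.
    for i in positions:
        if i == 1:
            ok = entering[i] == exiting[i] - {q0}
        elif i == n:
            ok = exiting[i] == entering[i] - {qf}
        else:
            ok = exiting[i] == entering[i]
        if not ok:
            return False
    return True


# Usage on a two-letter word with hypothetical axes "+1" (next position) and "0" (stay).
def reach(alpha: Hashable, j: int, n: int = 2) -> Optional[int]:
    if alpha == "0":
        return j
    return j + 1 if alpha == "+1" and j < n else None


d1 = Transition("q0", "a", "t", "q1", "+1")
d2 = Transition("q1", "b", "t", "qf", "0")
mu = {1: {d1}, 2: {d2}}
print(is_min_flow(mu, ["a", "b"], lambda i: "t", reach, "q0", "qf"))  # True
```

Each check inspects only a position and the positions it is linked to, which is what makes min-flows amenable to being guessed as tilings.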

**Lemma 5**

\(\mathcal {A}\) *accepts w iff there is a min-flow μ on w.*

*Proof*

Let *w*=(*a*_{1},*d*_{1}) ⋯ (*a*_{n},*d*_{n}) be a data word of length *n* and let *ρ* be a successful run of \(\mathcal {A}\) on *w* of the form \((q_{0},i_{0}) \overset {\textsf {w}}{\longrightarrow } (q_{1},i_{1}) \overset {\textsf {w}}{\longrightarrow } \ldots \: (q_{m},i_{m})\) obtained by the sequence of transitions *δ*_{1},…,*δ*_{m}. Without loss of generality, we can assume that no position in *ρ* is visited twice with the same state (indeed, if *i*_{k}=*i*_{h} and *q*_{k}=*q*_{h} for some *k*≠*h*, then *ρ* would contain a loop that can be eliminated without affecting acceptance). We associate with each position *i*∈[*n*] the set *μ*(*i*)={*δ*_{k} ∣ 1≤*k*≤*m*, *i*_{k}=*i*}. One can easily verify that *μ* is a min-flow on *w*.

For the other direction, we assume that there is a min-flow *μ* on *w*. We construct the edge-labeled graph *G*_{μ} with vertices in *Q*×[*n*] and edges of the form \(\left ((p,i),(q,j)\right )\) labeled by a transition *δ*, where *i*∈[*n*], *δ*∈*μ*(*i*), *p*=source(*δ*), *q*=target(*δ*), and *j*=reach(*δ*)(*i*). By construction, every vertex of *G*_{μ} has the same in-degree as the out-degree (either 0 or 1), with the only exceptions being the vertex (*q*_{0},1) of in-degree 0 and out-degree 1, and the vertex (*q*_{f},*n*) of in-degree 1 and out-degree 0. One way to construct a successful run of \(\mathcal {A}\) on *w* is to repeatedly choose the *only* vertex *x* in *G*_{μ} with in-degree 0 and out-degree 1, execute the transition *δ* that labels the *only* edge departing from *x*, and remove that edge from *G*_{μ}. This procedure terminates when no edge of *G*_{μ} can be removed and it produces a successful run on *w*. □
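The right-to-left direction can be illustrated by a short Python sketch that follows the unique path of *G*_{μ} starting in (*q*_{0},1). It assumes that `mu` already satisfies Definition 6; the `Transition` record, the `reach` callable, and the axes `"+1"` and `"0"` are hypothetical names introduced for the sketch. Disconnected cycles of *G*_{μ}, if any, are simply never visited.

```python
from typing import Callable, Dict, Hashable, List, NamedTuple, Optional, Set, Tuple


class Transition(NamedTuple):
    source: str
    letter: str
    type: Hashable
    target: str
    reach: Hashable  # symbolic axis, interpreted by the `reach` callable


def run_from_min_flow(
    mu: Dict[int, Set[Transition]],
    reach: Callable[[Hashable, int], Optional[int]],
    q0: str,
) -> List[Tuple[str, int]]:
    """Follow the path of G_mu that starts in (q0, 1) and return its configurations."""
    state, pos = q0, 1
    run = [(state, pos)]
    # Each edge of G_mu is traversed at most once, so this bound suffices.
    budget = sum(len(ts) for ts in mu.values())
    for _ in range(budget):
        out = [d for d in mu[pos] if d.source == state]
        if not out:
            break  # out-degree 0: for a min-flow this is the vertex (qf, n)
        (d,) = out  # condition 4 guarantees a unique outgoing edge (else ValueError)
        nxt = reach(d.reach, pos)
        if nxt is None:
            raise ValueError("transition with undefined reach in the flow")
        state, pos = d.target, nxt
        run.append((state, pos))
    return run


# Usage with hypothetical axes "+1" (next position) and "0" (stay) on a word of length 2.
def reach(alpha: Hashable, j: int, n: int = 2) -> Optional[int]:
    if alpha == "0":
        return j
    return j + 1 if alpha == "+1" and j < n else None


d1 = Transition("q0", "a", "t", "q1", "+1")
d2 = Transition("q1", "b", "t", "qf", "0")
print(run_from_min_flow({1: {d1}, 2: {d2}}, reach, "q0"))
# [('q0', 1), ('q1', 2), ('qf', 2)]
```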

Since min-flows are special forms of tilings, CMA can guess them and hence:

**Theorem 4**

*Given a DWA, one can construct an equivalent CMA. In particular, emptiness of DWA is a decidable problem.*

### 6.3 Universality

Here we show that the complement of the language recognized by a DWA is also recognized by a CMA, and hence universality of DWA is reducible to emptiness of CMA. As usual, we fix a DWA \(\mathcal {A}=(Q,{\Sigma },{\Delta },I,F)\), with *I*={*q*_{0}} and *F*={*q*_{f}}, and we assume that \(\mathcal {A}\) halts only on rightmost positions. Below we define max-flows, which, dually to min-flows, can be seen as special forms of tilings witnessing non-acceptance.

**Definition 7**

Let *w*=(*a*_{1},*d*_{1}) … (*a*_{n},*d*_{n}) be a data word of length *n*. A *max-flow on w* is any map \(\nu : [n] \rightarrow 2^{Q}\) that satisfies the following conditions:

1. *q*_{0}∈*ν*(1) and *q*_{f}∉*ν*(*n*),
2. for all positions *i*∈[*n*] and all transitions *δ*∈Δ, if source(*δ*)∈*ν*(*i*), letter(*δ*)=*a*_{i}, and type(*δ*)=type_{w}(*i*), then target(*δ*)∈*ν*(reach(*δ*)(*i*)).

**Lemma 6**

\(\mathcal {A}\) *rejects w iff there is a max-flow ν on w.*

*Proof*

Let \(\rho ~=~ (q_{0},i_{0}) \overset {\textsf {w}}{\longrightarrow } (q_{1},i_{1}) \overset {\textsf {w}}{\longrightarrow } \ldots \: (q_{m},i_{m})\) be a partial run of \(\mathcal {A}\) on *w* starting in the initial state. It is easy to verify, e.g. by induction on the length *m* of *ρ*, that every max-flow *ν* on *w* contains *ρ* in the sense that *q*_{k}∈*ν*(*i*_{k}) for all indices 0≤*k*≤*m*. This means that if \(\mathcal {A}\) has a successful run on *w*, then there is no max-flow on *w*.

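Next assume that \(\mathcal {A}\) has no successful run on *w*. Consider the smallest map \(\nu : [n] \rightarrow 2^{Q}\) containing all the partial runs of \(\mathcal {A}\) on *w* that start in the initial state, that is, *q*∈*ν*(*i*) iff the configuration (*q*,*i*) is reachable from (*q*_{0},1). Clearly, *q*_{0}∈*ν*(1) and *ν* satisfies the second condition of Definition 7; moreover, *q*_{f}∉*ν*(*n*), since otherwise \(\mathcal {A}\) would have a successful run on *w*. Hence *ν* is a max-flow on *w*, which proves the left-to-right direction of the lemma. □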
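The smallest candidate for a max-flow can be computed by a standard saturation (reachability) procedure, as in the following Python sketch. The `Transition` record, the callables `type_w` and `reach`, and the convention that a transition with undefined reach is skipped are assumptions of the sketch; the axes `"+1"` and `"0"` in the usage example are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, NamedTuple, Optional, Set


class Transition(NamedTuple):
    source: str
    letter: str
    type: Hashable
    target: str
    reach: Hashable  # symbolic axis, interpreted by the `reach` callable


def smallest_flow(
    delta: List[Transition],
    letters: List[str],
    type_w: Callable[[int], Hashable],
    reach: Callable[[Hashable, int], Optional[int]],
    q0: str,
) -> Dict[int, Set[str]]:
    """Least map nu with q0 in nu(1) that is closed under condition 2 of a max-flow."""
    n = len(letters)
    nu: Dict[int, Set[str]] = {i: set() for i in range(1, n + 1)}
    nu[1].add(q0)
    worklist = [(q0, 1)]
    while worklist:
        p, i = worklist.pop()
        for d in delta:
            if d.source == p and d.letter == letters[i - 1] and d.type == type_w(i):
                j = reach(d.reach, i)
                # Assumption: a transition whose reach is undefined cannot be taken.
                if j is not None and d.target not in nu[j]:
                    nu[j].add(d.target)
                    worklist.append((d.target, j))
    return nu


def rejects(delta, letters, type_w, reach, q0: str, qf: str) -> bool:
    """A max-flow exists iff the smallest closed map avoids qf at the last position."""
    return qf not in smallest_flow(delta, letters, type_w, reach, q0)[len(letters)]


# Usage with hypothetical axes "+1" (next position) and "0" (stay) on a word of length 2.
def reach(alpha: Hashable, j: int, n: int = 2) -> Optional[int]:
    if alpha == "0":
        return j
    return j + 1 if alpha == "+1" and j < n else None


d1 = Transition("q0", "a", "t", "q1", "+1")
d2 = Transition("q1", "b", "t", "qf", "0")
print(rejects([d1, d2], ["a", "b"], lambda i: "t", reach, "q0", "qf"))  # False
print(rejects([d1], ["a", "b"], lambda i: "t", reach, "q0", "qf"))      # True
```

The sketch decides rejection of a single given word; the point of the lemma is rather that the existence of such a map is a local property that a CMA can guess and verify.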

We obtain that CMA capture complements of languages recognized by DWA:

**Theorem 5**

*Given a non-deterministic DWA* \(\mathcal {A}\) *recognizing L, one can construct a CMA* \(\mathcal {A}^{\prime }\) *that recognizes the complement of L. In particular, universality of DWA is a decidable problem.*

### 6.4 Containment and Other Problems

We conclude by mentioning a few interesting decidability results that follow directly from Theorems 4 and 5 and from the closure properties of CMA under union and intersection. The first result concerns the decidability of containment/equivalence of DWA. The second result concerns the property of a language of being *invariant under tree encodings*, namely, of being of the form *L*^{enc} for some language *L* of trees.

**Corollary 5**

*Given two non-deterministic DWA* \(\mathcal {A}\) *and* \(\mathcal {B}\)*, one can decide whether* \(\mathcal {L}(\mathcal {A})\subseteq \mathcal {L}(\mathcal {B})\)*. More generally, one can decide emptiness of every Boolean combination of languages recognized by non-deterministic DWA.*

*Proof*

Let *L* be a Boolean combination of languages recognized by non-deterministic DWA. Without loss of generality, we can assume that *L* is in disjunctive normal form, that is, a finite union of finite intersections of languages of the form *L*_{i,j} or \(\bar {L}_{i,j}\), where each *L*_{i,j} (resp. \(\bar {L}_{i,j}\)) is a language recognized by a non-deterministic DWA \(\mathcal {A}_{i,j}\) (resp. the complement of a language recognized by a non-deterministic DWA \(\bar {\mathcal {A}}_{i,j}\)). In view of Theorems 4 and 5, one can construct suitable CMA \(\mathcal {C}_{i,j}\) and \(\bar {\mathcal {C}}_{i,j}\) recognizing *L*_{i,j} and \(\bar {L}_{i,j}\), respectively. Finally, closure of CMA under unions and intersections implies that *L* is recognized by a CMA, for which emptiness can be decided. □
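For the special case of containment, the Boolean combination at hand can be spelled out as follows. This is a routine set-theoretic identity, added here for illustration; the names \(\mathcal {C}_{\mathcal {A}}\) and \(\bar {\mathcal {C}}_{\mathcal {B}}\) are assumed to denote the CMA obtained from \(\mathcal {A}\) via Theorem 4 and from \(\mathcal {B}\) via Theorem 5, respectively.

\[
\mathcal {L}(\mathcal {A}) \subseteq \mathcal {L}(\mathcal {B})
\quad \Longleftrightarrow \quad
\mathcal {L}(\mathcal {A}) \cap \overline{\mathcal {L}(\mathcal {B})} = \emptyset
\quad \Longleftrightarrow \quad
\mathcal {L}(\mathcal {C}_{\mathcal {A}}) \cap \mathcal {L}(\bar {\mathcal {C}}_{\mathcal {B}}) = \emptyset
\]

Hence deciding containment amounts to deciding emptiness of the intersection of two CMA.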

**Corollary 6**

*Given a non-deterministic DWA* \(\mathcal {A}\)*, one can decide whether* \(\mathcal {L}(\mathcal {A})\) *is invariant under tree encodings.*

*Proof*

We briefly explain how to reduce the problem of deciding invariance under tree encodings to a containment problem between DWA. We reuse some of the notation that we introduced in Section 4. Let \(L=\mathcal {L}(\mathcal {A})\) for some non-deterministic DWA \(\mathcal {A}\). We have that *L* is invariant under tree encodings iff (i) \(L\subseteq U^{\textsf {enc}}\), where *U* is the regular language of all trees, and (ii) for all data words \(w,w^{\prime }\) such that \(\textsf {tree}(w)=\textsf {tree}(w^{\prime })\), *w*∈*L* iff \(w^{\prime }\in L\). The first condition is a simple containment between DWA. Checking the second condition reduces to transforming \(\mathcal {A}\) into a TWA \(\mathcal {A}^{\textsf {tree}}\) such that \(\mathcal {L}(\mathcal {A}^{\textsf {tree}}) ~=~ \left \{t ~\mid ~ \textsf {enc}(t)\in L\right \}\), then turning \(\mathcal {A}^{\textsf {tree}}\) back to a DWA \(\mathcal {A}^{\prime }\) such that \(\mathcal {L}(\mathcal {A}^{\prime }) ~=~ \mathcal {L}(\mathcal {A}^{\textsf {tree}})^{\textsf {enc}} ~=~ \left \{w ~\mid ~ \textsf {enc}(\textsf {tree}(w))\in L\right \}\) (\(~\supseteq ~ \mathcal {L}(\mathcal {A})\)), and finally deciding whether \(\mathcal {L}(\mathcal {A}^{\prime })\subseteq \mathcal {L}(\mathcal {A})\). □

## 7 Undecidable Extensions of DWA

In this section we consider some natural extensions of DWA, specifically alternating DWA and DWA with pebbles, and we show that they quickly lead to undecidable emptiness problems. *Alternating DWA* are defined, as expected, by partitioning the set of states into existential and universal ones and by formulating acceptance as a winning condition in a two-player game (infinite plays are seen as rejecting runs). *Pebble DWA* are the analogue of tree walking pebble automata [6] for data words: like DWA, they can move along global/class predecessors/successors and, in addition, they can drop a pebble from a fixed finite set at a currently visited position, they can lift a pebble from the current position, and they can test whether the current position is marked with a pebble.

**Proposition 8**

*Emptiness of alternating DWA is undecidable.*

*Proof*

We reduce from Post’s correspondence problem (PCP). Recall that an instance of PCP is a finite set of pairs (*u*_{i},*v*_{i}) for *i*=1,…,*n*, with *n*>0 and *u*_{i},*v*_{i} words over an alphabet Σ. We introduce a new alphabet Γ=Σ⊎{1,…,*n*}⊎{*#*} and we encode a solution \(u_{i_{1}} \cdot \ldots \cdot u_{i_{m}}=v_{i_{1}} \cdot \ldots \cdot v_{i_{m}}\) (*m*>0) of the PCP instance by means of a data word \(w\in ({\Gamma }\times \mathbb {D})^{*}\), such that:

1. the projection of *w* onto Γ is the string \(i_{1}\cdot u_{i_{1}} \cdot \ldots \cdot i_{m}\cdot u_{i_{m}} \cdot \# \cdot i_{1}\cdot v_{i_{1}} \cdot \ldots \cdot i_{m}\cdot v_{i_{m}}\),
2. the data value associated with *#* occurs exactly once, while the other data values, which are associated with symbols in Σ⊎{1,…,*n*}, occur exactly twice, once to the left and once to the right of the separator *#*,
3. any two positions with equal data value carry the same symbol from Γ,
4. the sequence of data values associated with symbols in Σ (resp., in {1,…,*n*}) occurring to the left of *#* coincides with the sequence of data values associated with symbols in Σ (resp., in {1,…,*n*}) occurring to the right of *#*.

Let *L* be the language of all data word encodings of solutions of the PCP instance. Below, we show that *L* can be recognized by an alternating DWA, which implies that the considered PCP problem reduces to non-emptiness of *L*.
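To make the encoding concrete, the following Python sketch builds a data word satisfying conditions 1 to 4 from a given PCP solution. The function name `encode_solution`, the use of integers for the symbols in {1,…,*n*} and of one-character strings for the symbols in Σ⊎{*#*}, and the choice of consecutive integers as data values are all assumptions of the sketch.

```python
from itertools import count
from typing import List, Sequence, Tuple, Union

Symbol = Union[int, str]  # indices are ints, letters and '#' are strings


def encode_solution(
    pairs: Sequence[Tuple[str, str]], indices: Sequence[int]
) -> List[Tuple[Symbol, int]]:
    """Encode the solution i_1 ... i_m (1-based indices into `pairs`) as a data word."""
    top = "".join(pairs[i - 1][0] for i in indices)
    bottom = "".join(pairs[i - 1][1] for i in indices)
    if not indices or top != bottom:
        raise ValueError("the index sequence is not a solution of the instance")
    fresh = count(1)
    index_data = [next(fresh) for _ in indices]  # one data value per index occurrence
    letter_data = [next(fresh) for _ in top]     # one data value per letter occurrence
    sep_data = next(fresh)                       # data value used only by '#'

    def half(side: int) -> List[Tuple[Symbol, int]]:
        word: List[Tuple[Symbol, int]] = []
        used = 0  # number of letters of the concatenation emitted so far
        for k, i in enumerate(indices):
            word.append((i, index_data[k]))
            for c in pairs[i - 1][side]:
                word.append((c, letter_data[used]))
                used += 1
        return word

    return half(0) + [("#", sep_data)] + half(1)


# Hypothetical instance with pairs (a, ab) and (ba, a); the sequence 1, 2 is a
# solution because a.ba = ab.a = aba.
w = encode_solution([("a", "ab"), ("ba", "a")], [1, 2])
print(w)
```

Since the two concatenations coincide, reusing the *k*-th letter data value on both sides automatically yields condition 3, while the shared lists `index_data` and `letter_data` yield conditions 2 and 4.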

The string projection of *L* onto Γ is a regular language of the form \(\{i\cdot u_{i} ~\mid ~ 1\le i\le n\}^{+} ~ \# ~ \{i\cdot v_{i} ~\mid ~ 1\le i\le n\}^{+}\). This means that the first condition that defines a data word encoding can be checked by a deterministic DWA. The second and third conditions are also easily checked by deterministic DWA with access to local types.

It remains to describe a DWA that checks the last condition by exploiting alternation. For this, it is sufficient to consider only the subsequence of data values associated with symbols in Σ to the left and to the right of *#*. More precisely, starting from the leftmost Σ-labeled position of the input data word *w*, the automaton repeatedly performs the following sequence of moves, until the rightmost Σ-labeled position is reached: from a position *i*, it first moves universally to some Σ-labeled position *j*>*i* before the occurrence of *#*, then it moves to the class successor *j*⊕1, reaches universally some Σ-labeled position *k*>*j*⊕1, and moves to the class predecessor *k*⊖1. If the input word *w* is a valid encoding of a solution of the PCP instance, and in particular if *w* satisfies condition 4 above, then the automaton eventually halts in the rightmost position of *w*. Otherwise, if *w* does not satisfy condition 4, then there exist a position *j* to the left of *#* and a position *k* to the right of *#* such that *j*⊕1<*k* and *k*⊖1<*j*. This means that the automaton admits an infinite run that cycles between positions *j* and *k*, and thus rejects the input word *w*. □

**Proposition 9**

*Emptiness of pebble DWA is undecidable.*

*Proof*

The proof is a variant of that of Proposition 8. Given an instance of the PCP problem, we define the language *L* of all encodings of solutions of this instance. The data language *L* can be equally recognized by a deterministic DWA with a single pebble. As before, the first three conditions that define membership of a data word *w* in *L* can be checked by deterministic DWA without pebbles, while the last condition requires the use of a pebble, since it concerns the order of the data values associated with symbols in Σ⊎{*#*}. Specifically, if we denote by \(w^{\prime }\) the subsequence of *w* obtained by selecting the positions labeled over Σ⊎{*#*}, then checking the last condition amounts to verifying that, in \(w^{\prime }\), every position *i* to the left of *#* satisfies \(\left (\left ((i\oplus 1)+1\right )\ominus 1\right )-1 ~=~ i\). This test can be directly performed on the input data word *w* by a deterministic automaton that executes the following steps: it places a pebble at each position *i* to the left of *#*, it moves first along axis ⊕1 and then to the right reaching the next Σ-labeled position (if there is no such position, it backtracks to position *i* and accepts iff the next symbol is *#*); then it moves along the axis ⊖1 and again to the left to the previous Σ-labeled position, where it checks the presence of the pebble (if not, the automaton halts and rejects); finally, it lifts the pebble and continues the computation with the next Σ-labeled position *i*+1, until the separator *#* is reached. □
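The identity \(\left (\left ((i\oplus 1)+1\right )\ominus 1\right )-1 ~=~ i\) can be checked directly on \(w^{\prime }\), as in the following Python sketch. The sketch does not simulate the pebble automaton: it stores *i* in a variable, which is the role played by the pebble. It assumes that \(w^{\prime }\) is given as a list of (symbol, data value) pairs with 0-based positions and that the first three conditions of the encoding already hold; the function names are hypothetical.

```python
from typing import Hashable, List, Optional, Tuple

Word = List[Tuple[str, Hashable]]


def class_succ(w: Word, i: int) -> Optional[int]:
    """Next position after i carrying the same data value, if any."""
    return next((j for j in range(i + 1, len(w)) if w[j][1] == w[i][1]), None)


def class_pred(w: Word, i: int) -> Optional[int]:
    """Previous position before i carrying the same data value, if any."""
    return next((j for j in range(i - 1, -1, -1) if w[j][1] == w[i][1]), None)


def order_preserved(w_prime: Word) -> bool:
    """Check the identity for every position to the left of '#'."""
    sep = next(k for k, (a, _) in enumerate(w_prime) if a == "#")
    for i in range(sep):                  # i plays the role of the pebbled position
        j = class_succ(w_prime, i)        # i (+)1
        if j is None:
            return False
        if j + 1 >= len(w_prime):
            # No next position: accept this step iff the symbol after i is '#'.
            if i + 1 != sep:
                return False
            continue
        k = class_pred(w_prime, j + 1)    # ((i (+)1) + 1) (-)1
        if k is None or k - 1 != i:       # the final step -1 must land on i
            return False
    return True


good = [("a", 3), ("b", 4), ("a", 5), ("#", 6), ("a", 3), ("b", 4), ("a", 5)]
bad = [("a", 3), ("b", 4), ("a", 5), ("#", 6), ("b", 4), ("a", 3), ("a", 5)]
print(order_preserved(good), order_preserved(bad))  # True False
```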

## 8 Discussion

We showed that the model of walking automaton can be adapted to data words in order to define robust families of data languages. We studied the complexity of the fundamental problems of word acceptance, emptiness, universality, and containment (quite remarkably, all these problems are shown to be decidable). We also analyzed the relative expressive power of the deterministic and non-deterministic models of Data Walking Automata, comparing them with other classes of automata that appeared in the literature (most notably, Data Automata and Class Memory Automata). In this respect, we proved that deterministic DWA, non-deterministic DWA, and CMA form a strictly increasing hierarchy of data languages, where the top ones are functional projections of the bottom ones.

It follows from our results that DWA satisfy properties analogous to those satisfied by Tree Walking Automata – for instance, deterministic DWA, like deterministic TWA, are effectively closed under all Boolean operations, are strictly less expressive than Tiling Automata, and are not closed under functional projections.

We also know that DWA are incomparable with one-way non-deterministic Register Automata [8]: on the one hand, DWA can check that all data values are distinct, whereas Register Automata cannot; on the other hand, Register Automata can recognize languages of data strings that do not encode valid runs of Turing machines, while Data Walking Automata cannot, as otherwise universality would become undecidable. Variants of DWA can also be considered, for instance, by adding registers, pebbles, alternation, or nesting. Unfortunately, none of these extensions yields a decidable emptiness problem. As an example, we have shown that the use of alternation or pebbles in DWA allows one to easily encode positive instances of Post’s correspondence problem, thus implying undecidability of emptiness.

- *Are non-deterministic DWA closed under complementation?*
- *Do DWA capture all languages definable in FO*^{2}*[Σ,<,⊕1], i.e. the two-variable fragment of first-order logic with access to the letters in Σ, the linear order < on positions, and the class successor predicate ⊕1? Similarly, do DWA capture all languages definable in Basic Data LTL?*

^{2}[Σ,<,⊕1] and conjectured to be not recognizable by DWA:

## Notes

### Acknowledgments

The first author thanks Thomas Colcombet for detailed discussions and acknowledges that some of the ideas were inspired during these. The second author acknowledges Mikołaj Bojańczyk and Thomas Schwentick for detailed discussions about the relationship between DWA and Data Automata. The authors are also grateful to the anonymous referees for the many helpful remarks on the paper.

### References

1. Björklund, H., Schwentick, T.: On notions of regularity for data languages. Theor. Comput. Sci. **411**(4-5), 702–715 (2010)
2. Bojańczyk, M., Colcombet, T.: Tree-walking automata cannot be determinized. Theor. Comput. Sci. **350**(2-3), 164–173 (2006)
3. Bojańczyk, M., Colcombet, T.: Tree-walking automata do not recognize all regular languages. SIAM J. Comput. **38**(2), 658–701 (2008)
4. Bojańczyk, M., David, C., Muscholl, A., Schwentick, T., Segoufin, L.: Two-variable logic on data words. ACM Trans. Comput. Log. **12**(4), 27 (2011)
5. Bojańczyk, M., Muscholl, A., Schwentick, T., Segoufin, L.: Two-variable logic on data trees and XML reasoning. J. Assoc. Comput. Mach. **56**(3) (2009)
6. Engelfriet, J., Hoogeboom, H.: Tree-walking pebble automata. In: Jewels are forever, contributions to Theoretical Computer Science in honor of Arto Salomaa, pp. 72–83. Springer (1999)
7. Holzer, M., Kutrib, M.: Descriptional and computational complexity of finite automata: a survey. Inf. Comput. **209**(3), 456–470 (2011)
8. Kaminski, M., Francez, N.: Finite-memory automata. Theor. Comput. Sci. **134**(2), 329–363 (1994)
9. Kara, A., Schwentick, T., Zeume, T.: Temporal logics on words with multiple data values. In: Proceedings of the IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), pp. 481–492. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2010)
10. Libkin, L., Vrgoc, D.: Regular expressions for data words. In: Proceedings of the 18th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR), *LNCS*, vol. 7180, pp. 274–288. Springer (2012)
11. Manuel, A., Zeume, T.: Two-variable logic on 2-dimensional structures. In: Proceedings of the 22nd EACSL Annual Conference on Computer Science Logic (CSL), *LIPIcs*, vol. 23, pp. 484–499. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2013)
12. McNaughton, R., Papert, S.: Counter-free automata. MIT (1971)
13. Neven, F., Schwentick, T., Vianu, V.: Finite state machines for strings over infinite alphabets. ACM Trans. Comput. Log. **5**(3), 403–435 (2004)
14. Rabin, M., Scott, D.: Finite automata and their decision problems. IBM J. Res. Dev. **3**(2), 114–125 (1959)
15. Schwentick, T., Zeume, T.: Two-variable logic with two order relations. In: Proceedings of the 19th EACSL Annual Conference on Computer Science Logic (CSL), *LNCS*, vol. 6247, pp. 499–513. Springer (2010)
16. Sipser, M.: Halting space-bounded computations. Theor. Comput. Sci. **10**, 335–338 (1980)
17. Thomas, W.: Elements of an automata theory over partial orders. In: Partial Order Methods in Verification, pp. 25–40. American Mathematical Society (1997)
18. Vollmer, H.: Introduction to Circuit Complexity: a uniform approach. Texts in Theoretical Computer Science. An EATCS Series. Springer (1999)