
1 Introduction

Reinforcement learning [51] (RL) is a sampling-based approach to synthesis, capable of producing solutions with superhuman efficiency [10, 44, 49]. An RL agent interacts with its environment through episodic interactions while receiving scalar rewards as feedback for its performance. Following the explicit/symbolic dichotomy of model checking approaches, the interactions in classic RL can be characterized as explicit: each episode consists of a sequence of experiences in which the agent chooses an action from a concrete state, observes the next concrete state, and receives an associated reward for this explicit interaction.

We envision a symbolic approach to RL, where each experience may deal with a set of states represented by a predicate, and the evolution of the system is described by predicate transformers. When the state space is large, symbolic representations may lead to greater efficiency and better generalization. Moreover, there are systems with naturally succinct representations—such as factored MDPs [27, 32], succinct MDPs [23], and Petri nets [9]—that can benefit from symbolic manipulation of states.

The concept of symbolic interactions with an environment differs significantly from typical approximation methods used in RL, such as linear approximations [51] or deep neural networks [31]. In the context of such techniques, a learning agent attempts to generalize observations based on perceived similarities between them. In the symbolic setting, however, the generalization of a given interaction is explicitly provided by the environment itself. As a result, symbolic interactions facilitate a more direct form of generalization, where the environment ensures that similar interactions lead to similar outcomes.

This paper presents regular reinforcement learning (RRL), a symbolic approach to RL that employs regular languages and rational transductions, respectively, as models of predicates and their transformations. While natural languages can be used to encode symbolic interactions, we use regular languages [50] for the following reasons: (1) Regular languages enable unambiguous representation of predicates and predicate transformers. (2) Regular languages possess elegant theoretical properties including the existence of minimal canonical automata, determinizability, closure under many operations, and decidable emptiness and containment. (3) Regular languages hold a special position in machine learning, enjoying numerous efficient learnability results and active learning algorithms.

Regular languages also form the basis of a class of powerful symbolic model checking algorithms for infinite-state systems known as regular model checking (RMC) [2, 5, 13, 18]. The following example introduces the concepts of RRL through a variation on the canonical token passing protocol used in the RMC literature [5, 18].

Example 1

(Token Passing). The token passing protocol involves an arbitrary number of processes arranged in a linear topology and indexed by consecutive natural numbers. At any point in time, each process can be in one of two states: t if it has a token or n if it does not. The states of the system are then strings over the alphabet \(\left\{ t,n \right\} \). The initial state, in which only the leftmost process has a token, is the regular language \(tn^*\).

At each time step, an agent chooses an action from the set \(\left\{ a, b, c \right\} \), and each of these actions corresponds to one of the following outcomes.

(a):

Each even-indexed process with a token passes it to the right. Each odd-indexed process with a token passes a copy of it to the right.

(b):

Each odd-indexed process with a token passes it to the right. Each even-indexed process with a token passes a copy of it to the right.

(c):

The outcome of a (resp. b) occurs with probability p (resp. \((1-p)\)).

Figure 1 depicts finite-state transducers corresponding to actions a and b. The property to be verified is that exactly one process possesses a token at any given time. The essence of this property may be captured by a reward function \(R : 2^{\left\{ t, n \right\} ^*} \rightarrow \mathbb {R}\) defined such that

$$\begin{aligned} R(L) = \begin{cases} 0 &\text {if } L \subseteq n^* t n^*, \\ -1 &\text {otherwise.} \end{cases} \end{aligned}$$

From the initial configuration \(tn^*\), action a moves the system to state \(ntn^*\) and incurs a reward of 0, while action b transitions the system to the configuration \(ttn^*\) and incurs a reward of \(-1\). The optimal policy selects action a when the token is held by an even-indexed process and action b otherwise.

In RRL, the agent initially chooses an action that it deems appropriate for the state \(tn^*\). The environment then returns a language obtained by applying either transducer \(T_{a}\) or transducer \(T_{b}\) to transform \(tn^*\), depending on the agent’s choice. From \(tn^*\), the two possible languages are \(ntn^*\) and \(ttn^*\). The environment also assigns a reward to the agent. Repeated interactions of this type result in a sequence of states (regular languages) and rewards. The goal of the agent is to learn a policy (a function from regular languages to actions) that maximizes the cumulative reward.
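To make the dynamics of Example 1 concrete, the following Python sketch simulates actions a and b on a single configuration word and evaluates the word-level analogue of the reward R. It is purely illustrative: the paper operates on whole regular languages via transducers, whereas here a configuration is one word over \(\left\{ t, n \right\} \), and treating a token pushed past the rightmost process as a rejected move is our own assumption.

```python
def step(word, action):
    """Apply action 'a' or 'b' to a configuration such as 'tnnn'.
    Returns None if a token would be pushed past the rightmost process
    (our convention for a rejected move)."""
    passing_parity = 0 if action == "a" else 1   # parity that passes its token on
    new = ["n"] * len(word)
    for i, symbol in enumerate(word):
        if symbol != "t":
            continue
        if i % 2 != passing_parity:              # this process keeps a copy
            new[i] = "t"
        if i + 1 >= len(word):                   # token falls off the right end
            return None
        new[i + 1] = "t"                         # right neighbour receives a token
    return "".join(new)

def reward(word):
    """0 if exactly one process holds a token, -1 otherwise."""
    return 0 if word.count("t") == 1 else -1

config = "tnnn"
for act in ["a", "b", "a"]:                      # the optimal choices here
    config = step(config, act)
    print(act, config, reward(config))           # tnnn -> ntnn -> nntn -> nnnt, all reward 0
```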

Fig. 1.

An edge from \(q_0\) to \(q_2\) labeled by \(t \backslash n\) denotes that if the transducer reads the symbol t from state \(q_0\), then it outputs the symbol n and moves to \(q_2\). Double-circled states are accepting. Such a machine is understood to produce outputs only for inputs that, once completely processed, leave the transducer in an accepting state.

Since there are infinitely many regular languages, the system described in Example 1 gives rise to an infinite-state decision process. As there is no known convergent RL algorithm for infinite-state environments in general, this prohibits the direct use of tabular RL algorithms for RRL. In regular model checking, techniques exist to address the difficulties of dealing with infinite state spaces, such as widening and acceleration. In regular reinforcement learning, we will leverage advances in graph neural networks to tackle this challenge.

Contributions. As in RMC, the primary application of language-theoretic modeling in RRL is the symbolic representation of states and transitions in the underlying system. We formalize RL environments that are constructed according to this principle under the name regular Markov decision processes (RMDPs). These environments generalize the systems modeled in RMC by incorporating controllable dynamics (through the agent selecting actions) and stochastic transition dynamics. Figure 2 provides a visual depiction of the similarities and differences between system transitions in RMC vs. RRL.

Fig. 2.

Illustration of the difference between transitions in RMC and RMDPs.

We provide a theoretical analysis of various aspects of RMDPs, focusing on issues related to decidability, finiteness, and approximability of optimal policies. This clarifies the basic limits of RRL and helps in determining when standard RL methods can or cannot be adapted to this setting. In particular, we establish the following results in Sect. 4.

  • The optimal expected payoff, known as the value, of a given RMDP under an arbitrary payoff function is not computable.

  • For any RMDP with computable rewards and transition probabilities, the value under a discounted payoff is approximately computable.

  • For any RMDP with computable rewards and transition probabilities, the value under a discounted payoff is PAC-learnable.

  • We identify several conditions under which an RMDP remains finite and present a Q-learning algorithm for such situations.

After this, we turn our attention toward practical applications of RRL. In Sect. 5 we propose a formulation of deep RRL. By representing regular languages as finite-state automata and viewing automata as labeled directed graphs, we are able to exploit graph neural networks for approximating optimal values and policies. Graph neural networks [56] are neural network architectures that process graphs as input, typically by performing repeated message passing of vectors over the graph’s structure. We demonstrate through a collection of experimental case studies that deep RRL is both natural and effective.

2 Related Work

Regular Model Checking. RMC [3, 5, 18, 55] is a verification framework based on symbolically encoding system states and transitions as regular languages and rational transductions, respectively. Despite the relative simplicity of rational transductions, allowing their arbitrary iteration produces a Turing-complete model of computation. Consequently, significant effort has been put into methods to approximate the transitive closures of rational transductions, and to compute them exactly in special cases [17, 38, 52].

In particular, incorporating automata learning techniques into the RMC toolbox [33, 43, 45] has shown promise. There is also significant work on improving the framework’s expressive capabilities by extending RMC to enable the use of \(\omega \)-regular languages [13, 40, 41], regular tree languages [4, 6, 16, 19], and more powerful types of transductions [26]. RMC and its various extensions have been successfully applied to the verification of safety and liveness properties in a variety of settings related to mutual exclusion protocols, programs operating on unbounded data structures [14, 15], lossy channel systems [8], and additive dynamical systems over numeric domains [11, 12]. To the best of our knowledge, this paper is the first to combine deep reinforcement learning with regular model checking.

Regular Languages and Reinforcement Learning. The use of regular languages in RL has become increasingly popular to meet the increasing demand for structured, principled representations in neuro-symbolic artificial intelligence. The work closest to our own employs regular languages as a mechanism for modeling aspects of environments with certain kinds of non-Markovian, or history-dependent, dynamics.

Regular Decision Processes [20, 21] are the topic of a recent line of research at the intersection of language-theoretic regularity and sequential optimization. A regular decision process is a finite state probabilistic transition system—much like a traditional Markov decision process (MDP)—except that transition probabilities and rewards are dependent on some regular property of the history. Note that while regular decision processes provide a succinct modeling framework for a subclass of non-Markovian optimization problems, they can be converted to larger, but semantically equivalent, finite-state MDPs. In contrast, the RMDPs introduced in this paper are not generally equivalent to finite MDPs. Considerable work has been done to develop the theory and practice around regular decision processes, including design and analysis of inference algorithms [1], learning efficiency analysis [47], and empirical evaluations of specific modeling tasks [42].

Reward Machines [34,35,36] are finite state machines used in modeling reward signals in decision processes with non-Markovian, but regular, reward dynamics. Attention to the topic has resulted in inference algorithms for learning reward machines in partially observable MDPs [37], methods for jointly learning reward machines and corresponding optimal policies [57], adaptations of active grammatical inference algorithms like \(\text {L}^\star \) for reward machine inference [58], generalization to probabilistic machines modeling stochastic reward signals [24, 29], applications to robotics [22], and more [25, 59].

3 Preliminaries

Let \(\mathbb {N}\) and \(\mathbb {R}\) denote, respectively, the natural numbers and the real numbers. For a set X, we write \(2^X\) to denote its powerset and \(\left| X \right| \) to denote its cardinality.

3.1 Regular Languages

An alphabet \(\varSigma \) is a finite set of symbols, and a word w over \(\varSigma \) is a finite string of its symbols. The length \(\left| w \right| \) of a word w is the number of its constituent symbols. The empty word, of length 0, is denoted by \(\varepsilon \). We write \(\varSigma ^n\) for the set of all words of length n. Further, let \(\varSigma ^{\le n} = \bigcup _{k=0}^n \varSigma ^k\) be the set of all strings of length at most n and let \(\varSigma ^* = \bigcup _{n=0}^\infty \varSigma ^n\) be the set of all words over \(\varSigma \). A subset \(L \subseteq \varSigma ^*\) is called a language.

Definition 1

(FSA). A (nondeterministic) finite-state automaton (FSA) \(\mathcal {A}\) is given by a tuple \(\left\langle \varSigma , Q, q_0, F, \delta \right\rangle \), where \(\varSigma \) is an alphabet, Q is a finite set of states, \(q_0 \in Q\) is a distinguished initial state, \(F \subseteq Q\) is a set of final states, and \(\delta : Q \times \varSigma \rightarrow 2^Q\) is a transition function.

The transition function \(\delta \) may be extended to \(\delta ^* : Q \times \varSigma ^* \rightarrow 2^Q\) such that \(\delta ^*(q, \varepsilon ) = \left\{ q \right\} \) and \(\delta ^*(q, \sigma w) = \bigcup _{q' \in \delta (q, \sigma )} \delta ^*(q', w)\). The semantics of an FSA \(\mathcal {A}\) are given by a language

$$\begin{aligned} L_\mathcal {A} = \left\{ w \in \varSigma ^* : \delta ^*\left( q_0, w \right) \cap F \ne \emptyset \right\} , \end{aligned}$$

and we say that \(\mathcal {A}\) recognizes \(L_\mathcal {A}\).

A language is regular if it is recognized by an FSA.
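As a small illustration of Definition 1, the following Python sketch implements a set-based FSA with the extended transition function \(\delta ^*\) and the acceptance check; the class name and the encoding of \(\delta \) as a dictionary are our own choices, not part of the paper.

```python
class FSA:
    """A nondeterministic finite-state automaton <Sigma, Q, q0, F, delta>
    following Definition 1 (illustrative sketch; dict-encoded delta)."""

    def __init__(self, alphabet, states, initial, finals, delta):
        self.alphabet, self.states = set(alphabet), set(states)
        self.initial, self.finals = initial, set(finals)
        self.delta = delta                      # (state, symbol) -> set of states

    def delta_star(self, state, word):
        """Extended transition function delta*(q, w)."""
        current = {state}
        for symbol in word:
            current = set().union(*(self.delta.get((q, symbol), set())
                                    for q in current))
        return current

    def accepts(self, word):
        """w is in L_A iff delta*(q0, w) meets F."""
        return bool(self.delta_star(self.initial, word) & self.finals)

# The automaton recognizing n* t n* (exactly one token), used by the reward R.
one_token = FSA(alphabet="tn", states={"p", "q"}, initial="p", finals={"q"},
                delta={("p", "n"): {"p"}, ("p", "t"): {"q"}, ("q", "n"): {"q"}})
print(one_token.accepts("nntn"), one_token.accepts("ntt"))   # True False
```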

3.2 Rational Transductions

Let \(\varSigma \) and \(\varGamma \) be alphabets. A mapping \(\theta : \varSigma ^* \rightarrow 2^{\varGamma ^*}\), or equivalently a relation over \(\varSigma ^* \times \varGamma ^*\), is called a transduction. For a language L and a transduction \(\theta : \varSigma ^* \rightarrow 2^{\varGamma ^*}\), let \(\theta (L) = \bigcup _{x \in L} \theta (x)\). The domain of \(\theta : \varSigma ^* \rightarrow 2^{\varGamma ^*}\) is given as \(\textrm{dom}\left( \theta \right) = \left\{ x \in \varSigma ^* : \theta (x) \ne \emptyset \right\} \) and its image is defined as \(\textrm{im}\left( \theta \right) = \bigcup _{x \in \varSigma ^*} \theta (x)\).

Given a finite set of transductions \(\varTheta \) with type \(\varSigma ^* \rightarrow 2^{\varSigma ^*}\), each finite word \(\theta _1 \dots \theta _n \in \varTheta ^*\) corresponds to the transduction \(\theta _n \circ \dots \circ \theta _1\) where \(\varepsilon \) represents the identity mapping, i.e. \(\varepsilon (x) = x\) for every \(x \in \varSigma ^*\). For convenience, we identify the word \(\theta _1 \dots \theta _n\) with the transduction \(\theta _n \circ \dots \circ \theta _1\) so that \(\theta _1 \dots \theta _n(x) = \theta _n \circ \dots \circ \theta _1(x)\) holds for every \(x \in \varSigma ^*\). The set of languages reachable from a given language L via elements of \(\varTheta ^*\) is called the orbit of \(\varTheta \) on L and is written as

$$\begin{aligned} \textrm{Orb}_{\varTheta }\left( L \right) = \left\{ \tau (L) : \tau \in \varTheta ^* \right\} . \end{aligned}$$

Definition 2

(FST). A (nondeterministic) finite-state transducer (FST) T is given by a tuple \(\left\langle \varSigma , \varGamma , Q, q_0, F, \delta \right\rangle \), where

  • \(\varSigma \) and \(\varGamma \) are input and output alphabets, respectively,

  • Q is a finite set of states,

  • \(q_0 \in Q\) is a distinguished initial state,

  • \(F \subseteq Q\) is a set of final states, and

  • \(\delta : Q \times \varSigma \rightarrow 2^{Q \times \varGamma ^*}\) is a transition function that maps each state-input pair to a set of state-output pairs.

The transition function \(\delta \) may be extended to \(\delta ^* : Q \times \varSigma ^* \rightarrow 2^{Q \times \varGamma ^*}\) such that \(\delta ^*(q, \varepsilon ) = \left\{ \left\langle q, \varepsilon \right\rangle \right\} \) and \(\delta ^*(q, \sigma x) = \left\{ \left\langle q_2, yz \right\rangle : \left\langle q_1, y \right\rangle {\in } \delta (q, \sigma ) \wedge \left\langle q_2, z \right\rangle {\in } \delta ^*(q_1, x) \right\} \). The semantics of T are the transduction \(\llbracket T \rrbracket : \varSigma ^* \rightarrow 2^{\varGamma ^*}\) defined by

$$\begin{aligned} \llbracket T \rrbracket (x) = \left\{ y \in \varGamma ^* : \exists q \in F.\, \left\langle q, y \right\rangle \in \delta ^*(q_0, x) \right\} . \end{aligned}$$

A rational transduction \(\theta \) is one for which there exists an FST T such that \(\theta = \llbracket T \rrbracket \). A rational function \(\theta \) is a rational transduction such that \(\left| \theta (x) \right| \le 1\) for all \(x \in \varSigma ^*\). When discussing rational functions, we write the type as \(\varSigma ^* \rightarrow \varGamma ^*\).
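The following Python sketch mirrors Definition 2: a dictionary-encoded nondeterministic FST with the extended transition function, the semantics \(\llbracket T \rrbracket \), and the lifting \(\theta (L)\) to finite languages. It is an illustration only; the encoding and the example transducer are our own.

```python
class FST:
    """A nondeterministic finite-state transducer <Sigma, Gamma, Q, q0, F, delta>
    in the sense of Definition 2 (illustrative sketch; dict-encoded delta)."""

    def __init__(self, initial, finals, delta):
        self.initial, self.finals = initial, set(finals)
        self.delta = delta        # (state, input symbol) -> set of (state, output word)

    def delta_star(self, state, word):
        """Extended transition function: reachable (state, accumulated output) pairs."""
        configs = {(state, "")}
        for symbol in word:
            configs = {(q2, out + y)
                       for (q1, out) in configs
                       for (q2, y) in self.delta.get((q1, symbol), set())}
        return configs

    def apply_word(self, word):
        """[[T]](x): outputs of runs on x that end in a final state."""
        return {out for (q, out) in self.delta_star(self.initial, word)
                if q in self.finals}

    def apply_language(self, language):
        """theta(L), for a finite language L given as a set of words."""
        return set().union(*(self.apply_word(w) for w in language))

# A length-preserving FST over {t, n} that swaps the two symbols everywhere.
swap = FST(initial="s", finals={"s"},
           delta={("s", "t"): {("s", "n")}, ("s", "n"): {("s", "t")}})
print(swap.apply_language({"tn", "tnn"}))   # {'nt', 'ntt'}
```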

Remark 1

While the term regular language has become standard, there is no universally adopted vocabulary for its relational counterpart. The transductions we qualify here as rational are sometimes qualified alternatively in related work with terms such as regular, FST-definable, GSM-definable, etc.

3.3 Markov Decision Processes

Let \(\textrm{Dist}\left( X \right) \) be the family of all probability distributions over a set X.

Definition 3

(MDP). A Markov decision process (MDP) M is presented by a tuple \(\left\langle S, \hat{s}, A, p, r \right\rangle \), where

  • S is a set of states,

  • \(\hat{s} \in S\) is a distinguished initial state,

  • A is a set of actions,

  • \(p : S \times A \rightarrow \textrm{Dist}\left( S \right) \) is a probabilistic transition function, and

  • \(r : S \times A \rightarrow \mathbb {R}\) is a reward function.

For any states \(s,t \in S\) and action \(a \in A\), we write \(p(t \mid s, a)\) as a shorthand for \(p(s, a)(t)\). We call an MDP finite if both S and A are finite sets.

A policy over an MDP \(M = \left\langle S, \hat{s}, A, p, r \right\rangle \) is a history-dependent function that determines how the next action is stochastically chosen. More formally, a policy is defined as a mapping \(\pi : S \left( AS \right) ^* \rightarrow \textrm{Dist}\left( A \right) \) from the domain of interaction histories to probability distributions over the action space. Let \(\varPi _{M}\) be the set of all policies over the MDP M. Fixing a policy \(\pi \) on M induces a family of probability distributions \(\left\{ \mathbb {P}_\pi ^n : n \in \mathbb {N} \right\} \) on histories \(h = s_1 a_1 \dots s_n a_n\) with \(s_1 = \hat{s}\), where \(\mathbb {P}_\pi ^n(h) = \prod ^{n-1}_{k=1} p(s_{k+1} \mid s_k, a_k) \pi (a_k \mid s_1 a_1 \dots a_{k-1} s_k)\). There exists a unique extension \(\mathbb {P}_\pi \in \textrm{Dist}\left( (SA)^\omega \right) \) of the family \(\left\{ \mathbb {P}_\pi ^n : n \in \mathbb {N} \right\} \) and a corresponding expectation \(\mathbb {E}_\pi \).

An objective over an MDP M with states S and actions A is a real-valued function \(\textsf{J}\) over the domain of infinite real sequences. Whenever the function \(\textsf{J} \circ r\) is \(\mathbb {P}_\pi \)-measurable, the expectation \(\mathbb {E}_\pi (\textsf{J}) = \int \textsf{J} \circ r \text { d}\mathbb {P}_\pi \) is well-defined and can be used to evaluate the quality of the policy \(\pi \) with respect to the environment M. The \(\textsf{J}\)-value of M is defined as \(\textrm{Val}_{\textsf{J}}\left( M \right) = \sup _{\pi \in \varPi _{M}} \mathbb {E}_\pi (\textsf{J})\).

Let \(\textsf{J}\) be a fixed objective function.

  • The \(\textsf{J}\)-value problem asks, given as input (i) an MDP M and (ii) a lower bound b, to decide whether the inequality \(b \le \textrm{Val}_{\textsf{J}}\left( M \right) \) holds.

  • The \(\textsf{J}\)-value is computable if, and only if, there is an algorithm that, given an MDP M as input, returns \(\textrm{Val}_{\textsf{J}}\left( M \right) \).

  • The \(\textsf{J}\)-value is approximable if, and only if, there exists an algorithm that, given as input (i) an MDP M and (ii) a tolerance \(\epsilon > 0\), returns a value V such that \(\left| \textrm{Val}_{\textsf{J}}\left( M \right) - V \right| \le \epsilon \).

4 Regular Markov Decision Processes

Regular Markov decision processes (RMDPs) are MDPs where states have been provided with a specific structure expressed through a regular language over some alphabet \(\varSigma \). An execution of an RMDP starts with an initial regular language \(L_0 = I\). At each step \(i \ge 0\), a decision maker or learning agent selects an action \(a_i\) from the current state \(L_i\). The environment resolves the action by sampling a transduction \(\theta _i\) from the probability distribution over \(\varTheta \) corresponding to the action, and returns the next state \(L_{i+1} = \theta _i(L_i)\) together with the reward \(\boldsymbol{r}(L_{i})\). The goal of the agent is to learn a policy for selecting actions in a manner that optimizes the value of a given objective \(\textsf{J}\) in expectation.

Definition 4

(RMDP). A regular Markov decision process (RMDP) \(\boldsymbol{R}\) is given by a tuple \(\left\langle \varSigma , I, \varTheta , A, \boldsymbol{p}, \boldsymbol{r} \right\rangle \), where

  • \(\varSigma \) is an alphabet,

  • \(I \subseteq \varSigma ^*\) is an initial regular language,

  • \(\varTheta \) is a finite set of rational transductions with type \(\varSigma ^* \rightarrow 2^{\varSigma ^*}\),

  • A is a finite set of actions,

  • \(\boldsymbol{p} : 2^{\varSigma ^*} \times A \rightarrow \textrm{Dist}\left( \varTheta \right) \) is a mapping from language-action pairs to distributions over \(\varTheta \), and

  • \(\boldsymbol{r} : 2^{\varSigma ^*} \rightarrow \mathbb {R}\) is a bounded reward function.

Semantically, \(\boldsymbol{R}\) is interpreted as a countable MDP \(\llbracket \boldsymbol{R} \rrbracket = \left\langle S, \hat{s}, A, p, r \right\rangle \). The state set is defined as \(S = \textrm{Orb}_{\varTheta }\left( I \right) \) with initial state \(\hat{s} = I\), and the transition and reward functions are such that the equations

$$\begin{aligned} p(\theta (L) \mid L, a) = \boldsymbol{p}(\theta \mid L, a) \qquad \text {and}\qquad r(L, a) = \boldsymbol{r}(L), \end{aligned}$$

hold for all languages \(L \in S\), actions \(a \in A\), and transductions \(\theta \in \varTheta \). The value of an objective \(\textsf{J}\) over an RMDP \(\boldsymbol{R}\) is defined as \(\textrm{Val}_{\textsf{J}}\left( \boldsymbol{R} \right) = \textrm{Val}_{\textsf{J}}\left( \llbracket \boldsymbol{R} \rrbracket \right) \).

An RMDP \(\boldsymbol{R}\) is called finite if the orbit \(\textrm{Orb}_{\varTheta }\left( I \right) \) is finite. An RMDP is said to be computable if the transition probability map \(\boldsymbol{p}\) and the reward function \(\boldsymbol{r}\) are computable.
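To illustrate the operational reading of Definition 4, the following Python sketch rolls out episodes of an RMDP. Languages are treated as opaque hashable objects and transductions as callables on them; the class and method names (RMDP, step, episode) are our own and not taken from the paper.

```python
import random

class RMDP:
    """Operational sketch of a regular MDP <Sigma, I, Theta, A, p, r>.
    Languages are opaque hashable objects; transductions are callables on them."""

    def __init__(self, initial, transductions, actions, prob, reward):
        self.initial = initial                    # initial regular language I
        self.transductions = transductions        # dict: name -> callable theta
        self.actions = list(actions)              # finite action set A
        self.prob = prob                          # prob(L, a) -> {theta name: probability}
        self.reward = reward                      # bounded reward function r(L)

    def step(self, language, action):
        """Sample theta ~ p(. | L, a) and return (theta(L), r(L))."""
        dist = self.prob(language, action)
        names, weights = zip(*dist.items())
        theta = self.transductions[random.choices(names, weights=weights)[0]]
        return theta(language), self.reward(language)

    def episode(self, policy, horizon):
        """Roll out a finite prefix of an episode under a memoryless policy."""
        language, history = self.initial, []
        for _ in range(horizon):
            action = policy(language)
            language, rew = self.step(language, action)
            history.append((action, rew))
        return history
```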

4.1 Undecidability of Values

Our first theoretical result establishes that value problems for RMDPs are generally undecidable.

Theorem 1

Determining whether an arbitrary RMDP satisfies any fixed non-trivial property is undecidable.

Proof

We construct, as depicted in Fig. 3, a deterministic FST that can simulate the transition relation of an arbitrary Turing machine (TM). Configurations of the TM, i.e. combinations of internal state and tape contents, are encoded as words in the regular language \(\left( 0 + 1 \right) \left( 0 + 1 \right) ^* Z \left( 0 + 1 \right) ^*\), where Z is the finite set of internal states. The index i of the single element of Z occurring in each such word represents that the tape head of the TM is at position \(i{-}1\). Assume that the TM in question includes an arbitrary transition instruction according to the following pair of rules.

  • If 0 is read in state z, then write \(b_0\), go to state \(z_0\), and move the tape head.

  • If 1 is read in state z, then write \(b_1\), go to state \(z_1\), and move the tape head.

We leave the direction of the tape-head shift undetermined and show both possibilities in Fig. 3: the red edges show the construction for the above TM transition when the tape head shifts in one direction, and the blue edges show the construction when it shifts in the other.

In combination with Rice’s theorem [46]—which states that no non-trivial property is decidable for the class of Turing machines—this construction implies the desired result.    \(\square \)

Fig. 3.

The FST from the proof of Theorem 1, simulating the transition function of a Turing machine over a binary alphabet.

It follows from Theorem 1 that optimal values of RMDPs are not computable in general.

Corollary 1

Under any objective, the RMDP value problem is undecidable.

4.2 Discounted Optimization

We now consider RMDPs under discounted objectives. Let \(x = x_1, x_2,\dots \) be a bounded infinite sequence of real numbers. Given a discount factor \(\lambda \in [0,1)\), the \(\lambda \)-discounted objective \(\textsf{D}_\lambda \) is defined as

$$\begin{aligned} \textsf{D}_\lambda (x) = \sum ^\infty _{n=1} x_n \lambda ^{n-1}. \end{aligned}$$

Over computable RMDPs, it is possible to approximate the \(\textsf{D}_\lambda \)-value to an arbitrary tolerance. This result is facilitated by properties of the discounted objective, and therefore holds even when the RMDP in question is not finite. The proof uses the standard technique of finding, given a tolerance \(\epsilon \) and a discount factor \(\lambda \), the least n such that

$$\begin{aligned} \left| \textsf{D}_\lambda (x) - \sum ^n_{k=1} \lambda ^{k-1} x_k \right| \le \epsilon . \end{aligned}$$

Theorem 2

If \(\boldsymbol{R}\) is a computable RMDP, then the \(\textsf{D}_\lambda \)-value is \(\epsilon \)-approximable, for any \(\lambda \in [0,1)\) and any \(\epsilon > 0\).

Proof

For a given RMDP \(\boldsymbol{R}\), the \(\textsf{D}_\lambda \)-value can be characterized by the following Bellman optimality equation:

$$\begin{aligned} D(L) = \max _{a \in A} \left\{ \boldsymbol{r}(L) + \lambda \sum _{\theta \in \varTheta } \boldsymbol{p}(\theta \mid L, a) D(\theta (L)) \right\} . \end{aligned}$$

This characterization follows from a more general result on MDPs with countable state spaces, finite action spaces, and bounded rewards [30]. Let b be an upper bound on the absolute value of the rewards. For a given \(\epsilon > 0\), let n be such that

$$\begin{aligned} \frac{\lambda ^{n+1}b}{1 - \lambda } \le \epsilon . \end{aligned}$$

Then a solution to the following recurrence characterizes an \(\epsilon \)-optimal value and a corresponding memoryful policy for the RMDP:

$$\begin{aligned} D_n(L) = \begin{cases} \max \limits _{a \in A} \left\{ \boldsymbol{r}(L) + \lambda \sum \limits _{\theta \in \varTheta } \boldsymbol{p}(\theta \mid L, a) D_{n-1}(\theta (L)) \right\} &\text {if } n > 0, \\ 0 &\text {otherwise.} \end{cases} \end{aligned}$$

The proof is now complete.   \(\square \)
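The proof suggests a simple approximation procedure: truncate the recurrence at a depth n for which the discounted tail is below the tolerance. The sketch below follows that recipe, reusing the illustrative RMDP interface sketched earlier in this section (our own construction, not the paper's); it assumes hashable language objects and a known bound on \(\left| \boldsymbol{r}(L) \right| \).

```python
from functools import lru_cache

def truncated_value(rmdp, lam, bound, eps):
    """Approximate the D_lambda-value of a computable RMDP to tolerance eps by
    evaluating the n-step recurrence from the proof of Theorem 2.
    `bound` bounds |r(L)|; languages must be hashable (sketch assumptions)."""
    # Choose n so that the discarded tail, at most lam^n * bound / (1 - lam),
    # is within the tolerance eps.
    n = 0
    while lam ** n * bound / (1 - lam) > eps:
        n += 1

    @lru_cache(maxsize=None)
    def d(language, k):
        if k == 0:
            return 0.0
        return max(
            rmdp.reward(language)
            + lam * sum(p * d(rmdp.transductions[name](language), k - 1)
                        for name, p in rmdp.prob(language, action).items())
            for action in rmdp.actions)

    return d(rmdp.initial, n)
```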

An RL algorithm is probably approximately correct (PAC) [53], with respect to parameters \(\epsilon > 0\) and \(\delta > 0\), if, after a number of samples of the environment that is polynomial in the relevant problem parameters (including \(1/\epsilon \) and \(1/\delta \)), it produces an \(\epsilon \)-optimal policy with probability at least \(1 - \delta \). Objective functions under which PAC algorithms exist are called PAC-learnable.

Theorem 3

For every RMDP, the \(\textsf{D}_\lambda \)-value is PAC-learnable.

Proof

Our approach for calculating \(\epsilon \)-optimal policies for the discounted objective involves computing a policy that is optimal for a fixed number of steps, denoted by n. Given \(\epsilon > 0\) and letting b again denote an upper bound on the absolute value of the rewards, we choose n such that

$$\begin{aligned} \frac{\lambda ^{n+1}b}{1 - \lambda } \le \frac{\epsilon }{2}. \end{aligned}$$

This policy can be computed on a finite-state MDP obtained by unfolding the given RMDP n times. We can then apply existing PAC-MDP algorithms [7] to compute an \(\displaystyle \frac{\epsilon }{2}\)-optimal policy, which is also an \(\epsilon \)-optimal policy for the RMDP.    \(\square \)

4.3 Finiteness Conditions

In this section, we provide three sufficient conditions to guarantee finiteness of the RMDP. Fix an arbitrary RMDP \(\boldsymbol{R} = \left( \varSigma , I, \varTheta , A, \boldsymbol{p}, \boldsymbol{r} \right) \) with semantics \(\llbracket \boldsymbol{R} \rrbracket = \left( S, \hat{s}, A, p, r \right) \).

Word-Based Condition. A transduction \(\theta : \varSigma ^* \rightarrow 2^{\varGamma ^*}\) is (i) length-preserving if \(\left| y \right| = \left| x \right| \), (ii) decreasing if \(\left| y \right| < \left| x \right| \), (iii) non-increasing if \(\left| y \right| \le \left| x \right| \), (iv) non-decreasing if \(\left| y \right| \ge \left| x \right| \), (v) increasing if \(\left| y \right| > \left| x \right| \), for all \(x \in \varSigma ^*\) and all \(y \in \theta (x)\).

Proposition 1

If I is a finite language, \(\varTheta \) is non-increasing, \(\left| \varSigma \right| = n\), and \(\max _{x \in I} \left| x \right| = m\), then \(\left| \textrm{Orb}_{\varTheta }\left( I \right) \right| \in 2^{O\left( n^m \right) }\).

Proof

The statement can be derived from the observation that the longest string possibly appearing in the image \(\theta (I)\) of a finite language I under a non-increasing transduction \(\theta \) is of length at most \(m = \max _{w \in I} \left| w \right| \). There are \(n^m\) strings of length m over an alphabet \(\varSigma \) of size n, so it follows that \(\left| \theta (I) \right| \le 1 + \sum ^m_{k=1} n^k\). More succinctly, this says that \(\left| \theta (I) \right| = O\left( n^m \right) \). Since \(\textrm{Orb}_{\varTheta }\left( I \right) \) must comprise some collection of subsets of \(\varSigma ^{\le m}\), we conclude that \(\left| \textrm{Orb}_{\varTheta }\left( I \right) \right| = 2^{O\left( n^m \right) }\).    \(\square \)

Language-Based Condition. A transduction \(\theta : \varSigma ^* \rightarrow 2^{\varSigma ^*}\) is (i) specializing if \(\theta (L) \subseteq L\), (ii) non-specializing if \(\theta (L) \not \subseteq L\), (iii) generalizing if \(L \subseteq \theta (L)\), (iv) non-generalizing if \(L \not \subseteq \theta (L)\), for all \(L \subseteq \varSigma ^*\).

Proposition 2

If \(\left| I \right| = n\) and \(\varTheta \) is specializing, then \(\left| \textrm{Orb}_{\varTheta }\left( I \right) \right| \le 2^n\).

Proof

This result follows from the observation that specializing transductions can only produce subsets of the language to which they are applied. Consequently, every language in \(\textrm{Orb}_{\varTheta }\left( I \right) \) is a subset of I, and a set of cardinality n has at most \(2^n\) subsets.    \(\square \)

Reward-Based Condition. Let \(\sim _{\boldsymbol{R}} \subseteq 2^{\varSigma ^*} \times 2^{\varSigma ^*}\) be an equivalence relation over languages such that \(L_1 \sim _{\boldsymbol{R}} L_2\) if, and only if,

$$\begin{aligned} \boldsymbol{r}(L_1) = \boldsymbol{r}(L_2) \quad \text { and }\quad \forall \theta \in \varTheta .\quad \theta (L_1) \sim _{\boldsymbol{R}} \theta (L_2). \end{aligned}$$

This relation is often useful as a means of partitioning the state space of an RMDP into a finite set of equivalence classes that respects the structure of its dynamics. For instance, it is straightforward to deduce the following proposition.

Proposition 3

If there exists \(n \in \mathbb {N}\) such that \(\boldsymbol{r}(L) = \boldsymbol{r}\left( L \cap \varSigma ^{\le n} \right) \) holds for every \(L \subseteq \varSigma ^*\), then \(\textrm{Orb}_{\varTheta }\left( I \right) \) has finitely many \(\sim _{\boldsymbol{R}}\)-equivalence classes.

4.4 Q-Learning in RMDPs

We have discussed some conditions that ensure the finiteness of \(\llbracket \boldsymbol{R} \rrbracket \) or of its quotient under \(\sim _{\boldsymbol{R}}\). When any such condition is satisfied, it becomes feasible to employ off-the-shelf RL algorithms for discounted optimization. Equation (1)—in which \([L]_{\sim }\) denotes the equivalence class of \(\sim _{\boldsymbol{R}}\) to which the language L belongs—provides an iteration scheme for a variation on the Q-learning [54] algorithm tailored for RMDPs. If the learning rates \(\left( \alpha _n \right) _{n \in \mathbb {N}}\) satisfy \(\sum _{n=1}^\infty \alpha _n = \infty \) and \(\sum _{n=1}^{\infty } \alpha _n^2 < \infty \), and the trajectory \(\left( \left[ L_n \right] _\sim , a_n, r_n \right) _{n \in \mathbb {N}}\) visits each pair \(\left( [L]_{\sim }, a \right) \) infinitely often, then iterating Eq. (1) converges almost surely to an optimal policy.

$$\begin{aligned} Q_{n+1}\left( \left[ L_{n} \right] _{\sim }, a_{n} \right) := \left( 1 - \alpha _{n} \right) Q_{n}\left( \left[ L_{n} \right] _{\sim }, a_{n} \right) + \alpha _{n} \left( r_{n} + \lambda \max _{a \in A} Q_{n}\left( \left[ L_{n+1} \right] _{\sim }, a \right) \right) \end{aligned}$$
(1)
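A minimal tabular instantiation of this scheme is sketched below, again using the illustrative RMDP interface from earlier in this section. The \(\epsilon \)-greedy exploration, the particular learning-rate schedule, and the equivalence_class callable are our own assumptions; the paper only requires the Robbins-Monro and infinite-visit conditions above.

```python
import random
from collections import defaultdict

def q_learning(rmdp, equivalence_class, episodes, horizon, lam=0.99, eps=0.1):
    """Tabular Q-learning over ~_R-equivalence classes, iterating Eq. (1).
    `rmdp` follows the sketch interface above; `equivalence_class(L)` returns a
    hashable representative of [L]_~ (both are assumptions of this sketch)."""
    Q = defaultdict(float)                       # Q[(class, action)]
    visits = defaultdict(int)
    for _ in range(episodes):
        L = rmdp.initial
        for _ in range(horizon):
            c = equivalence_class(L)
            # eps-greedy exploration keeps every (class, action) pair visited.
            if random.random() < eps:
                a = random.choice(rmdp.actions)
            else:
                a = max(rmdp.actions, key=lambda b: Q[(c, b)])
            L_next, r = rmdp.step(L, a)
            c_next = equivalence_class(L_next)
            visits[(c, a)] += 1
            alpha = 1.0 / visits[(c, a)]         # satisfies the step-size conditions
            target = r + lam * max(Q[(c_next, b)] for b in rmdp.actions)
            Q[(c, a)] = (1 - alpha) * Q[(c, a)] + alpha * target
            L = L_next
    return Q
```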

5 Deep Regular Reinforcement Learning

Generally speaking, RMDPs may have infinite state spaces, so we cannot guarantee convergence of Q-learning. In light of this fact and the uncomputability of exact discounted values—established by Theorem 1—it makes sense to consider approximate learning methods. Accordingly, we propose a deep learning approach based on using graph neural networks (GNNs). Our key insight in this context is the observation that we can use automaton representations of the states of an RMDP directly as inputs to GNNs. Much like standard deep RL uses feature vectors as inputs for neural networks, this technique uses automata—which are essentially finite labeled graphs—as inputs for GNNs. We term this approach deep regular reinforcement learning.

Before presenting experimental results, we describe the overall architecture of our learning scheme in the next subsection.

Fig. 4.

Deep regular reinforcement learning architecture.

Our graph neural network architecture is based on the graph convolution operator proposed by Kipf & Welling [39]. We perform an independent graph convolution for each letter in the input automaton, allowing each convolution to operate only over the graph connectivity for that letter, and take the mean of the resulting vectors for each node, followed by a nonlinearity. We repeat this for N layers and then concatenate the sum of all node vectors with the element-wise maximum of all node vectors. Separate multi-layer perceptrons produce the policy and state value predictions from this representation. We use proximal policy optimization (PPO) [48] for training. Figure 4 outlines this architecture. The initial embedding for each node in the automaton is a binary vector of length two, encoding whether the node is the initial state and whether it is accepting. For all experiments, the graph neural network had 3 graph convolution layers with hidden dimension 256, and the multi-layer perceptron heads had 2 layers each with hidden dimension 256. We used the LeakyReLU nonlinearity.
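The following PyTorch Geometric sketch is our best-effort reconstruction of the architecture just described; the class name, the use of GCNConv for the per-letter convolutions, and the exact head shapes are assumptions on our part, while the 2-dimensional node features, the 3 layers, the hidden dimension of 256, and the LeakyReLU nonlinearity follow the text.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_add_pool, global_max_pool

class RegularGNN(nn.Module):
    """Policy/value network over automata viewed as labeled graphs (sketch)."""

    def __init__(self, num_letters, num_actions, hidden=256, num_layers=3):
        super().__init__()
        # One graph convolution per alphabet letter, repeated for num_layers rounds.
        self.rounds = nn.ModuleList([
            nn.ModuleList([GCNConv(2 if i == 0 else hidden, hidden)
                           for _ in range(num_letters)])
            for i in range(num_layers)])
        self.act = nn.LeakyReLU()

        def head(out_dim):   # 2-layer MLP head on the pooled graph representation
            return nn.Sequential(nn.Linear(2 * hidden, hidden),
                                 nn.LeakyReLU(), nn.Linear(hidden, out_dim))
        self.policy_head, self.value_head = head(num_actions), head(1)

    def forward(self, x, edge_indices, batch):
        # x: [num_nodes, 2] node features (is-initial, is-accepting).
        # edge_indices: one [2, num_edges] tensor per letter, restricting each
        # convolution to that letter's transitions.
        # batch: [num_nodes] graph index, as in PyTorch Geometric mini-batches.
        for per_letter in self.rounds:
            x = self.act(torch.stack([conv(x, ei) for conv, ei
                                      in zip(per_letter, edge_indices)]).mean(dim=0))
        pooled = torch.cat([global_add_pool(x, batch),
                            global_max_pool(x, batch)], dim=-1)
        return self.policy_head(pooled), self.value_head(pooled)
```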

In the remainder of this section, we present specific examples of regular RL problems and provide experimental results to illustrate the effectiveness of deep regular reinforcement learning.

5.1 Token Passing

We first consider the token passing scenario of Example 1 (cf. Fig. 1). Note that this example admits a partitioning of the environment via the equivalence relation defined in Sect. 4.3: there are two equivalence classes, corresponding to whether the number of n symbols before the first t symbol is even or odd. We compare using the GNN on the original representation (GNN) and on the representation formed by the two equivalence classes (GNN+EC). Figure 5 shows the FSA representations used for the two equivalence classes.
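For reference, the equivalence class of a (safe) configuration can be read off directly from the word; the small helper below is our own word-level illustration, whereas the GNN+EC agent observes the corresponding two-state automata of Fig. 5.

```python
def token_class(word):
    """Equivalence class of a single-token configuration: the parity of the
    number of n symbols preceding the first t (a word-level simplification)."""
    return "even" if word.index("t") % 2 == 0 else "odd"

print(token_class("nntnn"), token_class("ntnnn"))   # even odd
```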

Fig. 5.

Automata used to represent even and odd equivalence classes.

The hyperparameters we used for PPO in this case study were 1024 steps per update, a batch size of 512, 4 optimization epochs, a clip range of \(\epsilon = 0.2\), and a discount factor of \(\lambda = 0.99\). The learning curves are shown in Fig. 6.

Fig. 6.

Reward curves for the token passing case study.

Note that under the selected network architecture, the task of determining whether the number of n symbols occurring before the first t symbol is even or odd falls largely to the multi-layer perceptron components. Roughly, the number of n symbols before the first t symbol is encoded in unary in the global sum component of the graph representation. The multi-layer perceptrons must then detect parity on this unary representation, which is challenging. To encourage learning, we use a denser reward of 1 on every successful step and \(-1\) on failure, up to a time limit of 30. Although alternative network architectures have the potential to perform better, the simple two-state equivalence class representation is still expected to result in faster learning than the unmodified representation. The learning run is shortened, and the maximum episode length is set to 30, to highlight the difference between these two setups. We see that forming equivalence classes leads to an increase in learning speed.

5.2 Duplicating Pebbles

Consider a grid world with multiple pebbles on it. The agent can select two adjacent directions, e.g., “up” and “right”, and every pebble is then duplicated and moved in each of these directions. The goal of the agent is to have at least one pebble reach the goal cell while ensuring that no pebble accumulates a cost greater than the threshold \(t=2\). Each time a pebble passes over a trap cell, that pebble incurs a cost of 1.

Although the number of pebbles grows exponentially, doubling after each action, the set of paths the pebbles take has an FSA representation. Namely, one can represent the growing set of paths by adding a fresh state to the FSA, with two transitions into it from the previously accepting state labeled by the two selected directions; the fresh state is then marked as the only accepting state. The language of this FSA corresponds to all paths that pebbles have taken. A reward of \(-1\) is given on failure and a reward of 1 is given on success. The grid layout is shown in Fig. 7, where the initial pebble begins in the top left. The agent learns the optimal policy “down, right”, “down, right”, “up, left”, “up, left” in about 10k training steps. Figure 7 shows the execution of this optimal policy, from left to right, top to bottom. Traps are denoted by red cells and the goal is denoted by a green cell. The number in a cell counts the number of pebbles it contains.
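A sketch of this incremental construction, using a simple dictionary encoding of the chain-shaped FSA (our own encoding, not the paper's), is given below.

```python
def extend_path_fsa(fsa, directions):
    """Grow the chain-shaped FSA of pebble paths by one step: add a fresh state,
    add one transition into it from the current accepting state for each chosen
    direction, and make the fresh state the only accepting state."""
    old_final, new_final = fsa["final"], fsa["num_states"]
    for d in directions:                              # e.g. ("down", "right")
        fsa["delta"].setdefault((old_final, d), set()).add(new_final)
    fsa["num_states"] += 1
    fsa["final"] = new_final
    return fsa

# Initially the only pebble path is the empty one: a single accepting state 0.
paths = {"num_states": 1, "final": 0, "delta": {}}
paths = extend_path_fsa(paths, ("down", "right"))
paths = extend_path_fsa(paths, ("down", "right"))
# The FSA now accepts the four length-2 direction sequences taken by pebbles.
```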

The hyperparameters we used for PPO in this case study were 512 steps per update, a batch size of 128, 4 optimization epochs, a clip range of \(\epsilon = 0.2\), and a discount factor of \(\lambda = 0.99\). The resulting reward curve is shown in Fig. 8.

Since the state is represented as an FSA of the possible trajectories, a linear program is solved at each step to find the highest-cost path, which is needed to compute the reward. Note that when “up, left” is first performed, some pebbles wrap around to the opposite side of the grid. If “down, right” were performed 3 times instead of twice, then the agent would fail the objective, since the 2 pebbles at (1, 2) on the grid would duplicate and visit the trap cell again after having already visited it once.
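For illustration, the same quantity can also be obtained by a dynamic program over (step, cell) pairs rather than a linear program; the following sketch makes assumptions about the grid encoding (wrap-around moves, cost 1 for entering a trap cell) and uses a hypothetical trap layout rather than the one in Fig. 7.

```python
def max_pebble_cost(start, direction_pairs, traps, width, height):
    """Worst accumulated trap cost over all pebble paths, via dynamic
    programming over (step, cell) instead of a linear program.  Wrap-around
    moves and a cost of 1 for entering a trap cell are assumptions here."""
    moves = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
    best = {start: 0}                              # cell -> worst cost reaching it
    for pair in direction_pairs:                   # e.g. ("down", "right")
        nxt = {}
        for (x, y), cost in best.items():
            for direction in pair:
                dx, dy = moves[direction]
                cell = ((x + dx) % width, (y + dy) % height)
                total = cost + (1 if cell in traps else 0)
                nxt[cell] = max(nxt.get(cell, -1), total)
        best = nxt
    return max(best.values())

# Hypothetical 4x4 layout (not the one of Fig. 7), two steps of "down, right".
print(max_pebble_cost((0, 0), [("down", "right")] * 2,
                      traps={(1, 2), (2, 1)}, width=4, height=4))
```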

Fig. 7.

Execution of the optimal policy for the duplicating pebbles case study.

Fig. 8.

Reward curve for the duplicating pebbles case study.

5.3 Shunting Yard Algorithm

To showcase the representation of unbounded data structures such as stacks and queues as a strength of RRL, we consider learning the shunting yard algorithm [28], which transforms an expression from infix notation to postfix notation.

Fig. 9.

Runs produced by the learned policy for the shunting yard algorithm.

We represent the input as a regular language consisting of a single string containing the concatenation of the infix notation input, the stack, and the output, each separated by a special symbol “#”. The agent has three actions:

  • moving the first character of the input to the output,

  • pushing the first character of the input to the stack, and

  • popping the top character on the stack to the output.

We generate random infix notation expressions and give a reward of \(-1\) if the output is an invalid postfix expression, a reward of 0.5 if the output is a valid postfix expression that evaluates to the wrong value, and a reward of 1 if the output evaluates to the correct value. The agent is able to learn an effective strategy in about 100k time steps.
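A word-level sketch of the state representation and the three actions listed above is given below. In the actual RRL setup the state is an FSA accepting exactly this string; keeping the stack top at the left and rejecting moves on an empty input or stack are assumptions of the sketch.

```python
def apply_action(state, action):
    """One step on the 'input#stack#output' string (stack top kept at the left).
    Invalid moves (empty input or stack) leave the state unchanged here."""
    inp, stack, out = state.split("#")
    if action == "to_output" and inp:        # move first input char to the output
        return f"{inp[1:]}#{stack}#{out + inp[0]}"
    if action == "push" and inp:             # push first input char onto the stack
        return f"{inp[1:]}#{inp[0] + stack}#{out}"
    if action == "pop" and stack:            # pop the top of the stack to the output
        return f"{inp}#{stack[1:]}#{out + stack[0]}"
    return state

state = "1+2*3##"
for a in ["to_output", "push", "to_output", "push", "to_output", "pop", "pop"]:
    state = apply_action(state, a)
print(state)                                 # '##123*+', the postfix form of 1+2*3
```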

Figure 9 shows example runs produced by the learned policy. The representation used during learning is an FSA accepting a single string, but we print the string for clarity. The actions shown are those selected upon observing the corresponding state, and the last state shown is the state at termination.

The hyperparameters we used for PPO in this case study were 1024 steps per update, a batch size of 128, 10 optimization epochs, a clip range of \(\epsilon = 0.2\), and a discount factor of \(\lambda = 0.99\). The resulting reward curve is shown in Fig. 10.

Fig. 10.

Reward curve for the shunting yard algorithm case study.

5.4 Modified Tangrams

This case study examines the application of deep RRL to variations of geometric puzzles known as tangrams, which involve arranging a finite set of polygonal tiles on a flat surface to create a picture. The picture is typically a silhouette in the shape of a common object such as a building or a tree, and the puzzle is completed once the tiles have been arranged into a configuration that covers the silhouette exactly. A standard tangram set includes a collection of target pictures, 5 right triangles (2 large, 1 medium, and 2 small), a square, and a parallelogram. We consider modified tangrams, which we qualify as such because the pieces do not coincide with the standard tile set. An example is displayed in Fig. 11.

Fig. 11.

A modified tangram. The goal is to cover the gray shape at the center resembling the symbol “\(\times \)” by rearranging the colored tiles \(\left\{ U, V, W, X, Y, Z \right\} \). (Color figure online)

In order to cast these sorts of puzzles into the RRL framework, we apply standard notions used in positional numeration systems to connect geometric shapes and regular languages. More precisely, tiles are considered as sets of points in the unit square \([0,1] \times [0,1]\) of the Euclidean plane. Then, sets of points are encoded by regular languages consisting of digital expansions of these points.

For a numeration base \(b \in \mathbb {N}\), define a map \(\left\langle \!\left\langle \, \cdot \,\right\rangle \!\right\rangle _b : \left\{ 0,\dots ,b-1 \right\} ^* \rightarrow (0,1)\) as

$$\begin{aligned} \left\langle \!\left\langle \, w \,\right\rangle \!\right\rangle _b = \sum ^{\left| w \right| }_{k=1} \frac{w_k}{b^k} \end{aligned}$$

to interpret each string of digits w as a base-b digital expansion (where the left-most symbol is the most significant digit) of a number \(\left\langle \!\left\langle \, w \,\right\rangle \!\right\rangle _b\) in the unit interval. Such interpretations extend to languages so that

$$\begin{aligned} \left\langle \!\left\langle \, L \,\right\rangle \!\right\rangle _b = \left\{ \left\langle \!\left\langle \, w \,\right\rangle \!\right\rangle _b : w \in L \right\} . \end{aligned}$$

We fix the base as \(b = 2\) and consider automata over the two-dimensional Boolean alphabet \(\mathbb {B}^2\) to encode points in the plane. We design automata capturing languages that represent the sets of points included in particular shapes, as illustrated in Fig. 12.
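The following small sketch shows the decoding direction of this encoding: it maps a word over the two-dimensional alphabet back to a point of the unit square. Reading the first component of each pair as the x digit and the second as the y digit is an assumption on our part.

```python
def decode(word, base=2):
    """<< w >>_b : interpret a digit string as a base-b expansion in [0, 1)."""
    return sum(int(d) / base ** (k + 1) for k, d in enumerate(word))

def decode_point(word_2d):
    """Decode a word over the alphabet B^2 (a sequence of digit pairs) into a
    point of the unit square; pair components are read as (x digit, y digit)."""
    xs = "".join(pair[0] for pair in word_2d)
    ys = "".join(pair[1] for pair in word_2d)
    return decode(xs), decode(ys)

# Three binary digits of the point (0.5, 0.25):
print(decode_point([("1", "0"), ("0", "1"), ("0", "0")]))   # (0.5, 0.25)
```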

Fig. 12.

Automata corresponding to some of the starting tiles shown in Fig. 11.1.

Remark 2

Automata \(\mathcal {A}_X\) and \(\mathcal {A}_Z\) may be obtained from the automata \(\mathcal {A}_W\) (Fig. 12.3) and \(\mathcal {A}_Y\) (Fig. 12.4), respectively. This can be done by taking the logical complement of the x-coordinate on every non-looping transition and exchanging pairs of self loops on a common state labeled by \(\begin{pmatrix} 0 & 0 \end{pmatrix}\) and \(\begin{pmatrix} 1 & 1 \end{pmatrix}\) for ones labeled by \(\begin{pmatrix} 0 & 1 \end{pmatrix}\) and \(\begin{pmatrix} 1 & 0 \end{pmatrix}\).

We also design finite-state transducers, as illustrated in Fig. 13, for basic geometric operations such as translation by 1/2, translation by 1/4, and reflection across \(x=1/2\) and \(y=1/2\).

Fig. 13.

FSTs implementing some rigid transformations on the unit square. Arbitrary digits are represented by d, while \(*\) represents arbitrary pairs of digits.

Fig. 14.

Annotated reward curve for the modified tangram example.

The agent’s goal is to apply these basic transformations to move each shape from its initial position into the goal region. To reduce the number of actions, the agent selects transformations for one of the shapes at a time and uses a special “submit” action to move to the next shape. We treat the collection of automata as a single nondeterministic FSA, and specially mark the alphabet of the active automaton in the collection. Rewards are proportional to the overlap with the remaining exposed target shape when the submit action is used. On all other steps, a reward of \(-0.01\) is given to encourage promptness.

The hyperparameters used for PPO in this case study were 256 steps per update, a batch size of 64, 10 optimization epochs, a clip range of \(\epsilon = 0.1\), and a discount factor of \(\lambda = 0.99\). The resulting reward curve—which we annotate at various points to show the agent’s progress—is shown in Fig. 14.

6 Conclusion

This paper introduced a framework for symbolic reinforcement learning, dubbed regular reinforcement learning, where system states are modeled as regular languages and transition relations are modeled as rational transductions. We established theoretical results about the limitations and capabilities of this framework, proving that optimal values and policies are approximable and efficiently learnable under discounted payoffs. Furthermore, we developed an approach to deep regular reinforcement learning that combines aspects of deep learning and symbolic representation via the use of graph neural networks. Through a variety of case studies, we illustrated the effectiveness of deep regular reinforcement learning.