1 Introduction

Many tasks in natural language processing, computational biology, reinforcement learning, and time series analysis rely on learning with sequential data, i.e. estimating functions defined over sequences of observations from training data. Weighted finite automata (WFA) and recurrent neural networks (RNN) are two powerful and flexible classes of models which can efficiently represent such functions. On the one hand, WFA are tractable, they encompass a wide range of machine learning models (they can for example compute any probability distribution defined by a hidden Markov model (HMM), Denis & Esposito, 2008, and can model the transition and observation behavior of partially observable Markov decision processes, Thon & Jaeger, 2015) and they offer appealing theoretical guarantees. In particular, the so-called spectral methods for learning HMMs (Hsu et al., 2009), WFA (Bailly et al., 2009; Balle et al., 2014) and related models (Glaude & Pietquin, 2016; Boots et al., 2011) provide an alternative to Expectation-Maximization based algorithms that is both computationally efficient and consistent. On the other hand, RNN are remarkably expressive models—they can represent any computable function (Siegelmann & Sontag, 1992)—and they have successfully tackled many practical problems in speech and audio recognition (Graves et al., 2013; Mikolov et al., 2011; Gers et al., 2000), but their theoretical analysis is difficult. Even though recent work provides interesting results on their expressive power (Khrulkov et al., 2018; Yu et al., 2017) as well as alternative training algorithms coming with learning guarantees (Sedghi & Anandkumar, 2016), the theoretical understanding of RNN is still limited.

At the same time, tensor networks are a generalization of tensor decomposition techniques, where complex operations between tensors are represented in a simple diagrammatic notation, allowing one to intuitively represent intricate ways to decompose a high-order tensor into lower-order tensors acting as building blocks. The term tensor networks also encompasses a set of optimization techniques to efficiently tackle optimization problems in very high-dimensional spaces, where the optimization variable is represented as a tensor network and the optimization process is carried out with respect to the building blocks of the tensor network. As an illustration, such optimization techniques make it possible to efficiently approximate the leading eigenvectors of matrices of size \(2^N \times 2^N\) where N can be as large as 50 (Holtz et al., 2012). Tensor networks have emerged in the quantum physics community to model many-body systems (Orús, 2014; Biamonte & Bergholm, 2017) and have also been used in numerical analysis as a means to solve high-dimensional differential equations (Oseledets, 2011; Lubich et al., 2013) and to design efficient algorithms for big data analytics (Cichocki et al., 2016). Tensor networks have recently been used in the context of machine learning to compress neural networks (Novikov et al., 2015, 2014; Ma et al., 2019; Yang et al., 2017), to design new approaches and optimization techniques borrowed from the quantum physics literature for supervised and unsupervised learning tasks (Stoudenmire & Schwab, 2016; Han et al., 2018; Miller et al., 2020), as new theoretical tools to understand the expressiveness of neural networks (Cohen et al., 2016; Khrulkov et al., 2018) and for image completion problems (Yang et al., 2017; Wang et al., 2017), among others.
In this work, we bridge a gap between these three classes of models: weighted automata, tensor networks and recurrent neural networks. We first exhibit an intrinsic relation between the computation of a weighted automaton and the tensor train decomposition, a particular form of tensor network (also known as matrix product states in the quantum physics community). While such a connection has been sporadically noticed previously, we demonstrate how this relation implies a low-rank tensor train structure of the so-called Hankel matrix of a function computed by a WFA. The Hankel matrix of a function is at the core of the spectral learning algorithm for WFA. This algorithm relies on the fact that the (matrix) rank of the Hankel matrix is directly related to the size of a WFA computing the function it represents. We show that, beyond being low rank, the Hankel matrix of a function computed by a WFA can be seen as a block matrix where each block is a matricization of a tensor with low tensor train rank. Building upon this result, we design an efficient implementation of the spectral learning algorithm that leverages this tensor train structure. When the Hankel matrices needed for the spectral algorithm are given in the tensor train format, the time complexity of the algorithm we propose is exponentially smaller (w.r.t. the size of the Hankel matrix) than that of the classical spectral learning algorithm.

We then unravel a fundamental connection between WFA and second-order RNN (2-RNN): when considering input sequences of discrete symbols, 2-RNN with linear activation functions and WFA are one and the same, i.e. they are expressively equivalent and there exists a one-to-one mapping between the two classes (moreover, this mapping preserves model sizes). While connections between finite state machines (e.g. deterministic finite automata) and recurrent neural networks have been noticed and investigated in the past (see e.g. Giles et al., 1992; Omlin & Giles, 1996), to the best of our knowledge this is the first time that such a rigorous equivalence between linear 2-RNN and weighted automata is explicitly formalized. More precisely, we pinpoint exactly the class of recurrent neural architectures to which weighted automata are equivalent, namely second-order RNN with linear activation functions. This result naturally leads to the observation that linear 2-RNN are a natural generalization of WFA (which take sequences of discrete observations as inputs) to sequences of continuous vectors, and raises the question of whether the spectral learning algorithm for WFA can be extended to linear 2-RNN.

The third contribution of this paper is to show that the answer is positive: building upon the classical spectral learning algorithm for WFA (Hsu et al., 2009; Bailly et al., 2009; Balle et al., 2014) and its recent extension to vector-valued functions (Rabusseau et al., 2017), we propose the first provable learning algorithm for second-order RNN with linear activation functions. Our learning algorithm relies on estimating sub-blocks of the so-called Hankel tensor, from which the parameters of a linear 2-RNN can be recovered using basic linear algebra operations. One of the key technical difficulties in designing this algorithm resides in estimating these sub-blocks from training data where the inputs are sequences of continuous vectors.
We leverage multilinear properties of linear 2-RNN and the tensor train structure of the Hankel matrix to perform this estimation efficiently using matrix sensing and tensor recovery techniques. In particular, we show that the Hankel matrices needed for learning can be estimated directly in the tensor train format, which allows us to use the efficient spectral learning algorithm in the tensor train format discussed previously. We validate our theoretical findings in a simulation study on synthetic and real-world data, where we experimentally compare different recovery methods and investigate the robustness of our algorithm to noise. We also show that refining the estimator returned by our algorithm using stochastic gradient descent can lead to significant improvements.

1.1 Summary of contributions

We present novel connections between WFA and the tensor train decomposition (Sect. 3.1) allowing us to design a highly efficient implementation of the spectral learning algorithm in the tensor train format (Sect. 3.2). We formalize a strict equivalence between weighted automata and second-order RNN with linear activation functions (Sect. 4), showing that linear 2-RNN can be seen as a natural extension of (vector-valued) weighted automata to input sequences of continuous vectors. We then propose a consistent learning algorithm for linear 2-RNN (Sect. 5). The relevance of our contributions can be seen from three perspectives. First, while learning feed-forward neural networks with linear activation functions is a trivial task (it reduces to linear or reduced-rank regression), this is not at all the case for recurrent architectures with linear activation functions; to the best of our knowledge, our algorithm is the first consistent learning algorithm for the class of functions computed by linear second-order recurrent networks. Second, from the perspective of learning weighted automata, we propose a natural extension of WFA to continuous inputs and our learning algorithm addresses the long-standing limitation of the spectral learning method to discrete inputs. Lastly, by connecting the spectral learning algorithm for WFA to recurrent neural networks on one side, and tensor networks on the other, our work opens the door to leveraging highly efficient optimization techniques for large scale tensor problems used in the quantum physics community for designing new learning algorithms for both linear and non-linear sequential models, as well as offering new tools for the theoretical analysis of these models.

1.2 Related work

Combining the spectral learning algorithm for WFA with matrix completion techniques (a problem which is closely related to matrix sensing) has been theoretically investigated in Balle and Mohri (2012). An extension of probabilistic transducers to continuous inputs (along with a spectral learning algorithm) has been proposed in Recasens and Quattoni (2013). The model considered in this work is closely related to the continuous extension of WFA we consider here but the learning algorithm proposed in Recasens and Quattoni (2013) is designed for (and limited to) stochastic transducers, whereas we consider arbitrary functions computed by linear 2-RNN. The connections between tensors and RNN have been previously leveraged to study the expressive power of RNN in Khrulkov et al. (2018) and to achieve model compression in Yu et al. (2017), Yang et al. (2017) and Tjandra et al. (2017). Exploring relationships between RNN and automata has recently received a renewed interest (Peng et al., 2018; Chen et al., 2018; Li et al., 2018; Merrill et al., 2020). In particular, such connections have been explored for interpretability purposes (Weiss et al., 2018; Ayache et al., 2018) and the ability of RNN to learn classes of formal languages has been investigated in Avcu et al. (2017). Connections between the tensor train decomposition and WFA have been previously noticed in Critch (2013), Critch and Morton (2014) and Rabusseau (2016). However, to the best of our knowledge, this is the first time that the tensor-train structure of the Hankel matrix of a function computed by a WFA is noticed and leveraged to design an efficient spectral learning algorithm for WFA. Other approaches have been proposed to scale the spectral learning algorithm to large datasets, notably by identifying a small basis of informative prefixes and suffixes to build the Hankel matrices (Quattoni et al., 2017). The predictive state RNN model introduced in Downey et al. (2017) is closely related to 2-RNN and the authors propose to use the spectral learning algorithm for predictive state representations to initialize a gradient based algorithm; their approach however comes without theoretical guarantees. Lastly, a provable algorithm for RNN relying on the tensor method of moments has been proposed in Sedghi and Anandkumar (2016) but it is limited to first-order RNN with quadratic activation functions (which do not encompass linear 2-RNN).

2 Preliminaries

In this section, we first present basic notions of tensor algebra and tensor networks before introducing second-order recurrent neural networks, weighted finite automata and the spectral learning algorithm. We start by introducing some notation. For any integer k we use [k] to denote the set of integers from 1 to k. We use \(\lceil l \rceil\) to denote the smallest integer greater than or equal to l. For any set \(S\), we denote by \(S^*=\bigcup _{k\in \mathbb {N}}S^k\) the set of all finite-length sequences of elements of \(S\) (in particular, \(\varSigma ^*\) will denote the set of strings on a finite alphabet \(\varSigma\)). We use lower case bold letters for vectors (e.g. \(\mathbf {v} \in \mathbb {R}^{d_1}\)), upper case bold letters for matrices (e.g. \(\mathbf {M}\in \mathbb {R}^{d_1 \times d_2}\)) and bold calligraphic letters for higher order tensors (e.g. \({\varvec{\mathcal {T}}}\in \mathbb {R}^{d_1 \times d_2 \times d_3}\)). We use \(\mathbf {e}_i\) to denote the ith canonical basis vector of \(\mathbb {R}^d\) (where the dimension d will always be clear from context). The \(d\times d\) identity matrix will be written as \(\mathbf {I}_d\). The ith row (resp. column) of a matrix \(\mathbf {M}\) will be denoted by \(\mathbf {M}_{i,:}\) (resp. \(\mathbf {M}_{:,i}\)). This notation is extended to slices of a tensor in the straightforward way. If \(\mathbf {v} \in \mathbb {R}^{d_1}\) and \(\mathbf {v}' \in \mathbb {R}^{d_2}\), we use \(\mathbf {v} \otimes \mathbf {v}' \in \mathbb {R}^{d_1 \cdot d_2}\) to denote the Kronecker product between vectors, and its straightforward extension to matrices and tensors. Given a matrix \(\mathbf {M}\in \mathbb {R}^{d_1 \times d_2}\), we use \(\mathrm {vec}(\mathbf {M}) \in \mathbb {R}^{d_1 \cdot d_2}\) to denote the column vector obtained by concatenating the columns of \(\mathbf {M}\). The inverse of \(\mathbf {M}\) is denoted by \(\mathbf {M}^{-1}\), its Moore–Penrose pseudo-inverse by \(\mathbf {M}^\dagger\), and the transpose of its inverse by \(\mathbf {M}^{-\top }\); the Frobenius norm is denoted by \(\Vert \mathbf {M}\Vert _F\) and the nuclear norm by \(\Vert \mathbf {M}\Vert _*\).
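As a concrete illustration of this notation, here is a minimal NumPy sketch (dimensions and values chosen arbitrarily, for illustration only):

    import numpy as np

    # small sketch of the notation above, using standard NumPy calls
    v, w = np.array([1., 2.]), np.array([3., 4., 5.])
    kron = np.kron(v, w)                       # Kronecker product v (x) w, a vector of size d1 * d2
    M = np.arange(6.).reshape(2, 3)
    vec_M = M.flatten(order='F')               # vec(M): columns of M stacked into a vector
    M_pinv = np.linalg.pinv(M)                 # Moore-Penrose pseudo-inverse M^+
    fro = np.linalg.norm(M, 'fro')             # Frobenius norm
    nuc = np.linalg.norm(M, 'nuc')             # nuclear norm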

2.1 Tensors and tensor networks

We first recall basic definitions of tensor algebra; more details can be found in Kolda and Bader (2009). A tensor \({\varvec{\mathcal {T}}}\in \mathbb {R}^{d_1\times \cdots \times d_p}\) can simply be seen as a multidimensional array \(( {\varvec{\mathcal {T}}}_{i_1,\ldots ,i_p}\ : \ i_n\in [d_n], n\in [p])\). The mode-n fibers of \({\varvec{\mathcal {T}}}\) are the vectors obtained by fixing all indices except the nth one, e.g. \({\varvec{\mathcal {T}}}_{:,i_2,\ldots ,i_p}\in \mathbb {R}^{d_1}\). The nth mode matricization of \({\varvec{\mathcal {T}}}\) is the matrix having the mode-n fibers of \({\varvec{\mathcal {T}}}\) for columns and is denoted by \({\varvec{\mathcal {T}}}_{(n)}\in \mathbb {R}^{d_n\times d_1\ldots d_{n-1}d_{n+1}\ldots d_p}\). The vectorization of a tensor is defined by \(\mathrm {vec}( {\varvec{\mathcal {T}}})=\mathrm {vec}( {\varvec{\mathcal {T}}}_{(1)})\). In the following \({\varvec{\mathcal {T}}}\) always denotes a tensor of size \(d_1\times \cdots \times d_p\). The mode-n matrix product of the tensor \({\varvec{\mathcal {T}}}\) and a matrix \(\mathbf {X}\in \mathbb {R}^{m\times d_n}\) is a tensor denoted by \({\varvec{\mathcal {T}}}\times _{n}\mathbf {X}\). It is of size \(d_1\times \cdots \times d_{n-1}\times m \times d_{n+1}\times \cdots \times d_p\) and is defined by the relation \({\varvec{\mathcal {Y}}}= {\varvec{\mathcal {T}}}\times _{n}\mathbf {X}\Leftrightarrow {\varvec{\mathcal {Y}}}_{(n)} = \mathbf {X} {\varvec{\mathcal {T}}}_{(n)}\). The mode-n vector product of the tensor \({\varvec{\mathcal {T}}}\) and a vector \(\mathbf {v}\in \mathbb {R}^{d_n}\) is a tensor defined by \({\varvec{\mathcal {T}}}\bullet _{n}\mathbf {v} = {\varvec{\mathcal {T}}}\times _{n}\mathbf {v}^\top \in \mathbb {R}^{d_1\times \cdots \times d_{n-1}\times d_{n+1}\times \cdots \times d_p}\). It is easy to check that the n-mode product satisfies \(( {\varvec{\mathcal {T}}}\times _{n}\mathbf {A})\times _{n}\mathbf {B} = {\varvec{\mathcal {T}}}\times _{n}\mathbf {BA}\) where we assume compatible dimensions of the tensor \({\varvec{\mathcal {T}}}\) and the matrices \(\mathbf {A}\) and \(\mathbf {B}\). Tensor network diagrams allow one to represent complex operations on tensors in a graphical and intuitive way. A tensor network is simply a graph where nodes represent tensors, and edges represent contractions between tensor modes, i.e. a summation over an index shared by two tensors. In a tensor network, the arity of a vertex (i.e. the number of legs of a node) corresponds to the order of the tensor: a node with one leg represents a vector, a node with two legs represents a matrix, and a node with three legs represents a 3rd order tensor (see Fig. 1). We will sometimes add indices to legs of a tensor network to refer to its components or sub-tensors. For example, the following tensor networks represent a matrix \(\mathbf {A}\in \mathbb {R}^{m\times n}\), the ith row of \(\mathbf {A}\) and the component \(\mathbf {A}_{i,j}\) respectively:

figure a
Fig. 1

Tensor network representation of a vector \({\varvec{\mathrm{v}}}\in \mathbb {R}^d\), a matrix \({\varvec{\mathrm{M}}}\in \mathbb {R}^{m\times n}\) and a tensor \({\varvec{\mathcal {T}}}\in \mathbb {R}^{d_1\times d_2\times d_3}\). The gray labels over the edges indicate the dimensions of the corresponding modes of the tensors (such labels will only be sporadically displayed when necessary to avoid confusion)

Connecting two legs in a tensor network represents a contraction over the corresponding indices. Consider the following simple tensor network with two nodes:

figure b

The first node represents a matrix \({\varvec{\mathrm{A}}}\in \mathbb {R}^{m\times n}\) and the second one a vector \(\mathrm{x}\in \mathbb {R}^{n}\). Since this tensor network has one dangling leg (i.e. an edge which is not connected to any other node), it represents a vector. The edge between the second leg of \({\varvec{\mathrm{A}}}\) and the leg of \({\mathrm{x}}\) corresponds to a summation over the second mode of \({\varvec{\mathrm{A}}}\) and the first mode of \({\mathrm{x}}\). Hence, the resulting tensor network represents the classical matrix-vector product, which can be seen by calculating the ith component of this tensor network:

figure c

Other examples of tensor network representations of common operations on vectors, matrices and tensors can be found in Fig. 2.

Fig. 2

Tensor network representation of common operations on vectors, matrices and tensors
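To make the diagrammatic operations of Figs. 1 and 2 concrete, the following minimal NumPy sketch (with arbitrary dimensions and 0-indexed modes; the column ordering of the matricization is one common convention, used consistently throughout the snippet) implements the mode-n matricization, the mode-n matrix product and the contraction of two legs:

    import numpy as np

    def matricize(T, n):
        # mode-n matricization: mode-n fibers of T become the columns
        return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

    T = np.random.rand(2, 3, 4)               # a 3rd order tensor
    X = np.random.rand(5, 3)                  # a matrix acting on the second mode

    Y = np.einsum('ijk,mj->imk', T, X)        # mode-2 product T x_2 X (size 2 x 5 x 4)
    assert np.allclose(matricize(Y, 1), X @ matricize(T, 1))   # (T x_2 X)_(2) = X T_(2)

    A, x = np.random.rand(4, 6), np.random.rand(6)
    Ax = np.einsum('ij,j->i', A, x)           # contracting two legs: matrix-vector product
    assert np.allclose(Ax, A @ x)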

Given strictly positive integers \(n_1,\ldots , n_k\) satisfying \(\sum _i n_i = p\), we use the notation \({( {\varvec{\mathcal {T}}})}_{\langle \!\langle n_1,n_2,\ldots ,n_k\rangle \!\rangle }\) to denote the kth order tensor obtained by reshaping \({\varvec{\mathcal {T}}}\in \mathbb {R}^{d_1\times \cdots \times d_p}\) into a tensorFootnote 1 of size

$$\begin{aligned} (\prod _{i_1=1}^{n_1} d_{i_1}) \times (\prod _{i_2=1}^{n_2} d_{n_1 + i_2}) \times \cdots \times (\prod _{i_k=1}^{n_k} d_{n_1+\cdots +n_{k-1} + i_k}). \end{aligned}$$

For example, for a tensor \({\varvec{\mathcal {A}}}\) of size \(2\times 3\times 4\times 5\times 6\), the 3rd order tensor \({({\varvec{\mathcal {A}}})}_{\langle \!\langle 2,1,2\rangle \!\rangle }\) is obtained by grouping the first two modes and the last two modes respectively, to obtain a tensor of size \(6\times 4 \times 30\). This reshaping operation is related to vectorization and matricization by the following relations: \({( {\varvec{\mathcal {T}}})}_{\langle \!\langle p\rangle \!\rangle } = \mathrm {vec}( {\varvec{\mathcal {T}}})\) and \(({\varvec {\mathcal {T}}})_{\langle \!\langle 1,p-1\rangle \!\rangle } = {\varvec {\mathcal {T}}}_{(1)}\). A rank R tensor train (TT) decomposition (Oseledets, 2011) of a tensor \({\varvec{\mathcal {T}}}\in {\mathbb {R}}^{d_1\times \cdots \times d_p}\) consists in factorizing \({\varvec{\mathcal {T}}}\) into the product of p core tensors \({\varvec{\mathcal {G}}}_1\in \mathbb {R}^{d_1\times R}, {\varvec{\mathcal {G}}}_2\in \mathbb {R}^{R\times d_2\times R}, \ldots , {\varvec{\mathcal {G}}}_{p-1}\in \mathbb {R}^{R\times d_{p-1} \times R}, {\varvec{\mathcal {G}}}_p \in \mathbb {R}^{R\times d_p}\), and is definedFootnote 2 by

$$\begin{aligned} {\varvec{\mathcal {T}}}_{i_1,\ldots ,i_p} = ( {\varvec{\mathcal {G}}}_1)_{i_1,:}( {\varvec{\mathcal {G}}}_2)_{:,i_2,:}\ldots ( {\varvec{\mathcal {G}}}_{p-1})_{:,i_{p-1},:}( {\varvec{\mathcal {G}}}_p)_{:,i_p} \end{aligned}$$
(1)

for all indices \(i_1\in [d_1],\ldots ,i_p\in [d_p]\) (here \(( {\varvec{\mathcal {G}}}_1)_{i_1,:}\) is a row vector, \(( {\varvec{\mathcal {G}}}_2)_{:,i_2,:}\) is an \(R\times R\) matrix, etc.). We will use the notation \({\varvec{\mathcal {T}}}= \llbracket {\varvec{\mathcal {G}}}_1,\ldots , {\varvec{\mathcal {G}}}_p \rrbracket\) to denote such a decomposition. A tensor network representation of this decomposition is shown in Fig. 3. The name of this decomposition comes from the fact that the tensor \({\varvec {\mathcal {T}}}\) is decomposed into a train of lower-order tensors. This decomposition is also known in the quantum physics community as Matrix Product States (Orús, 2014; Schollwöck, 2011), where this denomination comes from the fact that each entry of \({\varvec {\mathcal {T}}}\) is given by a product of matrices, see Eq. (1).

Fig. 3

Tensor network representation of a tensor train decomposition
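The following short NumPy sketch illustrates Eq. (1) for a third order tensor with randomly drawn cores (the dimensions and the TT-rank are arbitrary):

    import numpy as np

    # minimal sketch of Eq. (1) for p = 3 and TT-rank R
    d1, d2, d3, R = 2, 3, 4, 5
    G1, G2, G3 = np.random.rand(d1, R), np.random.rand(R, d2, R), np.random.rand(R, d3)

    T = np.einsum('ia,ajb,bk->ijk', G1, G2, G3)        # the full tensor [[G1, G2, G3]]

    i, j, k = 1, 2, 3                                  # entry-wise form of Eq. (1):
    entry = G1[i, :] @ G2[:, j, :] @ G3[:, k]          # row vector * matrix * column vector
    assert np.allclose(T[i, j, k], entry)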

While the problem of finding the best approximation of TT-rank R of a given tensor is NP-hard (Hillar & Lim, 2013), a quasi-optimal SVD based compression algorithm (TT-SVD) has been proposed in Oseledets (2011). It is worth mentioning that the TT decomposition is invariant under change of basis: for any invertible matrix \(\mathbf {M}\) and any core tensors \({\varvec{\mathcal {G}}}_1, {\varvec{\mathcal {G}}}_2,\ldots , {\varvec{\mathcal {G}}}_p\), we have

$$\begin{aligned} \llbracket {\varvec{\mathcal {G}}}_1,\ldots , {\varvec{\mathcal {G}}}_p \rrbracket = \llbracket {\varvec{\mathcal {G}}}_1\times _{2}\mathbf {M}^{-\top }, {\varvec{\mathcal {G}}}_2\times _{1}\mathbf {M}\times _{3}\mathbf {M}^{-\top },\ldots , {\varvec{\mathcal {G}}}_{p-1}\times _{1}\mathbf {M}\times _{3}\mathbf {M}^{-\top }, {\varvec{\mathcal {G}}}_p\times _{1}\mathbf {M} \rrbracket . \end{aligned}$$

This relation appears clearly using tensor network diagrams, e.g. with \(p=4\) we haveFootnote 3:

figure d
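This invariance can also be checked numerically; the sketch below (for \(p=3\) cores and a random invertible matrix \(\mathbf {M}\), dimensions chosen arbitrarily) verifies that the two tensor train decompositions define the same tensor. Note that multiplying a mode by \(\mathbf {M}^{-\top }\) amounts to right-multiplying the corresponding core slice by \(\mathbf {M}^{-1}\):

    import numpy as np

    d, R = 3, 4
    G1, G2, G3 = np.random.rand(d, R), np.random.rand(R, d, R), np.random.rand(R, d)
    M = np.random.rand(R, R) + R * np.eye(R)           # almost surely invertible
    Minv = np.linalg.inv(M)

    T = np.einsum('ia,ajb,bk->ijk', G1, G2, G3)
    T_tilde = np.einsum('ia,ajb,bk->ijk',
                        G1 @ Minv,                                   # G1 x_2 M^{-T}
                        np.einsum('xa,ajb,by->xjy', M, G2, Minv),    # G2 x_1 M x_3 M^{-T}
                        M @ G3)                                      # G3 x_1 M
    assert np.allclose(T, T_tilde)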

2.2 Weighted automata and spectral learning

Vector-valued weighted finite automata (vv-WFA) have been introduced in Rabusseau et al. (2017) as a natural generalization of weighted automata from scalar-valued functions to vector-valued ones.

Definition 1

A p-dimensional vv-WFA with n states is a tuple \(A=({\varvec{\alpha }}, \{\mathbf {A}^{\sigma }\}_{\sigma \in \varSigma },{\varvec{\varOmega }})\) where \({\varvec{\alpha }}\in \mathbb {R}^n\) is the initial weights vector, \({\varvec{\varOmega }}\in \mathbb {R}^{p\times n}\) is the matrix of final weights, and \(\mathbf {A}^\sigma \in \mathbb {R}^{n\times n}\) is the transition matrix for each symbol \(\sigma\) in a finite alphabet \(\varSigma\). A vv-WFA A computes a function \(f_A:\varSigma ^*\rightarrow \mathbb {R}^p\) defined by

$$\begin{aligned} f_A(x) = {\varvec{\varOmega }}(\mathbf {A}^{x_1}\mathbf {A}^{x_2}\ldots \mathbf {A}^{x_k})^\top {\varvec{\alpha }}\end{aligned}$$

for each word \(x=x_1x_2\ldots x_k\in \varSigma ^*\).
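As a small illustration of Definition 1, the following sketch computes \(f_A(x)\) for a vv-WFA with randomly drawn parameters (the alphabet, dimensions and variable names are chosen for illustration only):

    import numpy as np

    n, p = 3, 2                                        # number of states, output dimension
    Sigma = ['a', 'b']
    alpha = np.random.rand(n)                          # initial weight vector
    Omega = np.random.rand(p, n)                       # final weight matrix
    A = {s: np.random.rand(n, n) for s in Sigma}       # one transition matrix per symbol

    def f_A(word):
        # f_A(x) = Omega (A^{x_1} A^{x_2} ... A^{x_k})^T alpha
        M = np.eye(n)
        for s in word:
            M = M @ A[s]
        return Omega @ (M.T @ alpha)

    print(f_A('abba'))                                 # a vector in R^p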

We call a vv-WFA minimal if its number of states is minimal, that is, any vv-WFA computing the same function has at least as many states as the minimal vv-WFA. Given a function \(f:\varSigma ^*\rightarrow \mathbb {R}^p\), we denote by \({{\,\mathrm{rank}\,}}(f)\) the number of states of a minimal vv-WFA computing f (which is set to \(\infty\) if f cannot be computed by a vv-WFA). The spectral learning algorithm is a consistent learning algorithm for weighted finite automata. It has been introduced concurrently in Hsu et al. (2009) and Bailly et al. (2009) (see Balle et al., 2014 for a comprehensive presentation of the algorithm). This algorithm relies on a fundamental object: the Hankel matrix. Given a function \(f:\varSigma ^*\rightarrow \mathbb {R}\), its Hankel matrix \(\mathbf {H}\in \mathbb {R}^{\varSigma ^*\times \varSigma ^*}\) is the bi-infinite matrix defined by

$$\begin{aligned} \mathbf {H}_{u,v} = f(uv)\ \ \ \ \text {for all }u,v\in \varSigma ^* \end{aligned}$$

where uv denotes the concatenation of the prefix u and the suffix v. The striking relation between the Hankel matrix and the rank of a function f has been well known in the formal language community (Fliess, 1974; Carlyle & Paz, 1971) and is at the heart of the spectral learning algorithm. This relation states that the rank of the Hankel matrix of a function f exactly coincides with the rank of f, i.e. the number of states of the smallest WFA computing f. In particular, the rank of the Hankel matrix of f is finite if and only if f can be computed by a weighted automaton. An example of a function which cannot be computed by a WFA is the indicator function of the language \(a^nb^n\) (on the alphabet \(\varSigma =\{a,b\}\)):

$$\begin{aligned} f(x) = {\left\{ \begin{array}{ll} 1&{} \text { if } x=a^nb^n\text { for some integer }n\\ 0&{}\text {otherwise.} \end{array}\right. } \end{aligned}$$
(2)
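This can be observed numerically: the ranks of finite sub-blocks of the Hankel matrix of this indicator function keep growing as larger sets of prefixes and suffixes are used, so no finite-rank (i.e. WFA-computable) function matches it. A brute-force illustrative sketch:

    import numpy as np
    from itertools import product

    def f(x):                                          # indicator of a^n b^n (Eq. (2))
        n = len(x) // 2
        return 1.0 if x == 'a' * n + 'b' * n else 0.0

    def hankel_block(L):                               # prefixes/suffixes of length <= L
        words = [''.join(w) for k in range(L + 1) for w in product('ab', repeat=k)]
        return np.array([[f(u + v) for v in words] for u in words])

    for L in range(1, 6):
        print(L, np.linalg.matrix_rank(hankel_block(L)))   # rank increases with L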

The spectral learning algorithm was naturally extended to vector-valued WFA in Rabusseau et al. (2017), where the Hankel matrix is replaced by the Hankel tensor \(\varvec{\mathcal {H}}\in \mathbb {R}^{\varSigma ^*\times \varSigma ^*\times p}\) of a vector-valued function \(f:\varSigma ^*\rightarrow \mathbb {R}^p\), which is defined by

$$\begin{aligned} {\varvec{\mathcal {H}}}_{u,v,:} = f(uv)\ \ \ \ \text { for all }u,v\in \varSigma ^*. \end{aligned}$$

The relation between the rank of the Hankel matrix and the function f naturally carries over to the vector-valued case and is given in the following theorem.

Theorem 1

(Rabusseau et al., 2017) Let \(f:\varSigma ^*\rightarrow \mathbb {R}^d\) and let \({\varvec{\mathcal {H}}}\) be its Hankel tensor. Then \({{\,\mathrm{rank}\,}}(f) = {{\,\mathrm{rank}\,}}({\varvec{\mathcal {H}}}_{(1)})\).

The vv-WFA learning algorithm leverages the fact that the proof of this theorem is constructive: one can recover a vv-WFA computing f from any low rank factorization of \({\varvec{\mathcal {H}}}_{(1)}\). In practice, a finite sub-block \({\varvec{\mathcal {H}}}_{P,S} \in \mathbb {R}^{P\times S\times p}\) of the Hankel tensor is used to recover the vv-WFA, where \(P,S\subset \varSigma ^*\) are finite sets of prefixes and suffixes forming a complete basis for f, i.e. such that \({{\,\mathrm{rank}\,}}(({\varvec{\mathcal {H}}}_{P,S})_{(1)}) = {{\,\mathrm{rank}\,}}({\varvec{\mathcal {H}}}_{(1)})\). Indeed, one can show that Theorem 1 still holds when replacing the Hankel tensor by such a sub-block \({\varvec{\mathcal {H}}}_{P,S}\). The spectral learning algorithm then consists of the following steps:

  1.

    Choose a target rank n and a set of prefixes and suffixes \(P,S\subset \varSigma ^*\).

  2.

    Estimate the following sub-block of the Hankel tensor from data:

    • \(\varvec{\mathcal {H}}_{P,S}\in \mathbb {R}^{P\times S\times p}\) defined by \((\varvec{\mathcal {H}}_{P,S})_{u,v,:}=f(uv)\) for all \(u\in P,v\in S\).

    • \(\mathbf {H}_{P}\in \mathbb {R}^{P\times p}\) defined by \((\mathbf {H}_{P})_{u,:}=f(u)\) for all \(u\in P\).

    • \(\mathbf {H}_{S}\in \mathbb {R}^{S\times p}\) defined by \((\mathbf {H}_{S})_{v,:}=f(v)\) for all \(v\in S\).

    • \(\varvec{\mathcal {H}}^\sigma _{P,S}\in \mathbb {R}^{P\times S\times p}\) for each \(\sigma \in \varSigma\) defined by \((\varvec{\mathcal {H}}^\sigma _{P,S})_{u,v,:}=f(u\sigma v)\) for all \(u\in P,v\in S\).

  3.

    Obtain an (approximate) low-rank factorization of the Hankel tensor (using e.g. truncated SVD)

    $$\begin{aligned} ({\varvec{\mathcal {H}}}_{P,S})_{(1)}\simeq {\varvec{\mathrm{P}}}{\varvec{\mathcal {S}}}_{(1)} \end{aligned}$$

    where \({\varvec{\mathrm{P}}}\in \mathbb {R}^{P\times n}\) and \({\varvec{\mathcal {S}}}\in \mathbb {R}^{n \times S\times p}\).

  4.

    Compute the parameters of the learned vv-WFA using the relations

    $$\begin{aligned} {\varvec{\alpha }}^\top&= \mathrm {vec}(\mathbf {H}_S)^\top ({\varvec{\mathcal {S}}}_{(1)})^\dagger \\ {\varvec{\varOmega }}^\top&={\varvec{\mathrm{P}}}^{\dagger }\mathbf {H}_P\\ \mathbf {A}^\sigma&= {\varvec{\mathrm{P}}}^\dagger ({\varvec{\mathcal {H}}}^\sigma _{P,S})_{(1)}({\varvec{\mathcal {S}}}_{(1)})^\dagger \ \text { for each }\sigma \in \varSigma . \end{aligned}$$

This learning algorithm is consistent: in the limit of infinite training data (i.e. the Hankel sub-blocks are exactly estimated from data), this algorithm is guaranteed to return a WFA that computes the target function f if \(P\) and \(S\) form a complete basis. That is, the algorithm is consistent if the rank of the sub-block \(({\varvec{\mathcal {H}}}_{P,S})_{(1)}\) is equal to the rank of the full Hankel tensor, i.e. \({{\,\mathrm{rank}\,}}(({\varvec{\mathcal {H}}}_{P,S})_{(1)}) = {{\,\mathrm{rank}\,}}({\varvec{\mathcal {H}}}_{(1)})\). More details can be found in Balle et al. (2014) for WFA and in Rabusseau et al. (2017) for vv-WFA. Using tensor network diagrams, steps 3) and 4) of the spectral learning algorithm can be represented as follows:

figure e
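The following schematic NumPy implementation of steps 3 and 4 above is given as an illustration only (function and variable names are ours); it assumes the Hankel sub-blocks have already been estimated and are available as dense arrays, and it uses C-order flattening consistently for the vectorizations and matricizations:

    import numpy as np

    def spectral_learning_vvwfa(H_PS, H_sigma, H_P, H_S, n):
        """Illustrative sketch of steps 3 and 4 of the spectral learning algorithm.

        H_PS: |P| x |S| x p array, H_sigma: dict mapping each symbol to a
        |P| x |S| x p array, H_P: |P| x p array, H_S: |S| x p array, n: target rank.
        """
        # step 3: rank-n factorization of the mode-1 matricization of the Hankel tensor
        H1 = H_PS.reshape(H_PS.shape[0], -1)               # (H_PS)_(1)
        U, s, Vt = np.linalg.svd(H1, full_matrices=False)
        P = U[:, :n] * s[:n]                               # |P| x n
        S1 = Vt[:n, :]                                     # n x (|S| * p), i.e. S_(1)
        P_pinv, S1_pinv = np.linalg.pinv(P), np.linalg.pinv(S1)
        # step 4: recover the vv-WFA parameters
        alpha = S1_pinv.T @ H_S.flatten()                  # alpha^T = vec(H_S)^T (S_(1))^+
        Omega = (P_pinv @ H_P).T                           # from H_P ~ P Omega^T
        A = {sigma: P_pinv @ Hs.reshape(Hs.shape[0], -1) @ S1_pinv
             for sigma, Hs in H_sigma.items()}
        return alpha, A, Omega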

2.3 Recurrent neural networks

Recurrent neural networks (RNN) are a class of neural networks designed to handle sequential data. A RNN takes as input a sequence (of arbitrary length) of elements from an input space \(\mathcal {X}\) and outputs an element in the output space \(\mathcal {Y}\). Thus a RNN computes a function from \(\mathcal {X}^*\), the set of all finite-length sequences of elements of \(\mathcal {X}\), to \(\mathcal {Y}\). In most applications, \(\mathcal {X}\) is a vector space, typically \(\mathbb {R}^d\). When the inputs are sequences of symbols from a finite alphabet \(\varSigma\), so-called one-hot encodings are often used to embed \(\varSigma\) into \(\mathbb {R}^{|\varSigma |}\) by representing each symbol in \(\varSigma\) by one of the canonical basis vectors. There are several ways to describe recurrent neural networks. We opt here for a relatively abstract one, which will allow us to seamlessly draw connections with the vv-WFA model presented in the previous section. We first introduce the general notion of a recurrent model, which encompasses many of the models used in machine learning to handle sequential data (most RNN architectures, hidden Markov models, WFA, etc.).

Definition 2

Let \(\mathcal {X}\) and \(\mathcal {Y}\) be the input and output space, respectively.

A recurrent model with n states is given by a tuple \(R=(\phi ,\psi ,\mathbf {h}_0)\) where \(\phi :\mathcal {X}\times \mathbb {R}^n \rightarrow \mathbb {R}^n\) is the recurrent function, \(\psi :\mathbb {R}^n\rightarrow \mathcal {Y}\) is the output function and \(\mathbf {h}_0\in \mathbb {R}^n\) is the initial state. A recurrent model R computes a function \(f_R:\mathcal {X}^*\rightarrow \mathcal {Y}\) defined by the (recurrent) relation:

$$\begin{aligned} f_R(x_1x_2\ldots x_k) = \psi (\mathbf {h}_k)\ \ \ \text {where } \mathbf {h}_t = \phi (x_t,\mathbf {h}_{t-1})\ \text {for } 1\le t\le k \end{aligned}$$

for all \(k\ge 0\) and \(x_1,x_2,\dots ,x_k\in \mathcal {X}\).

One can easily check that this definition encompasses vv-WFA. Indeed, a vv-WFA \(A=({\varvec{\alpha }}, \{\mathbf {A}^{\sigma }\}_{\sigma \in \varSigma },{\varvec{\varOmega }})\) is a recurrent model with n states where \(\mathcal {X}=\varSigma\), \(\mathcal {Y}=\mathbb {R}^p\), \(\mathbf {h}_0={\varvec{\alpha }}\) and the recurrent and output functions are given by

$$\begin{aligned} \phi (\sigma ,\mathbf {h}) = (\mathbf {A}^\sigma )^\top \mathbf {h}\ \ \ \text {and}\ \ \ \psi (\mathbf {h})= {\varvec{\varOmega }}\mathbf {h}\end{aligned}$$

for all \(\sigma \in \varSigma ,\ \mathbf {h}\in \mathbb {R}^n\). Many architectures of recurrent neural networks have been proposed and used in practice. In this paper, we focus on vanilla RNN, also known as Elman networks (Elman, 1990), and second-order RNN (2-RNN) (Giles et al., 1990; Pollack, 1991; Lee et al., 1986),Footnote 4 which can be seen as a multilinear extension of vanilla RNN. We now give the formal definitions of these two models.

Definition 3

A first-order RNN (or vanilla RNN) with n states (or, equivalently, n hidden neurons) is a recurrent model \(R=(\phi ,\psi ,\mathbf {h}_0)\) with input space \(\mathcal {X}=\mathbb {R}^d\) and output space \(\mathcal {Y}=\mathbb {R}^p\). It computes a function \(f_R:(\mathbb {R}^d)^*\rightarrow \mathbb {R}^p\) defined by \(f_R(\mathbf {x}_1,\dots , \mathbf {x}_k)=\psi (\mathbf {h}_k)\), where the recurrent and output functions are defined by

$$\begin{aligned} \mathbf {h}_t=\phi (\mathbf {x}_t,\mathbf {h}_{t-1}) = z_{rec}(\mathbf {U}\mathbf {x}_t + \mathbf {V}\mathbf {h}_{t-1})\ \ \ \text { and }\ \ \ \mathbf {y}_t=\psi (\mathbf {h}_t)=z_{out}({\varvec{{\mathrm {W}}}}\mathbf {h}_{t}). \end{aligned}$$

The parameters of a first-order RNN are:

  • the initial state \(\mathbf {h}_0\in \mathbb {R}^n\),

  • the weight matrices \(\mathbf {U}\in \mathbb {R}^{n\times d}\), \(\mathbf {V}\in \mathbb {R}^{n\times n}\) and \({\varvec{{\mathrm {W}}}}\in \mathbb {R}^{p\times n}\),

  • the activation functions \(z_{rec}:\mathbb {R}^n\rightarrow \mathbb {R}^n\) and \(z_{out}:\mathbb {R}^p\rightarrow \mathbb {R}^p\).
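For concreteness, a minimal sketch of this recurrence with randomly drawn parameters, a rectified linear unit as recurrent activation and the identity as output activation (dimensions chosen arbitrarily):

    import numpy as np

    n, d, p = 4, 3, 2
    h0 = np.zeros(n)
    U, V, W = np.random.rand(n, d), np.random.rand(n, n), np.random.rand(p, n)

    def first_order_rnn(xs):
        h = h0
        for x in xs:
            h = np.maximum(U @ x + V @ h, 0)    # h_t = z_rec(U x_t + V h_{t-1}), z_rec = ReLU
        return W @ h                            # y_k = z_out(W h_k), z_out = identity

    print(first_order_rnn([np.random.rand(d) for _ in range(5)]))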

For the sake of simplicity, we omitted the bias vectors usually included in the definition of first-order RNN. Note however that this is without loss of generality when \(z_{rec}\) is either a rectified linear unit or the identity (which will be the cases considered in this paper). Indeed, for any recurrent model with n states \(R=(\phi ,\psi ,\mathbf {h}_0)\) with input space \(\mathcal {X}=\mathbb {R}^d\) and output space \(\mathcal {Y}=\mathbb {R}^p\) defined by

$$\begin{aligned} \mathbf {h}_t=\phi (\mathbf {x}_t,\mathbf {h}_{t-1}) = z_{rec}(\mathbf {U}\mathbf {x}_t + \mathbf {V}\mathbf {h}_{t-1} + \mathrm {b})\ \ \ \text { and }\ \ \ \psi (\mathbf {h}_t)=z_{out}({\varvec{{\mathrm {W}}}}\mathbf {h}_{t} + \mathrm {c}), \end{aligned}$$

one can append a 1 to all input vectors, \({\tilde{\mathbf {x}}}_t= (\mathbf {x}_t\ 1)^\top\), and define a new recurrent model with \(n+1\) states \({\tilde{R}}=({\tilde{\phi }},{\tilde{\psi }},{\tilde{\mathbf {h}}}_0)\) with input space \(\mathcal {X}=\mathbb {R}^{d+1}\) and output space \(\mathcal {Y}=\mathbb {R}^p\) defined by

$$\begin{aligned} {\tilde{\mathbf {h}}}_t={\tilde{\phi }}({\tilde{\mathbf {x}}}_t,{\tilde{\mathbf {h}}}_{t-1}) = z_{rec}({\tilde{\mathbf {U}}}{\tilde{\mathbf {x}}}_t + {\tilde{\mathbf {V}}}{\tilde{\mathbf {h}}}_{t-1}),\ \ {\tilde{\psi }}({\tilde{\mathbf {h}}}_t)=z_{out}({\tilde{{\varvec{{\mathrm {W}}}}}}{\tilde{\mathbf {h}}}_{t})\ \ \text {and } {\tilde{\mathbf {h}}}_0=(\mathbf {h}_0\ 1)^\top \end{aligned}$$

computing the same function.

Definition 4

A second-order RNN (2-RNN) with n states is a recurrent model \(R=(\phi ,\psi ,\mathbf {h}_0)\) with input space \(\mathcal {X}=\mathbb {R}^d\) and output space \(\mathcal {Y}=\mathbb {R}^p\). It computes a function \(f_R:(\mathbb {R}^d)^*\rightarrow \mathbb {R}^p\) defined by \(f_R(\mathbf {x}_1,\dots , \mathbf {x}_k)=\psi (\mathbf {h}_k)\), where the recurrent and output functions are defined by

$$\begin{aligned} \mathbf {h}_t=\phi (\mathbf {x}_t,\mathbf {h}_{t-1}) = z_{rec}({\varvec{\mathcal {A}}}\bullet _{1}\mathbf {h}_{t-1}\bullet _{2}\mathbf {x}_t)\ \ \ \text { and }\ \ \ \mathbf {y}_t=\psi (\mathbf {h}_t)=z_{out}({\varvec{{\mathrm {W}}}}\mathbf {h}_{t}). \end{aligned}$$

The parameters of a second-order RNN are:

  • the initial state \(\mathbf {h}_0\in \mathbb {R}^n\),

  • the weight tensor \({\varvec{\mathcal {A}}}\in \mathbb {R}^{n\times d\times n}\) and output matrix \({\varvec{{\mathrm {W}}}}\in \mathbb {R}^{p\times n}\),

  • the activation functions \(z_{rec}:\mathbb {R}^n\rightarrow \mathbb {R}^n\) and \(z_{out}:\mathbb {R}^p\rightarrow \mathbb {R}^p\).

A linear 2-RNN R with n states is called minimal if its number of states is minimal (i.e. any linear 2-RNN computing \(f_R\) has at least n states).
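For concreteness, a minimal sketch of the second-order recurrence in the linear case (both activations equal to the identity), with randomly drawn parameters and arbitrary dimensions:

    import numpy as np

    n, d, p = 4, 3, 2
    h0 = np.random.rand(n)
    A = np.random.rand(n, d, n)                 # weight tensor
    W = np.random.rand(p, n)                    # output matrix

    def linear_2rnn(xs):
        h = h0
        for x in xs:
            # h_t = A bullet_1 h_{t-1} bullet_2 x_t : a bilinear map of h_{t-1} and x_t
            h = np.einsum('idj,i,d->j', A, h, x)
        return W @ h

    print(linear_2rnn([np.random.rand(d) for _ in range(5)]))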

In the remainder of the paper, we will define a second-order RNN using its parameters, i.e. \(R=(\mathbf {h}_0,{\varvec{\mathcal {A}}},{\varvec{{\mathrm {W}}}},z_{rec},z_{out})\). In the particular case where the activation functions are linear (i.e. equal to the identity function), we will omit them from the definition, e.g. \(R=(\mathbf {h}_0,{\varvec{\mathcal {A}}},{\varvec{{\mathrm {W}}}})\) defines a linear second-order RNN. The recurrent activation function \(z_{rec}\) of a RNN is usually a componentwise non-linear function such as a hyperbolic tangent or rectified linear unit, while the output activation function often depends on the task (the softmax function being the most popular for classification and language modeling tasks). One can see that the difference between first-order and second-order RNN only lies in the recurrent function. For first-order RNN, the pre-activation \(\mathrm {a}_t=\mathbf {U}\mathbf {x}_t + \mathbf {V}\mathbf {h}_{t-1} + \mathrm {b}\) is a linear function of \(\mathbf {x}_t\) and \(\mathbf {h}_{t-1}\), while for second-order RNN the pre-activation \(\mathrm {a}_t={\varvec{\mathcal {A}}}\bullet _{1}\mathbf {h}_{t-1}\bullet _{2}\mathbf {x}_t\) is a bilinear map applied to \(\mathbf {x}_t\) and \(\mathbf {h}_{t-1}\) (hence the second-order denomination). It is worth mentioning that second-order RNN are often defined with additional parameters to account for first-order interactions and bias terms:

$$\begin{aligned} \mathbf {h}_t=\phi (\mathbf {x}_t,\mathbf {h}_{t-1}) = z_{rec}({\varvec{\mathcal {A}}}\bullet _{1}\mathbf {h}_{t-1}\bullet _{2}\mathbf {x}_t + \mathbf {U}\mathbf {x}_t + \mathbf {V}\mathbf {h}_{t-1} + \mathrm {b}). \end{aligned}$$

The definition we use here is conceptually simpler and without loss of generality (similarly to the omission of the bias vectors in the definition of first-order RNN). Indeed, when \(z_{rec}\) is either the identity or a rectified linear unit, one can always append a 1 to all input vectors and augment the state space by one state to obtain a 2-RNN computing the same function. It follows from this discussion that 2-RNN are a strict generalization of vanilla RNN: any function that can be computed by a vanilla RNN can be computed by a 2-RNN (provided that a constant entry equal to one is appended to all input vectors). The recurrent and output functions of a 2-RNN can be represented by the following simple tensor networks:

figure f

By introducing the notion of recurrent models, we presented a unified view of WFA and first and second-order RNN. All these sequential models are recurrent models and differ in the way their recurrent and output functions are defined. The difference between WFA and RNN can thus be summarized by the fact that the recurrent and output functions of a WFA are linear, whereas they are non-linear maps for RNN. In essence, one could say that RNN are non-linear extensions of WFA. In Sect. 4, we will formalize this intuition by proving the exact equivalence between the classes of functions that can be computed by WFA and second-order RNN with linear activation functions.

3 Weighted automata and tensor networks

In this section, we present connections between weighted automata and tensor networks. In particular, we will show that the computation of a WFA on a sequence is intrinsically connected to the matrix product states model used in quantum physics and the tensor train decomposition. This connection will allow us to unravel a fundamental structure in the Hankel matrix of a function computed by a WFA: in addition to being low rank, we will show that the Hankel matrix can be decomposed into sub-blocks which are all matricizations of tensors with low tensor train rank. We will then leverage this structure to design an efficient spectral learning algorithm for WFA relying on efficient computations of pseudo-inverses of matrices given in the tensor train format.

3.1 Tensor train structure of the Hankel matrix

For the sake of simplicity, we will consider scalar-valued WFA in this section, but all the results we present can be straightforwardly extended to vv-WFA. Let \(A=({\varvec{\alpha }}, \{\mathbf {A}^{\sigma }\}_{\sigma \in \varSigma },{\varvec{\omega }})\) be a WFA with n states. Recall that A computes a function \(f_A: \varSigma ^* \rightarrow \mathbb {R}\) defined by

$$\begin{aligned} f_A(x_1x_2\ldots x_k) = {\varvec{\alpha }}^\top \mathbf {A}^{x_1}\mathbf {A}^{x_2}\ldots \mathbf {A}^{x_k}{\varvec{\omega }}\end{aligned}$$

for any \(k\ge 0\) and \(x_1,x_2,\ldots , x_k\in \varSigma\). The computation of a WFA on a sequence can be represented by the following tensor network:

figure g

By stacking the transition matrices \(\{\mathbf {A}^\sigma \}_{\sigma \in \varSigma }\) into a third order tensor \({\varvec{\mathcal {A}}}\in \mathbb {R}^{n\times \varSigma \times n}\) defined by

$$\begin{aligned} {\varvec{\mathcal {A}}}_{:,\sigma ,:}=\mathbf {A}^{\sigma }\ \ \ \text { for all }\sigma \in \varSigma , \end{aligned}$$

this computation can be rewritten into

figure h

This graphically shows the tight connection between WFA and the tensor train decomposition. More formally, for any integer l, let us define the lth order Hankel tensor \(\varvec{\mathcal {H}}^{(l)}\in \mathbb {R}^{\varSigma \times \varSigma \times \cdots \times \varSigma }\) by

$$\begin{aligned} \varvec{\mathcal {H}}^{(l)}_{\sigma _1,\sigma _2,\ldots ,\sigma _l} = f(\sigma _1\sigma _2\ldots \sigma _l)\text { for all }\sigma _1,\ldots \sigma _l\in \varSigma . \end{aligned}$$
(3)

Then, one can easily check that each such Hankel tensor admits the following rank n tensor train decomposition:

figure i

It follows that the Hankel matrix of a recognizable function can be decomposed into sub-blocks which are all matricizations of Hankel tensors with low tensor train rank. To the best of our knowledge, this is a novel result that has not been noticed in the past. We conclude this section by formalizing this result in the following theorem.

Theorem 2

Let \(f:\varSigma ^*\rightarrow \mathbb {R}\) be a function computed by a WFA with n states and let \(\mathbf {H}\in \mathbb {R}^{\varSigma ^*\times \varSigma ^*}\) be its Hankel matrix defined by \(\mathbf {H}_{u,v}=f(uv)\) for all \(u,v\in \varSigma ^*\). Furthermore, for any integer l, let \(\varvec{\mathcal {H}}^{(l)}\in \mathbb {R}^{\varSigma \times \varSigma \times \cdots \times \varSigma }\) be the lth order tensor defined by \(\varvec{\mathcal {H}}^{(l)}_{\sigma _1,\sigma _2,\ldots ,\sigma _l} = f(\sigma _1\sigma _2\ldots \sigma _l)\). Then, the Hankel matrix \(\mathbf {H}\) can be decomposed into sub-blocks, each sub-block being the matricization of a tensor of tensor train rank at most n. More precisely, each of these sub-blocks is equal to \({(\varvec{\mathcal {H}}^{(l)})}_{\langle \!\langle k,l-k\rangle \!\rangle }\) for some values of l and k, and each Hankel tensor \(\varvec{\mathcal {H}}^{(l)}\) has tensor train rank at most n.

Proof

For each \(m,k\in \mathbb {N}\), let \(\mathbf {H}^{(m,k)}\in \mathbb {R}^{\varSigma ^m\times \varSigma ^k}\) denote the sub-block of the Hankel matrix with prefixes \(\varSigma ^m\) and suffixes \(\varSigma ^k\). It is easy to check that the Hankel matrix \(\mathbf {H}\in \mathbb {R}^{\varSigma ^*\times \varSigma ^*}\) can be partitioned into the sub-blocks \(\mathbf {H}^{(m,k)}\) for \(m,k\in \mathbb {N}\):

$$\begin{aligned} \mathbf {H}= \left[ \begin{array}{lllll} \mathbf {H}^{(0,0)} &{} \mathbf {H}^{(0,1)} &{} \mathbf {H}^{(0,2)} &{} \mathbf {H}^{(0,3)} &{} \ldots \\ \mathbf {H}^{(1,0)} &{} \mathbf {H}^{(1,1)} &{} \mathbf {H}^{(1,2)} &{} \mathbf {H}^{(1,3)} &{} \ldots \\ \mathbf {H}^{(2,0)} &{} \mathbf {H}^{(2,1)} &{} \mathbf {H}^{(2,2)} &{} \mathbf {H}^{(2,3)} &{} \ldots \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \ddots \end{array} \right] \ . \end{aligned}$$

Now, by definition of the tensors \({\varvec{\mathcal {H}}}^{(l)}\), we have \(\mathbf {H}^{(m,k)} = {(\varvec{\mathcal {H}}^{(m+k)})}_{\langle \!\langle m,k\rangle \!\rangle }\). Moreover, let \(A=({\varvec{\alpha }}, \{\mathbf {A}^{\sigma }\}_{\sigma \in \varSigma },{\varvec{\omega }})\) be a WFA with n states computing f and let \({\varvec{\mathcal {A}}}\in \mathbb {R}^{n\times \varSigma \times n}\) be the 3rd order tensor defined by \({\varvec{\mathcal {A}}}_{:,\sigma ,:}=\mathbf {A}^{\sigma }\) for each \(\sigma \in \varSigma\). For any \(m,k\in \mathbb {N}\) and any \(\sigma _{1},\ldots ,\sigma _{m+k}\in \varSigma\), we have

$$\begin{aligned} \mathbf {H}^{(m,k)}_{\sigma _{1},\ldots ,\sigma _{m+k}}&= ({(\varvec{\mathcal {H}}^{(m+k)})}_{\langle \!\langle m,k\rangle \!\rangle })_{\sigma _{1}\ldots \sigma _m,\sigma _{m+1}\ldots \sigma _{m+k}} \\&= f(\sigma _1\sigma _2\ldots \sigma _{m+k}) \\&= {\varvec{\alpha }}^\top \mathbf {A}^{\sigma _1}\mathbf {A}^{\sigma _2}\ldots \mathbf {A}^{\sigma _{m+k}}{\varvec{\omega }}\\&= {\varvec{\alpha }}^\top {\varvec{\mathcal {A}}}_{:,\sigma _1,:}{\varvec{\mathcal {A}}}_{:,\sigma _2,:}\ldots {\varvec{\mathcal {A}}}_{:,\sigma _{m+k},:}{\varvec{\omega }}\\&= \llbracket {\varvec{\mathcal {A}}}\bullet _{1}{\varvec{\alpha }},{\varvec{\mathcal {A}}},\ldots ,{\varvec{\mathcal {A}}},{\varvec{\mathcal {A}}}\bullet _{3}{\varvec{\omega }} \rrbracket _{{\sigma _{1},\ldots ,\sigma _{m+k}}}. \end{aligned}$$

It follows that

$$\begin{aligned} \mathbf {H}^{(m,k)} = {(\varvec{\mathcal {H}}^{(m+k)})}_{\langle \!\langle m,k\rangle \!\rangle } = {({\llbracket {\varvec{\mathcal {A}}}\bullet _{1}{\varvec{\alpha }},\overbrace{{\varvec{\mathcal {A}}},\ldots ,{\varvec{\mathcal {A}}},}^{m+k-2\text { times}}{\varvec{\mathcal {A}}}\bullet _{3}{\varvec{\omega }} \rrbracket })}_{\langle \!\langle m,k\rangle \!\rangle } \end{aligned}$$

and thus that each sub-block \(\mathbf {H}^{(m,k)}\) is a matricization of a tensor of tensor train rank at most n. \(\square\)
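The identity established in this proof can easily be checked numerically; the following sketch compares a brute-force computation of the order-l Hankel tensor of a random WFA with its tensor train reconstruction (the contraction is hard-coded here for \(l=4\), and all dimensions are arbitrary):

    import numpy as np
    from itertools import product

    n, l = 3, 4
    Sigma = [0, 1]
    alpha, omega = np.random.rand(n), np.random.rand(n)
    A = np.random.rand(n, len(Sigma), n)                  # A[:, sigma, :] = A^sigma

    def f(word):                                          # f(x) = alpha^T A^{x_1} ... A^{x_l} omega
        v = alpha
        for s in word:
            v = A[:, s, :].T @ v
        return v @ omega

    H = np.array([f(w) for w in product(Sigma, repeat=l)]).reshape([len(Sigma)] * l)
    G_first = np.einsum('i,isj->sj', alpha, A)            # A contracted with alpha on mode 1
    G_last = np.einsum('isj,j->is', A, omega)             # A contracted with omega on mode 3
    H_tt = np.einsum('ab,bqc,crd,ds->aqrs', G_first, A, A, G_last)   # TT reconstruction, l = 4
    assert np.allclose(H, H_tt)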

3.2 Spectral learning in the tensor train format

We now present how the tensor train structure of the Hankel matrix can be leveraged to significantly improve the computational complexity of steps 3 and 4 of the spectral learning algorithm described in Sect. 2.2. These two steps consist in first computing a low rank approximation of the Hankel sub-block

$$\begin{aligned} \mathbf {H}_{P,S}\simeq {\varvec{\mathrm{P}}}{\varvec{\mathrm{S}}}\end{aligned}$$

before estimating the parameters of the WFA using simple pseudo-inverse and matrix product computations

$$\begin{aligned} {\varvec{\alpha }}^\top = {\varvec{\mathrm{h}}}_S^\top {\varvec{\mathrm{S}}}^\dagger ,\ {\varvec{\omega }}={\varvec{\mathrm{P}}}^{\dagger }{\varvec{\mathrm{h}}}_P\ \text {and}\ \mathbf {A}^\sigma = {\varvec{\mathrm{P}}}^\dagger \mathbf {H}^\sigma _{P,S}{\varvec{\mathrm{S}}}^\dagger \ \text { for each }\sigma \in \varSigma \end{aligned}$$

where the Hankel sub-blocks are defined by

$$\begin{aligned} ({\varvec{\mathrm{h}}}_P)_u=f(u),\ \ ({\varvec{\mathrm{h}}}_S)_v=f(v),\ \ (\mathbf {H}_{P,S})_{u,v}=f(uv)\ \text { and }\ (\mathbf {H}^\sigma _{P,S})_{u,v}=f(u\sigma v) \end{aligned}$$

for all \(\sigma \in \varSigma ,u\in P,v\in S\). Note that we again focus on scalar-valued WFA for the sake of clarity (i.e. \(A=({\varvec{\alpha }}, \{\mathbf {A}^{\sigma }\}_{\sigma \in \varSigma },{\varvec{\omega }})\)) but the results we present can be straightforwardly extended to vector-valued WFA. Using tensor networks, these two steps are described as follows:

figure j

We now focus on the case where the basis of prefixes and suffixes are both equal to the set of all sequences of length l for some integer l, i.e. \(P=S=\varSigma ^l\). A first important observation is that, in this case, the Hankel sub-block \(\mathbf {H}_{P,S}\) is a matricization of the 2l-th order Hankel tensor \(\varvec{\mathcal {H}}^{(2l)}\in \mathbb {R}^{\varSigma \times \cdots \times \varSigma }\) defined in Eq. (3):

$$\begin{aligned} \mathbf {H}_{P,S} = {(\varvec{\mathcal {H}}^{(2l)})}_{\langle \!\langle l,l\rangle \!\rangle }. \end{aligned}$$

Indeed, for \(u=u_1u_2\ldots u_l\in P=\varSigma ^l\) and \(v=v_1v_2\ldots v_l\in S=\varSigma ^l\) we have

$$\begin{aligned} (\mathbf {H}_{P,S})_{u,v}=f(uv)=f(u_1u_2\ldots u_lv_1v_2\ldots v_l)=\varvec{\mathcal {H}}^{(2l)}_{u_1,u_2,\ldots , u_l,v_1,v_2,\ldots , v_l}. \end{aligned}$$

Using the same argument, one can easily show that \({\varvec{\mathrm{h}}}_P={\varvec{\mathrm{h}}}_S=\mathrm {vec}(\varvec{\mathcal {H}}^{(l)})\). Lastly, a similar observation can be made for the Hankel sub-blocks \(\mathbf {H}_{P,S}^\sigma\) for each \(\sigma \in \varSigma\): all of these sub-blocks are slices of the Hankel tensor \(\varvec{\mathcal {H}}^{(2l+1)}\). Indeed, for any \(u=u_1u_2\ldots u_l\in P=\varSigma ^l\) and \(v=v_1v_2\ldots v_l\in S=\varSigma ^l\) we have

$$\begin{aligned} (\mathbf {H}^\sigma _{P,S})_{u,v}=f(u\sigma v)=f(u_1\ldots u_l\sigma v_1\ldots v_l)=\varvec{\mathcal {H}}^{(2l+1)}_{u_1,\ldots , u_l,\sigma ,v_1,\ldots , v_l} \end{aligned}$$

from which it follows that

$$\begin{aligned} \mathbf {H}_{P,S}^\sigma = \left( {(\varvec{\mathcal {H}}^{(2l+1)})}_{\langle \!\langle l,1,l\rangle \!\rangle }\right) _{:,\sigma ,:}\text { for all }\sigma \in \varSigma . \end{aligned}$$

Thus, in the case where \(P=S=\varSigma ^l\), all the sub-blocks of the Hankel matrix one needs to estimate for the spectral learning algorithm are matricizations of Hankel tensors of tensor train rank at most n (where n is the number of states of the target WFA). Let us assume for now that we have access to the true Hankel tensors \(\varvec{\mathcal {H}}^{(l)}\), \(\varvec{\mathcal {H}}^{(2l)}\) and \(\varvec{\mathcal {H}}^{(2l+1)}\) given in the tensor train format (how to estimate these Hankel tensors in the tensor train format from data will be discussed in Sect. 5.3):

$$\begin{aligned} \varvec{\mathcal {H}}^{(l)}&= \llbracket \varvec{{ g}}^{(l)}_1,\ldots ,\varvec{{ g}}^{(l)}_l \rrbracket \\ \varvec{\mathcal {H}}^{(2l)}&= \llbracket \varvec{{ g}}^{(2l)}_1,\ldots ,\varvec{{ g}}^{(2l)}_{2l} \rrbracket \\ \varvec{\mathcal {H}}^{(2l+1)}&= \llbracket \varvec{{ g}}^{(2l+1)}_1,\ldots ,\varvec{{ g}}^{(2l+1)}_{2l+1} \rrbracket \end{aligned}$$

where all tensor train decompositions are of rank n. We now show how the tensor train structure of the Hankel tensors can be leveraged to significantly improve the computational complexity of the spectral learning algorithm. Recall first that in the standard case, this complexity is in \({\mathcal {O}}\left( n|P||S| + n^2|P||\varSigma |\right)\) (where the first term corresponds to the truncated SVD of the Hankel matrix, and the second one to computing the transition matrices \(\mathbf {A}^\sigma\)), which is equal to \({\mathcal {O}}\left( n|\varSigma |^{2l} + n^2|\varSigma |^{l+1}\right)\) when \(P=S=\varSigma ^l\). In contrast, we will show that if the Hankel tensors are given in the tensor train format, the complexity of the spectral learning algorithm can be reduced to \({\mathcal {O}}\left( n^3l|\varSigma |\right)\). First observe that the tensor train decomposition of the Hankel tensor \(\varvec{\mathcal {H}}^{(2l)}\) already gives us the rank n factorization of the Hankel matrix \(\mathbf {H}_{P,S}={(\varvec{\mathcal {H}}^{(2l)})}_{\langle \!\langle l,l\rangle \!\rangle }\), which can easily be seen from the following tensor network:

figure k

More formally, this shows that \(\mathbf {H}_{P,S}={\varvec{\mathrm{P}}}{\varvec{\mathrm{S}}}\) with \({\varvec{\mathrm{P}}}={(\llbracket \varvec{{ g}}^{(2l)}_1,\ldots ,\varvec{{ g}}^{(2l)}_{l},\mathbf {I} \rrbracket )}_{\langle \!\langle l,1\rangle \!\rangle }\) and \({\varvec{\mathrm{S}}}={(\llbracket \mathbf {I},\varvec{{ g}}^{(2l)}_{l+1},\ldots ,\varvec{{ g}}^{(2l)}_{2l} \rrbracket )}_{\langle \!\langle 1,l\rangle \!\rangle }\). The remaining step of the spectral learning algorithm consists in computing the pseudo-inverses of \({\varvec{\mathrm{P}}}\) and \({\varvec{\mathrm{S}}}\) and performing various matrix products involving the Hankel sub-blocks \({\varvec{\mathrm{h}}}_P\), \({\varvec{\mathrm{h}}}_S\) and \(\mathbf {H}_{P,S}^\sigma\) for each \(\sigma \in \varSigma\). Observe that all the elements involved in these computations are tensors of tensor train rank at most n (or matricizations of such tensors). It turns out that all these operations can be performed efficiently in the tensor train format: the pseudo-inverses of \({\varvec{\mathrm{P}}}\) and \({\varvec{\mathrm{S}}}\) in the tensor train format can be computed in time \({\mathcal {O}}\left( n^3l|\varSigma |\right)\) and all the matrix products between \({\varvec{\mathrm{P}}}^\dagger\) and \({\varvec{\mathrm{S}}}^\dagger\) and the Hankel tensors can also be done in time \({\mathcal {O}}\left( n^3l|\varSigma |\right)\). Describing these tensor train computations in detail goes beyond the scope of this paper, but these algorithms are well known in the tensor train and matrix product states communities. We refer the reader to Oseledets (2011) for efficient computations of matrix products in the tensor train format, and to Gelß (2017) and Klus et al. (2018) for the computation of pseudo-inverses in the tensor train format.

We showed that in the case where \(P=S=\varSigma ^l\), the time complexity of the last two steps of the spectral learning algorithm can be reduced from an exponential dependency on l to a linear one. This is achieved by leveraging the tensor train structure of the Hankel sub-blocks. However, recall that the spectral learning algorithm is consistent (i.e. guaranteed to return the target WFA from an infinite amount of training data) only if \(P\) and \(S\) form a complete basis, that is \(P\) and \(S\) are such that \({{\,\mathrm{rank}\,}}(({\varvec{\mathcal {H}}}_{P,S})_{(1)}) = {{\,\mathrm{rank}\,}}(\mathbf {H})\). In the case where \(P=S=\varSigma ^l\), this condition is equivalent to \({{\,\mathrm{rank}\,}}({({\varvec{\mathcal {H}}}^{(2l)})}_{\langle \!\langle l,l\rangle \!\rangle }) = {{\,\mathrm{rank}\,}}(\mathbf {H})\). But it is not the case that, for any function computed by a WFA, there exists an integer l such that \(P=S=\varSigma ^l\) form a complete basis. Indeed, consider for example the function f on the alphabet \(\{a,b\}\) defined by \(f(x)=1\) if \(x=aa\) and 0 otherwise. One can easily show that there exists a minimal WFA with 3 states computing f. However, it is easy to check that \({{\,\mathrm{rank}\,}}({({\varvec{\mathcal {H}}}^{(2l)})}_{\langle \!\langle l,l\rangle \!\rangle })\) is equal to 1 for \(l=1\) and to 0 for any other value of l. This implies that not all functions can be consistently recovered from training data using the efficient spectral learning algorithm we propose. Luckily, this caveat can be addressed using a simple workaround.
For any function \(f:\varSigma ^*\rightarrow \mathbb {R}\), one can define a new alphabet \({\tilde{\varSigma }}=\varSigma \cup \{{\underline{\lambda }}\}\) where \({\underline{\lambda }}\) is a new symbol not in \(\varSigma\) which will be treated as the empty string. One can then extend f to \({\tilde{f}}:{\tilde{\varSigma }}^*\rightarrow \mathbb {R}\) naturally by ignoring the new symbol \({\underline{\lambda }}\), e.g. \({\tilde{f}}({\underline{\lambda }}ab{\underline{\lambda }}c)=f(abc)\). Let \(\mathbf {H}\in \mathbb {R}^{\varSigma ^*\times \varSigma ^*}\), \({\tilde{\mathbf {H}}}\in \mathbb {R}^{{\tilde{\varSigma }}^*\times {\tilde{\varSigma }}^*}\), \(\varvec{\mathcal {H}}^{(2l)}\in \mathbb {R}^{\varSigma \times \cdots \times \varSigma }\) and \({\tilde{\varvec{\mathcal {H}}}}^{(2l)}\in \mathbb {R}^{{\tilde{\varSigma }}\times \cdots \times {\tilde{\varSigma }}}\) be the Hankel matrices and tensors of f and \({\tilde{f}}\). Then, one can show that if f can be computed by a WFA with n states, there always exists an integer l such that \({{\,\mathrm{rank}\,}}({({\tilde{{\varvec{\mathcal {H}}}}}^{(2l)})}_{\langle \!\langle l,l\rangle \!\rangle })={{\,\mathrm{rank}\,}}(\mathbf {H})=n\). Indeed, in contrast with the Hankel tensor \(\varvec{\mathcal {H}}^{(2l)}\), which only contains the values of f on sequences of length exactly 2l, the Hankel tensor \({\tilde{\varvec{\mathcal {H}}}}^{(2l)}\) contains the values of f on all sequences of length smaller than or equal to 2l. One potential workaround would thus consist in estimating the Hankel sub-blocks of \({\tilde{f}}\) from data generated by f and performing steps 3 and 4 of the spectral learning algorithm to recover the parameters of a WFA computing \({\tilde{f}}\). The transition matrix associated with the new symbol \({\underline{\lambda }}\) can be discarded to obtain the parameters of a WFA estimating f (note that since the spectral learning algorithm is consistent, the transition matrix associated with the new symbol \({\underline{\lambda }}\) estimated from data is guaranteed to converge to the identity matrix as the amount of training data increases). In practice, to estimate a Hankel tensor of length L, one could pad every sequence in the dataset of length at most L with \({\underline{\lambda }}\) until it reaches length L and then perform the standard Hankel recovery routine. It is worth mentioning that we did not have to use this workaround for any of the experiments presented in Sect. 6. More importantly, one can show that if the parameters of a 2-RNN are drawn randomly, then the workaround discussed above is not necessary (i.e., one can consistently recover a random 2-RNN from data using the learning algorithm we propose), as shown in the following proposition.

Proposition 1

Let \(\mathcal {A}= \langle {\varvec{\alpha }}, {\varvec{\mathcal {A}}}, {\varvec{\omega }}\rangle\) be a 2-RNN with n states whose parameters are randomly drawn from a continuous distribution (w.r.t. the Lebesgue measure) and let \(\mathbf {H}\in \mathbb {R}^{\varSigma ^*\times \varSigma ^*}\) be its Hankel matrix. Then, with probability one, \({{\,\mathrm{rank}\,}}({({\varvec{\mathcal {H}}}^{(2l)})}_{\langle \!\langle l,l\rangle \!\rangle }) = {{\,\mathrm{rank}\,}}(\mathbf {H})\) for any l such that \(|\varSigma |^l\ge n\) (where \({\varvec{\mathcal {H}}}^{(2l)}\) is as defined in Eq. (3)).

Proof

Let \(\mathbf {F}_l\in {\mathbb {R}}^{\varSigma ^l\times n}\) and \(\mathbf {B}_l\in {\mathbb {R}}^{n\times \varSigma ^l}\) be the forward and backward matrices of the random 2-RNN, that is the rows of \(\mathbf {F}_l\) are \({\varvec{\alpha }}^\top {\varvec{\mathcal {A}}}_{:,u_1,:}\ldots {\varvec{\mathcal {A}}}_{:,u_l,:}\) for \(u_1\ldots u_l \in \varSigma ^l\) and the columns of \(\mathbf {B}_l\) are \({\varvec{\mathcal {A}}}_{:,v_1,:}\ldots {\varvec{\mathcal {A}}}_{:,v_l,:}{\varvec{\omega }}\) for \(v_1\ldots v_l \in \varSigma ^l\). Let l be any integer such that \(|\varSigma |^l \ge n\). We first show that both \(\mathbf {F}_l\) and \(\mathbf {B}_l\) are full rank with probability one. Observe that \(\det (\mathbf {F}_l^\top \mathbf {F}_l)\) is a polynomial of the 2-RNN parameters \({\varvec{\alpha }}\) and \({\varvec{\mathcal {A}}}\). Since a polynomial is either zero or non-zero almost everywhere (Caron & Traynor, 2005), and since one can easily find a 2-RNN such that \(\det (\mathbf {F}_l^\top \mathbf {F}_l)\ne 0\) (using the fact that \(|\varSigma |^l \ge n\)), it follows that \(\det (\mathbf {F}_l^\top \mathbf {F}_l)\) is non-zero almost everywhere. Consequently, since the parameters \({\varvec{\alpha }}\) and \({\varvec{\mathcal {A}}}\) are drawn from a continuous distribution, \(\det (\mathbf {F}_l^\top \mathbf {F}_l)\ne 0\) with probability one, i.e. \(\mathbf {F}_l\) is of rank n with probability one. With a similar argument, one can show that \(\mathbf {B}_l\) is of rank n with probability one. To conclude, since \({({\varvec{\mathcal {H}}}^{(2l)})}_{\langle \!\langle l,l\rangle \!\rangle } = \mathbf {F}_l\mathbf {B}_l\) where both \(\mathbf {F}_l\) and \(\mathbf {B}_l\) have rank n, it follows that \({({\varvec{\mathcal {H}}}^{(2l)})}_{\langle \!\langle l,l\rangle \!\rangle }\) has rank n with probability one. \(\square\)
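
As a sanity check, the argument above can be verified numerically. The following minimal sketch (ours; the dimensions and the seed are arbitrary) draws a random WFA/2-RNN and checks that the forward and backward matrices, and hence the Hankel matricization, have rank n:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, sigma = 3, 2          # number of states and alphabet size (arbitrary choices)
l = 2                    # chosen so that sigma**l >= n

# Random 2-RNN / WFA parameters drawn from a continuous distribution
alpha = rng.normal(size=n)
A = rng.normal(size=(n, sigma, n))   # A[:, s, :] is the transition matrix of symbol s
omega = rng.normal(size=n)

# Forward matrix F_l: rows alpha^T A^{u_1} ... A^{u_l} for u in Sigma^l
# Backward matrix B_l: columns A^{v_1} ... A^{v_l} omega for v in Sigma^l
words = list(product(range(sigma), repeat=l))
F = np.stack([np.linalg.multi_dot([A[:, u, :] for u in w]).T @ alpha for w in words])
B = np.stack([np.linalg.multi_dot([A[:, v, :] for v in w]) @ omega for w in words], axis=1)

H = F @ B   # the <<l,l>> matricization of the Hankel tensor H^(2l)
print(np.linalg.matrix_rank(F), np.linalg.matrix_rank(B), np.linalg.matrix_rank(H))
# With probability one, all three ranks equal n = 3
```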

4 Weighted automata and second-order recurrent neural networks

In this section, we present an equivalence result between weighted automata and second-order RNN with linear activation functions (linear 2-RNN). This result rigorously formalizes the idea that WFA are linear RNN. Recall that a 2-RNN \(R=({\varvec{\alpha }},{\varvec{\mathcal {A}}},{\varvec{\varOmega }})\) maps any sequence of inputs \(\mathbf {x}_1,\ldots ,\mathbf {x}_k\in \mathbb {R}^d\) to a sequence of outputs \(\mathbf {y}_1,\ldots ,\mathbf {y}_k\in \mathbb {R}^p\) defined for any \(t=1,\ldots ,k\) by

$$\begin{aligned} \mathbf {y}_t = z_2({\varvec{\varOmega }}\mathbf {h}_t) \text { with }\mathbf {h}_t = z_1({\varvec{\mathcal {A}}}\bullet _{1}\mathbf {h}_{t-1}\bullet _{2}\mathbf {x}_t) \end{aligned}$$
(4)

where \(z_1:\mathbb {R}^n\rightarrow \mathbb {R}^n\) and \(z_2:\mathbb {R}^p\rightarrow \mathbb {R}^p\) are activation functions and the initial hidden state is \(\mathbf {h}_0={\varvec{\alpha }}\). We think of a 2-RNN as computing a function \(f_R:(\mathbb {R}^d)^*\rightarrow \mathbb {R}^p\) mapping each input sequence \(\mathbf {x}_1,\ldots ,\mathbf {x}_k\) to the corresponding final output \(\mathbf {y}_k\). While \(z_1\) and \(z_2\) are usually non-linear component-wise functions, we consider here the case where both \(z_1\) and \(z_2\) are the identity, and we refer to the resulting model as a linear 2-RNN. Observe that for a linear 2-RNN R, the function \(f_R\) is multilinear in the sense that, for any integer l, its restriction to the domain \((\mathbb {R}^d)^l\) is multilinear. Another useful observation is that linear 2-RNN are invariant under change of basis: for any invertible matrix \({\varvec{\mathrm{P}}}\), the linear 2-RNN \({\tilde{R}}=({\varvec{\mathrm{P}}}^{-\top }{\varvec{\alpha }},{\varvec{\mathcal {A}}}\times _{1}{\varvec{\mathrm{P}}}\times _{3}{\varvec{\mathrm{P}}}^{-\top },{\varvec{\varOmega }}{\varvec{\mathrm{P}}}^{\top })\) is such that \(f_{{\tilde{R}}}=f_R\). One can easily show that the computation of the linear 2-RNN \(R=({\varvec{\alpha }},{\varvec{\mathcal {A}}},{\varvec{\varOmega }})\) boils down to the following tensor network (see proof of Theorem 3):

(Figure: tensor network representing the computation of a linear 2-RNN on an input sequence)
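
In code, the recurrence of Eq. (4) with identity activations is simply a sequence of bilinear contractions. The following minimal sketch (ours; the dimensions and names are illustrative) makes this explicit using NumPy:

```python
import numpy as np

def linear_2rnn(alpha, A, Omega, xs):
    """Forward pass of a linear 2-RNN (alpha, A, Omega) on a sequence xs.

    alpha: (n,) initial hidden state, A: (n, d, n) transition tensor,
    Omega: (p, n) output matrix, xs: list of (d,) input vectors.
    Returns the final output y_k = Omega h_k (identity activations).
    """
    h = alpha
    for x in xs:
        # h_t = A ._1 h_{t-1} ._2 x_t  (contraction over the first two modes)
        h = np.einsum('idj,i,d->j', A, h, x)
    return Omega @ h

# Example with arbitrary dimensions n=4, d=3, p=2
rng = np.random.default_rng(1)
alpha, A, Omega = rng.normal(size=4), rng.normal(size=(4, 3, 4)), rng.normal(size=(2, 4))
print(linear_2rnn(alpha, A, Omega, [rng.normal(size=3) for _ in range(5)]))
```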

This computation is essentially identical to that of a vv-WFA \(A=({\varvec{\alpha }}, \{ \mathbf {A}^{ \sigma }\}_{\sigma \in \varSigma },{\varvec{\varOmega }})\). Indeed, as we showed in the previous section, by stacking the transition matrices \(\{\mathbf {A}^\sigma \}_{\sigma \in \varSigma }\) into a third order tensor \({\varvec{\mathcal {A}}}\in \mathbb {R}^{n\times \varSigma \times n}\), the computation of the vv-WFA A can be written as

(Figure: tensor network representing the computation of the vv-WFA A on an input word)

Thus, if we restrict the input vectors of a linear 2-RNN to be one-hot encodings (i.e. vectors of the canonical basis), the two models are strictly equivalent. These observations unravel a fundamental connection between vv-WFA and linear 2-RNN: the two models are expressively equivalent for representing functions defined over sequences of discrete symbols. Moreover, both models have the same capacity in the sense that there is a direct correspondence between the hidden units of a linear 2-RNN and the states of a vv-WFA computing the same function. More formally, we have the following theorem.

Theorem 3

Any function that can be computed by a vv-WFA with n states can be computed by a linear 2-RNN with n hidden units. Conversely, any function that can be computed by a linear 2-RNN with n hidden units on sequences of one-hot vectors (i.e. canonical basis vectors) can be computed by a vv-WFA with n states. More precisely, the vv-WFA \(A=({\varvec{\alpha }}, \{ \mathbf {A}^{ \sigma }\}_{\sigma \in \varSigma },{\varvec{\varOmega }})\) with n states and the linear 2-RNN \(M=({\varvec{\alpha }},{\varvec{\mathcal {A}}},{\varvec{\varOmega }})\) with n hidden units, where \({\varvec{\mathcal {A}}}\in \mathbb {R}^{n\times \varSigma \times n}\) is defined by \({\varvec{\mathcal {A}}}_{:,\sigma ,:}=\mathbf {A}^\sigma\) for all \(\sigma \in \varSigma\), are such that \(f_A(\sigma _1\sigma _2\ldots \sigma _k) = f_M(\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_k)\) for all sequences of input symbols \(\sigma _1,\ldots ,\sigma _k\in \varSigma\), where for each \(i\in [k]\) the input vector \(\mathbf {x}_i\in \mathbb {R}^\varSigma\) is the one-hot encoding of the symbol \(\sigma _i\).

Proof

We first show by induction on k that, for any sequence \(\sigma _1\ldots \sigma _k\in \varSigma ^*\), the hidden state \(\mathbf {h}_k\) computed by M (see Eq. (4)) on the corresponding one-hot encoded sequence \(\mathbf {x}_1,\ldots ,\mathbf {x}_k\in \mathbb {R}^{\varSigma }\) satisfies \(\mathbf {h}_k = (\mathbf {A}^{\sigma _1}\ldots \mathbf {A}^{\sigma _k})^\top {\varvec{\alpha }}\). The case \(k=0\) is immediate. Suppose the result holds for sequences of length up to k. One can easily check that \({\varvec{\mathcal {A}}}\bullet _{2}\mathbf {x}_i = \mathbf {A}^{\sigma _i}\) for any index i. Using the induction hypothesis, it then follows that

$$\begin{aligned} \mathbf {h}_{k+1}&= {\varvec{\mathcal {A}}}\bullet _{1}\mathbf {h}_k \bullet _{2} \mathbf {x}_{k+1} = \mathbf {A}^{\sigma _{k+1}}\bullet _{1} \mathbf {h}_k = (\mathbf {A}^{\sigma _{k+1}})^\top \mathbf {h}_k\\&= (\mathbf {A}^{\sigma _{k+1}})^\top (\mathbf {A}^{\sigma _1}\ldots \mathbf {A}^{\sigma _k})^\top {\varvec{\alpha }}= (\mathbf {A}^{\sigma _1}\ldots \mathbf {A}^{\sigma _{k+1}})^\top {\varvec{\alpha }}. \end{aligned}$$

To conclude, we thus have

$$\begin{aligned} f_M(\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_k) = {\varvec{\varOmega }}\mathbf {h}_{k} = {\varvec{\varOmega }}(\mathbf {A}^{\sigma _1}\ldots \mathbf {A}^{\sigma _{k}})^\top {\varvec{\alpha }}= f_A(\sigma _1\sigma _2\ldots \sigma _k). \end{aligned}$$

\(\square\)
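
As an illustration, the following sketch (ours, with an arbitrary random automaton and small dimensions) checks this equivalence numerically: stacking the transition matrices into a third-order tensor and feeding one-hot encodings to the resulting linear 2-RNN reproduces the values computed by the automaton.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_symbols, p = 3, 4, 2           # states, alphabet size, output dimension (arbitrary)

# Random vv-WFA A = (alpha, {A^sigma}, Omega)
alpha = rng.normal(size=n)
A_sigma = rng.normal(size=(n_symbols, n, n))       # A_sigma[s] is the matrix A^s
Omega = rng.normal(size=(p, n))

def f_wfa(word):
    """f_A(sigma_1 ... sigma_k) = Omega (A^{sigma_1} ... A^{sigma_k})^T alpha."""
    h = alpha
    for s in word:
        h = A_sigma[s].T @ h
    return Omega @ h

# Corresponding linear 2-RNN: A[:, s, :] = A^s, inputs are one-hot encodings
A_tensor = np.transpose(A_sigma, (1, 0, 2))        # shape (n, n_symbols, n)

def f_2rnn(word):
    h = alpha
    for s in word:
        x = np.eye(n_symbols)[s]                   # one-hot encoding of the symbol
        h = np.einsum('idj,i,d->j', A_tensor, h, x)
    return Omega @ h

word = [1, 3, 0, 2, 2]
print(np.allclose(f_wfa(word), f_2rnn(word)))      # True
```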

This result first implies that linear 2-RNN defined over sequences of discrete symbols (using one-hot encoding) can be provably learned using the spectral learning algorithm for WFA/vv-WFA; indeed, these algorithms have been proved to compute consistent estimators. Let us stress again that, contrary to the case of feed-forward architectures, learning recurrent networks with linear activation functions is not a trivial task. Furthermore, Theorem 3 reveals that linear 2-RNN are a natural generalization of classical weighted automata to functions defined over sequences of continuous vectors (instead of discrete symbols). This naturally raises the question of whether the spectral learning algorithms for WFA and vv-WFA can be extended to the general setting of linear 2-RNN; we show in the next section that the answer is positive.

5 Spectral learning of continuous weighted automata

In this section, we extend the learning algorithm for vv-WFA to linear 2-RNN, thus at the same time addressing the limitation of the spectral learning algorithm to discrete inputs and providing the first consistent learning algorithm for linear second-order RNN.

5.1 Recovering 2-RNN from Hankel tensors

We first present an identifiability result showing how one can recover a linear 2-RNN computing a function \(f:(\mathbb {R}^d)^*\rightarrow \mathbb {R}^p\) from observable tensors extracted from some Hankel tensor associated with f. Intuitively, we obtain this result by reducing the problem to the one of learning a vv-WFA. This is done by considering the restriction of f to canonical basis vectors; loosely speaking, since the domain of this restricted function is isomorphic to \([d]^*\), this allows us to fall back onto the setting of sequences of discrete symbols. It is not immediately clear how the notion of Hankel matrix can be extended to a function \(f:(\mathbb {R}^d)^*\rightarrow \mathbb {R}^p\) taking sequences of continuous vectors as input. One natural way to proceed is to consider how f acts on sequences of vectors from the canonical basis. Given a function \(f:(\mathbb {R}^d)^*\rightarrow \mathbb {R}^p\), we define its Hankel tensor \({\varvec{\mathcal {H}}}_f\in \mathbb {R}^{[d]^* \times [d]^* \times p}\) by

$$\begin{aligned} ({\varvec{\mathcal {H}}}_f)_{i_1\ldots i_s, j_1\ldots j_t,:} = f(\mathrm {e}_{i_1},\ldots ,\mathrm {e}_{i_s},\mathrm {e}_{j_1},\ldots ,\mathrm {e}_{j_t}), \end{aligned}$$

for all \(i_1,\ldots ,i_s,j_1,\ldots ,j_t\in [d]\); note that \({\varvec{\mathcal {H}}}_f\) is infinite in its first two modes. It is easy to see that \({\varvec{\mathcal {H}}}_f\) is also the Hankel tensor associated with the function \({\tilde{f}}:[d]^* \rightarrow \mathbb {R}^p\) mapping any sequence \(i_1i_2\ldots i_k\in [d]^*\) to \(f(\mathrm {e}_{i_1},\ldots ,\mathrm {e}_{i_k})\). Moreover, in the special case where f can be computed by a linear 2-RNN, one can use the multilinearity of f to show that

$$\begin{aligned} f(\mathbf {x}_1,\ldots ,\mathbf {x}_k) = \sum _{i_1,\ldots ,i_k = 1}^d (\mathbf {x}_1)_{i_1}\ldots (\mathbf {x}_k)_{i_k} {\tilde{f}}(i_1\ldots i_k). \end{aligned}$$

This gives us some intuition on how one could learn f by learning a vv-WFA computing \({\tilde{f}}\) using the spectral learning algorithm. That is, assuming access to the sub-blocks of the Hankel tensor \({\varvec{\mathcal {H}}}_f\) for a complete basis of prefixes and suffixes \(P,S\subseteq [d]^*\), the spectral learning algorithm can be used to recover a vv-WFA computing \({\tilde{f}}\) and consequently a linear 2-RNN computing f using Theorem 3.

We now state the main result of this section, showing that a (minimal) linear 2-RNN computing a function \(f:(\mathbb {R}^d)^*\rightarrow \mathbb {R}^p\) can be exactly recovered from sub-blocks of the Hankel tensor \({\varvec{\mathcal {H}}}_f\). For the sake of clarity, we present the learning algorithm for the particular case where there exists an L such that the prefix and suffix sets consisting of all sequences of length L, that is \(P= S= [d]^L\), form a complete basis for \({\tilde{f}}\) (i.e. the sub-block \({\varvec{\mathcal {H}}}_{P,S}\in \mathbb {R}^{[d]^L\times [d]^L\times p}\) of the Hankel tensor \({\varvec{\mathcal {H}}}_f\) is such that \({{\,\mathrm{rank}\,}}(({\varvec{\mathcal {H}}}_{P,S})_{(1)}) = {{\,\mathrm{rank}\,}}(({\varvec{\mathcal {H}}}_f)_{(1)})\)). As discussed in Sect. 3.2, such an integer L does not always exist even when the underlying function f can be computed by a linear 2-RNN. However, the workaround described at the end of Sect. 3.2 can be used here as well to extend this theorem to the case of any function f that can be computed by a linear 2-RNN. The following theorem can be seen as a reformulation of the classical spectral learning theorem using the low rank Hankel tensors \(\varvec{\mathcal {H}}^{(l)}\) introduced in Sect. 3.1. In the case of a function \(f:(\mathbb {R}^d)^*\rightarrow \mathbb {R}^p\) defined over sequences of continuous vectors, for any integer l, the finite tensor \({\varvec{\mathcal {H}}}^{(l)}_f\in \mathbb {R}^{ d\times \cdots \times d\times p}\) of order \(l+1\) is defined by

$$\begin{aligned} ({\varvec{\mathcal {H}}}^{(l)}_f)_{i_1,\ldots ,i_l,:} = f(\mathrm {e}_{i_1},\ldots ,\mathrm {e}_{i_l}) \ \ \ \text {for all } i_1,\ldots ,i_l\in [d]. \end{aligned}$$

Observe that for any integer l, the tensor \({\varvec{\mathcal {H}}}^{(l)}_f\) can be obtained by reshaping a finite sub-block of the Hankel tensor \({\varvec{\mathcal {H}}}_f\).
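
For concreteness, the finite Hankel tensor \({\varvec{\mathcal {H}}}^{(l)}_f\) can be populated by evaluating f on all length-l sequences of canonical basis vectors. The sketch below (ours; the helper name, toy function and dimensions are arbitrary) does exactly this:

```python
import numpy as np
from itertools import product

def hankel_tensor(f, d, p, l):
    """Build H^(l)_f of shape (d, ..., d, p) with entries f(e_{i_1}, ..., e_{i_l})."""
    E = np.eye(d)                       # canonical basis vectors of R^d
    H = np.zeros((d,) * l + (p,))
    for idx in product(range(d), repeat=l):
        H[idx] = f([E[i] for i in idx])
    return H

# Example with a toy multilinear function f : (R^2)^* -> R (our own choice)
f = lambda xs: np.array([np.prod([x.sum() for x in xs])])
H2 = hankel_tensor(f, d=2, p=1, l=2)
print(H2.shape)   # (2, 2, 1)
```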

Theorem 4

Let \(f:(\mathbb {R}^d)^*\rightarrow \mathbb {R}^p\) be a function computed by a minimal linear 2-RNN with n hidden units and let L be an integer such that

$$\begin{aligned} {{\,\mathrm{rank}\,}}({({\varvec{\mathcal {H}}}^{(2L)}_f)}_{\langle \!\langle L,L+1\rangle \!\rangle }) = n. \end{aligned}$$

Then, for any \({\varvec{\mathrm{P}}}\in \mathbb {R}^{d^L\times n}\) and \({\varvec{\mathrm{S}}}\in \mathbb {R}^{n\times d^Lp}\) such that

$$\begin{aligned} {({\varvec{\mathcal {H}}}^{(2L)}_f)}_{\langle \!\langle L,L+1\rangle \!\rangle } = {\varvec{\mathrm{P}}}{\varvec{\mathrm{S}}}, \end{aligned}$$

the linear 2-RNN \(R=({\varvec{\alpha }},{\varvec{\mathcal {A}}},{\varvec{\varOmega }})\) defined by

$$\begin{aligned} {\varvec{\alpha }}= ({\varvec{\mathrm{S}}}^\dagger )^\top {({\varvec{\mathcal {H}}}^{(L)}_f)}_{\langle \!\langle L+1\rangle \!\rangle }, \ \ \ \ {\varvec{\varOmega }}^\top = {\varvec{\mathrm{P}}}^\dagger {({\varvec{\mathcal {H}}}^{(L)}_f)}_{\langle \!\langle L,1\rangle \!\rangle } \\ {\varvec{\mathcal {A}}}= ({({\varvec{\mathcal {H}}}^{(2L+1)}_f)}_{\langle \!\langle L,1,L+1\rangle \!\rangle })\times _{1}{\varvec{\mathrm{P}}}^\dagger \times _{3}({\varvec{\mathrm{S}}}^\dagger )^\top \end{aligned}$$

is a minimal linear 2-RNN computing f.

Proof

Let \({\varvec{\mathrm{P}}}\in \mathbb {R}^{d^L\times n}\) and \({\varvec{\mathrm{S}}}\in \mathbb {R}^{n\times d^Lp}\) be such that \({({\varvec{\mathcal {H}}}^{(2L)}_f)}_{\langle \!\langle L,L+1\rangle \!\rangle } = {\varvec{\mathrm{P}}}{\varvec{\mathrm{S}}}\) and let \(R^\star =({\varvec{\alpha }}^\star ,{\varvec{\mathcal {A}}}^\star ,{\varvec{\varOmega }}^\star )\) be a minimal linear 2-RNN computing f. Define the tensors

$$\begin{aligned} {\varvec{\mathcal {P}}}^{\star } = \llbracket {\varvec{\mathcal {A}}}^\star \bullet _{1}{\varvec{\alpha }}^\star , \underbrace{{\varvec{\mathcal {A}}}^\star , \ldots , {\varvec{\mathcal {A}}}^\star }_{L-1\text { times}}, \mathbf {I}_n \rrbracket \in \mathbb {R}^{d\times \cdots \times d\times n} \end{aligned}$$

and

$$\begin{aligned} {\varvec{\mathcal {S}}}^\star = \llbracket \mathbf {I}_n,\underbrace{{\varvec{\mathcal {A}}}^\star , \ldots , {\varvec{\mathcal {A}}}^\star }_{L\text { times}}, ({\varvec{\varOmega }}^\star )^\top \rrbracket \in \mathbb {R}^{n\times d\times \cdots \times d\times p} \end{aligned}$$

of order \(L+1\) and \(L+2\) respectively, and let \({\varvec{\mathrm{P}}}^\star = {({\varvec{\mathcal {P}}}^\star )}_{\langle \!\langle L,1\rangle \!\rangle } \in \mathbb {R}^{d^L\times n}\) and \({\varvec{\mathrm{S}}}^\star = {({\varvec{\mathcal {S}}}^\star )}_{\langle \!\langle 1,L+1\rangle \!\rangle } \in \mathbb {R}^{n\times d^Lp}\). Using the identity \({\varvec{\mathcal {H}}}^{(l)}_f = \llbracket {\varvec{\mathcal {A}}}^\star \bullet _{1}{\varvec{\alpha }}^\star , \underbrace{{\varvec{\mathcal {A}}}^\star , \ldots , {\varvec{\mathcal {A}}}^\star }_{l-1\text { times}}, ({\varvec{\varOmega }}^\star )^\top \rrbracket\) for any l, one can easily check the following identities (see also Sect. 3.1):

$$\begin{aligned} {({\varvec{\mathcal {H}}}^{(2L)}_f)}_{\langle \!\langle L,L+1\rangle \!\rangle } = {\varvec{\mathrm{P}}}^\star {\varvec{\mathrm{S}}}^\star ,\ \ \ \ {({\varvec{\mathcal {H}}}^{(2L+1)}_f)}_{\langle \!\langle L,1,L+1\rangle \!\rangle }= {\varvec{\mathcal {A}}}^\star \times _{1} {\varvec{\mathrm{P}}}^\star \times _{3} ({\varvec{\mathrm{S}}}^\star )^\top ,\\ {({\varvec{\mathcal {H}}}^{(L)}_f)}_{\langle \!\langle L,1\rangle \!\rangle } = {\varvec{\mathrm{P}}}^\star ({\varvec{\varOmega }}^\star )^\top , \ \ \ \ \ \ \ {({\varvec{\mathcal {H}}}^{(L)}_f)}_{\langle \!\langle L+1\rangle \!\rangle } = ({\varvec{\mathrm{S}}}^\star )^\top {\varvec{\alpha }}^\star . \end{aligned}$$

Let \(\mathbf {M}= {\varvec{\mathrm{P}}}^\dagger {\varvec{\mathrm{P}}}^\star\). We will show that \({\varvec{\alpha }}= \mathbf {M}^{-\top }{\varvec{\alpha }}^\star\), \({\varvec{\mathcal {A}}}= {\varvec{\mathcal {A}}}^\star \times _{1}\mathbf {M}\times _{3}\mathbf {M}^{-\top }\) and \({\varvec{\varOmega }}^\top = \mathbf {M}({\varvec{\varOmega }}^\star )^\top\), which will entail the result since linear 2-RNN are invariant under change of basis. First observe that \(\mathbf {M}^{-1}= {\varvec{\mathrm{S}}}^\star {\varvec{\mathrm{S}}}^\dagger\). Indeed, we have \({\varvec{\mathrm{P}}}^\dagger {\varvec{\mathrm{P}}}^\star {\varvec{\mathrm{S}}}^\star {\varvec{\mathrm{S}}}^\dagger = {\varvec{\mathrm{P}}}^\dagger {({\varvec{\mathcal {H}}}^{(2L)}_f)}_{\langle \!\langle L,L+1\rangle \!\rangle }{\varvec{\mathrm{S}}}^\dagger = {\varvec{\mathrm{P}}}^\dagger {\varvec{\mathrm{P}}}{\varvec{\mathrm{S}}}{\varvec{\mathrm{S}}}^\dagger = \mathbf {I}\), where we used the fact that \({\varvec{\mathrm{P}}}\) (resp. \({\varvec{\mathrm{S}}}\)) is of full column rank (resp. row rank) for the last equality. The following derivations then follow from basic tensor algebra:

$$\begin{aligned} {\varvec{\alpha }}&= ({\varvec{\mathrm{S}}}^\dagger )^\top {({\varvec{\mathcal {H}}}^{(L)}_f)}_{\langle \!\langle L+1\rangle \!\rangle } = ({\varvec{\mathrm{S}}}^\dagger )^\top ({\varvec{\mathrm{S}}}^\star )^\top {\varvec{\alpha }}^\star = ({\varvec{\mathrm{S}}}^\star {\varvec{\mathrm{S}}}^\dagger )^\top {\varvec{\alpha }}^\star = \mathbf {M}^{-\top }{\varvec{\alpha }}^\star ,\\ \ \\ {\varvec{\mathcal {A}}}&= ({({\varvec{\mathcal {H}}}^{(2L+1)}_f)}_{\langle \!\langle L,1,L+1\rangle \!\rangle })\times _{1}{\varvec{\mathrm{P}}}^\dagger \times _{3}({\varvec{\mathrm{S}}}^\dagger )^\top \\&= ({\varvec{\mathcal {A}}}^\star \times _{1} {\varvec{\mathrm{P}}}^\star \times _{3} ({\varvec{\mathrm{S}}}^\star )^\top )\times _{1}{\varvec{\mathrm{P}}}^\dagger \times _{3}({\varvec{\mathrm{S}}}^\dagger )^\top \\&= {\varvec{\mathcal {A}}}^\star \times _{1} {\varvec{\mathrm{P}}}^\dagger {\varvec{\mathrm{P}}}^\star \times _{3} ({\varvec{\mathrm{S}}}^\star {\varvec{\mathrm{S}}}^\dagger )^\top = {\varvec{\mathcal {A}}}^\star \times _{1}\mathbf {M}\times _{3}\mathbf {M}^{-\top },\\ \ \\ {\varvec{\varOmega }}^\top&= {\varvec{\mathrm{P}}}^\dagger {({\varvec{\mathcal {H}}}^{(L)}_f)}_{\langle \!\langle L,1\rangle \!\rangle } = {\varvec{\mathrm{P}}}^\dagger {\varvec{\mathrm{P}}}^\star ({\varvec{\varOmega }}^\star )^\top = \mathbf {M}({\varvec{\varOmega }}^\star )^\top , \end{aligned}$$

which concludes the proof. \(\square\)

Observe that such an integer L exists under the assumption that \(P= S= [d]^L\) forms a complete basis for \({\tilde{f}}\). It is also worth mentioning that a necessary condition for \({{\,\mathrm{rank}\,}}({({\varvec{\mathcal {H}}}^{(2L)}_f)}_{\langle \!\langle L,L+1\rangle \!\rangle }) = n\) is that \(d^L\ge n\), i.e. L must be at least of order \(\log _d(n)\).
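
A minimal sketch (ours) of the recovery procedure of Theorem 4 is given below; it obtains the factors \({\varvec{\mathrm{P}}}\) and \({\varvec{\mathrm{S}}}\) from a truncated SVD of the Hankel matricization and applies the pseudo-inverse formulas of the theorem. The helper name and the way the Hankel tensors are obtained are illustrative; their estimation from data is the subject of the next section.

```python
import numpy as np

def recover_2rnn(H_L, H_2L, H_2L1, d, p, L, n):
    """Recover (alpha, A, Omega) from the Hankel tensors H^(L), H^(2L), H^(2L+1).

    Follows Theorem 4: factor the <<L, L+1>> matricization of H^(2L) as P S
    (here via a truncated SVD) and apply the pseudo-inverse formulas.
    """
    H2L = H_2L.reshape(d**L, d**L * p)                  # <<L, L+1>> matricization
    U, s, Vt = np.linalg.svd(H2L, full_matrices=False)
    P = U[:, :n] * s[:n]                                # P in R^{d^L x n}
    S = Vt[:n, :]                                       # S in R^{n x d^L p}, H2L ~ P S
    P_pinv, S_pinv = np.linalg.pinv(P), np.linalg.pinv(S)

    alpha = S_pinv.T @ H_L.reshape(d**L * p)            # alpha = (S^+)^T vec(H^(L))
    Omega = (P_pinv @ H_L.reshape(d**L, p)).T           # Omega^T = P^+ (H^(L))_{<<L,1>>}
    H3 = H_2L1.reshape(d**L, d, d**L * p)               # <<L, 1, L+1>> matricization
    A = np.einsum('ab,bjc,dc->ajd', P_pinv, H3, S_pinv.T)  # A = H3 x_1 P^+ x_3 (S^+)^T
    return alpha, A, Omega
```

In the exact (noiseless) case any rank-n factorization of the Hankel matricization can be used; when the Hankel tensors are only estimated, the truncated SVD is a natural choice.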

5.2 Hankel tensors recovery from linear measurements

We showed in the previous section that, given the Hankel tensors \({\varvec{\mathcal {H}}}^{(L)}_f\), \({\varvec{\mathcal {H}}}^{(2L)}_f\) and \({\varvec{\mathcal {H}}}^{(2L+1)}_f\), one can recover a linear 2-RNN computing f if it exists. This first implies that the class of functions that can be computed by linear 2-RNN is learnable in Angluin’s exact learning model (Angluin, 1988) where one has access to an oracle that can answer membership queries (e.g. what is the value computed by the target f on \((\mathbf {x}_1,\ldots ,\mathbf {x}_k)\)?) and equivalence queries (e.g. is the current hypothesis h equal to the target f?). While this fundamental result is of significant theoretical interest, assuming access to such an oracle is unrealistic. In this section, we show that a stronger learnability result can be obtained in a more realistic setting, where we only assume access to randomly generated input/output examples \(((\mathbf {x}^{(i)}_1,\mathbf {x}_2^{(i)},\ldots ,\mathbf {x}_l^{(i)}),\mathbf {y}^{(i)})\in (\mathbb {R}^d)^*\times \mathbb {R}^p\) where \(\mathbf {y}^{(i)} = f(\mathbf {x}^{(i)}_1,\mathbf {x}_2^{(i)},\ldots ,\mathbf {x}_l^{(i)})\). The key observation is that such an example \(((\mathbf {x}^{(i)}_1,\mathbf {x}_2^{(i)},\ldots ,\mathbf {x}_l^{(i)}),\mathbf {y}^{(i)})\) can be seen as a linear measurement of the Hankel tensor \({\varvec{\mathcal {H}}}^{(l)}\). Indeed, let f be a function computed by a linear 2-RNN. Using the multilinearity of f we have

$$\begin{aligned} f(\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_l)&= f\left( \sum _{i_1} (\mathbf {x}_1)_{i_1}\mathrm {e}_{i_1}, \sum _{i_2} (\mathbf {x}_2)_{i_2}\mathrm {e}_{i_2},\ldots ,\sum _{i_l} (\mathbf {x}_l)_{i_l}\mathrm {e}_{i_l}\right) \\&= \sum _{i_1,\ldots ,i_l} (\mathbf {x}_1)_{i_1}\ldots (\mathbf {x}_l)_{i_l} f( \mathrm {e}_{i_1}, \ldots ,\mathrm {e}_{i_l}) \\&= \sum _{i_1,\ldots ,i_l} (\mathbf {x}_1)_{i_1}\ldots (\mathbf {x}_l)_{i_l} ({\varvec{\mathcal {H}}}^{(l)}_f)_{i_1,\ldots ,i_l,:} \\&= {\varvec{\mathcal {H}}}^{(l)}_f \bullet _{1} \mathbf {x}_1 \bullet _{2} \ldots \bullet _{l} \mathbf {x}_l \\&= {({\varvec{\mathcal {H}}}^{(l)}_f)}_{\langle \!\langle l,1\rangle \!\rangle }^\top (\mathbf {x}_1\otimes \ldots \otimes \mathbf {x}_{l}) \end{aligned}$$

where \((\mathrm {e}_1,\ldots ,\mathrm {e}_d)\) denotes the canonical basis of \(\mathbb {R}^d\). It follows that each input/output example \(((\mathbf {x}^{(i)}_1,\mathbf {x}_2^{(i)},\ldots ,\mathbf {x}_l^{(i)}),\mathbf {y}^{(i)})\) constitutes a linear measurement of \(\varvec{\mathcal {H}}^{(l)}\):

$$\begin{aligned} \mathbf {y}^{(i)} = {({\varvec{\mathcal {H}}}^{(l)})}_{\langle \!\langle l,1\rangle \!\rangle }^\top (\mathbf {x}^{(i)}_1\otimes \ldots \otimes \mathbf {x}_{l}^{(i)}) = {({\varvec{\mathcal {H}}}^{(l)})}_{\langle \!\langle l,1\rangle \!\rangle }^\top \mathbf {x}^{(i)} \end{aligned}$$

where \(\mathbf {x}^{(i)} {:=} \mathbf {x}^{(i)}_1\otimes \cdots \otimes \mathbf {x}_{l}^{(i)}\in \mathbb {R}^{d^l}\). Hence, by regrouping N output examples \(\mathbf {y}^{(i)}\) into the matrix \(\mathrm {Y}\in \mathbb {R}^{N\times p}\) and the corresponding input vectors \(\mathbf {x}^{(i)}\) into the matrix \(\mathbf {X}\in \mathbb {R}^{N\times d^l}\), one can recover \({\varvec{\mathcal {H}}}^{(l)}\) by solving the linear system \(\mathrm {Y}= \mathbf {X}{({\varvec{\mathcal {H}}}^{(l)})}_{\langle \!\langle l,1\rangle \!\rangle }\), which has a unique solution whenever \(\mathbf {X}\) is of full column rank. This simple estimation technique for the Hankel tensors allows us to design the first consistent learning algorithm for linear 2-RNN, which is summarized in Algorithm 1 (with the "Least-Squares" recovery method). More efficient recovery methods for the Hankel tensors will be discussed in the next section. The following theorem shows that this learning algorithm is consistent. Its proof relies on the fact that \(\mathbf {X}\) will be of full column rank whenever \(N\ge d^l\) and the components of each \(\mathbf {x}^{(i)}_j\) for \(j\in [l],i\in [N]\) are drawn independently from a continuous distribution over \(\mathbb {R}^{d}\) (w.r.t. the Lebesgue measure).
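
The following sketch (ours; the target function, helper name and sizes are illustrative) implements this least-squares recovery of a Hankel tensor from input/output examples:

```python
import numpy as np
from functools import reduce

def recover_hankel_ls(X_seqs, Y, d, p, l):
    """Least-squares estimate of H^(l) from examples ((x_1, ..., x_l), y).

    X_seqs: array of shape (N, l, d) of input sequences, Y: array of shape (N, p).
    Each row of the design matrix is the Kronecker product x_1 (x) ... (x) x_l.
    """
    X = np.stack([reduce(np.kron, seq) for seq in X_seqs])     # shape (N, d**l)
    H_flat, *_ = np.linalg.lstsq(X, Y, rcond=None)             # solves X H = Y
    return H_flat.reshape((d,) * l + (p,))

# Example: exact measurements of a random Hankel tensor (illustrative sizes)
rng = np.random.default_rng(3)
d, p, l, N = 3, 2, 3, 200                                      # N >= d**l = 27
H_true = rng.normal(size=(d,) * l + (p,))
X_seqs = rng.normal(size=(N, l, d))
Y = np.stack([reduce(np.kron, seq) for seq in X_seqs]) @ H_true.reshape(d**l, p)
print(np.allclose(recover_hankel_ls(X_seqs, Y, d, p, l), H_true))   # True
```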

Theorem 5

Let \((\mathbf {h}_0,{\varvec{\mathcal {A}}},{\varvec{\varOmega }})\) be a minimal linear 2-RNN with n hidden units computing a function \(f:(\mathbb {R}^d)^*\rightarrow \mathbb {R}^p\), and let L be an integer such that \({{\,\mathrm{rank}\,}}({({\varvec{\mathcal {H}}}^{(2L)}_f)}_{\langle \!\langle L,L+1\rangle \!\rangle }) = n\). Suppose we have access to 3 datasets

$$\begin{aligned} D_l = \{((\mathbf {x}^{(i)}_1,\mathbf {x}_2^{(i)},\ldots ,\mathbf {x}_l^{(i)}),\mathbf {y}^{(i)}) \}_{i=1}^{N_l}\subset (\mathbb {R}^d)^l\times \mathbb {R}^p \text { for } l\in \{L,2L,2L+1\} \end{aligned}$$

where the entries of each \(\mathbf {x}^{(i)}_j\) are drawn independently from the standard normal distribution and where each \(\mathbf {y}^{(i)} = f(\mathbf {x}^{(i)}_1,\mathbf {x}_2^{(i)},\ldots ,\mathbf {x}_l^{(i)})\). Then, if \(N_l \ge d^l\) for \(l =L,\ 2L,\ 2L+1\), the linear 2-RNN M returned by Algorithm 1 with the least-squares method satisfies \(f_M = f\) with probability one.

Proof

We just need to show for each \(l\in \{L,2L,2L+1\}\) that, under the hypothesis of the Theorem, the Hankel tensors \({\hat{{\varvec{\mathcal {H}}}}}^{(l)}\) computed in line 4 of Algorithm 1 are equal to the true Hankel tensors \({\varvec{\mathcal {H}}}^{(l)}\) with probability one. Recall that these tensors are computed by solving the least-squares problem

$$\begin{aligned} {\hat{{\varvec{\mathcal {H}}}}}^{(l)} = \mathop {\mathrm {arg \, min}}\limits _{{\varvec{\mathcal {T}}}\in \mathbb {R}^{d\times \cdots \times d\times p}} \Vert \mathbf {X}{({\varvec{\mathcal {T}}})}_{\langle \!\langle l,1\rangle \!\rangle } - \mathrm {Y}\Vert _F^2 \end{aligned}$$

where \(\mathbf {X}\in \mathbb {R}^{N_l\times d^l}\) is the matrix with rows \(\mathbf {x}^{(i)}_1\otimes \mathbf {x}_2^{(i)}\otimes \cdots \otimes \mathbf {x}_l^{(i)}\) for each \(i\in [N_l]\). Since \(\mathbf {X}{({\varvec{\mathcal {H}}}^{(l)})}_{\langle \!\langle l,1\rangle \!\rangle } = \mathrm {Y}\) and the solution of the least-squares problem is unique as soon as \(\mathbf {X}\) is of full column rank, we just need to show that this is the case with probability one when the entries of the vectors \(\mathbf {x}^{(i)}_j\) are drawn at random from a standard normal distribution. The result will then directly follow by applying Theorem 4. We will show that the set

$$\begin{aligned} \mathcal {S}= \left\{ {\{(\mathbf {x}_1^{(i)},\ldots , \mathbf {x}_l^{(i)})\}}_{i=1}^{N_l} \ \big |\ \dim \left( {{\,\mathrm{span}\,}}\left( \{ \mathbf {x}^{(i)}_1\otimes \mathbf {x}_2^{(i)}\otimes \cdots \otimes \mathbf {x}_l^{(i)} \ |\ i\in [N_l]\}\right) \right) < d^l\right\} \end{aligned}$$

has Lebesgue measure 0 in \(((\mathbb {R}^d)^{l})^{N_l}\simeq \mathbb {R}^{dlN_l}\) as soon as \(N_l \ge d^l\), which will imply that it has probability 0 under any continuous probability distribution, hence the result. For any \(S=\{(\mathbf {x}_1^{(i)},\ldots , \mathbf {x}_l^{(i)})\}_{i=1}^{N_l}\), we denote by \(\mathbf {X}_S\in \mathbb {R}^{N_l\times d^l}\) the matrix with rows \(\mathbf {x}^{(i)}_1\otimes \mathbf {x}_2^{(i)}\otimes \cdots \otimes \mathbf {x}_l^{(i)}\). One can easily check that \(S\in \mathcal {S}\) if and only if \(\mathbf {X}_S\) is of rank strictly less than \(d^l\), which is equivalent to the determinant of \(\mathbf {X}_S^\top \mathbf {X}_S\) being equal to 0. Since this determinant is a polynomial in the entries of the vectors \(\mathbf {x}_j^{(i)}\), \(\mathcal {S}\) is an algebraic subvariety of \(\mathbb {R}^{dlN_l}\). It is then easy to check that the polynomial \(\det (\mathbf {X}_S^\top \mathbf {X}_S)\) is not uniformly 0 when \(N_l \ge d^l\). Indeed, it suffices to choose the vectors \(\mathbf {x}_j^{(i)}\) such that the family \((\mathbf {x}^{(i)}_1\otimes \mathbf {x}_2^{(i)}\otimes \cdots \otimes \mathbf {x}_l^{(i)})_{i=1}^{N_l}\) spans the whole space \(\mathbb {R}^{d^l}\) (which is possible since the family contains \(N_l\ge d^l\) elements that can be chosen arbitrarily), hence the result. In conclusion, \(\mathcal {S}\) is a proper algebraic subvariety of \(\mathbb {R}^{dlN_l}\) and hence has Lebesgue measure zero (Federer 2014, Section 2.6.5). \(\square\)

(Algorithm 1: spectral learning of linear 2-RNN from data, with the different Hankel recovery methods)

A few remarks on this theorem are in order. The first observation is that the 3 datasets \(D_L\), \(D_{2L}\) and \(D_{2L+1}\) do not need to be drawn independently from one another (e.g. the sequences in \(D_{L}\) can be prefixes of the sequences in \(D_{2L}\) but it is not necessary). In particular, the result still holds when the datasets \(D_L\), \(D_{2L}\) and \(D_{2L+1}\) are constructed from a unique dataset

$$\begin{aligned} S =\{((\mathbf {x}^{(i)}_1,\mathbf {x}_2^{(i)},\ldots ,\mathbf {x}_T^{(i)}),(\mathbf {y}^{(i)}_1,\mathbf {y}^{(i)}_2,\ldots ,\mathbf {y}^{(i)}_T)) \}_{i=1}^{N} \end{aligned}$$

of input/output sequences with \(T\ge 2L+1\), where \(\mathbf {y}^{(i)}_t = f(\mathbf {x}^{(i)}_1,\mathbf {x}_2^{(i)},\ldots ,\mathbf {x}_t^{(i)})\) for any \(t\in [T]\). Observe that having access to such input/output training sequences is not an unrealistic assumption: for example, when training RNN for language modeling, the output \(\mathbf {y}_t\) is the conditional probability vector of the next symbol. Lastly, when the outputs \(\mathbf {y}^{(i)}\) are noisy, one can solve the least-squares problem \(\min _{{\varvec{\mathcal {H}}}^{(l)}}\Vert \mathrm {Y}- \mathbf {X}{({\varvec{\mathcal {H}}}^{(l)})}_{\langle \!\langle l,1\rangle \!\rangle }\Vert ^2_F\) to approximate the Hankel tensors; we will empirically evaluate this approach in Sect. 6 and we defer its theoretical analysis in the noisy setting to future work.
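
To make the first remark concrete, here is a minimal sketch (ours; names and data layout are illustrative) of how the three datasets \(D_L\), \(D_{2L}\) and \(D_{2L+1}\) could be extracted from such a dataset of input/output sequences:

```python
def build_length_datasets(input_seqs, output_seqs, L):
    """Extract D_L, D_2L and D_2L+1 from a single dataset of input/output sequences.

    input_seqs[i] is a list of input vectors (x_1, ..., x_T) and output_seqs[i]
    the corresponding outputs (y_1, ..., y_T) with y_t = f(x_1, ..., x_t), T >= 2L+1.
    Returns a dict mapping each length l to a list of ((x_1, ..., x_l), y_l) pairs.
    """
    datasets = {}
    for l in (L, 2 * L, 2 * L + 1):
        datasets[l] = [(xs[:l], ys[l - 1]) for xs, ys in zip(input_seqs, output_seqs)]
    return datasets
```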

5.3 Leveraging the low rank structure of the Hankel tensors

While the least-squares method is sufficient to obtain the theoretical guarantees of Theorem 5, it does not leverage the low rank structure of the Hankel tensors \({\varvec{\mathcal {H}}}^{(L)}\), \({\varvec{\mathcal {H}}}^{(2L)}\) and \({\varvec{\mathcal {H}}}^{(2L+1)}\). We now propose several alternative recovery methods that leverage this structure, in order to improve both sample complexity and time complexity. The sample efficiency and running time of these methods will be assessed in a simulation study in Sect. 6 (deriving improved sample complexity guarantees using these methods is left for future work).

We first propose two alternatives to solving the least-squares problem \(\mathrm {Y}= \mathbf {X}{({\varvec{\mathcal {H}}}^{(l)})}_{\langle \!\langle l,1\rangle \!\rangle }\) that leverage the low matrix rank structure of the Hankel tensor. Indeed, knowing that \({({\varvec{\mathcal {H}}}^{(l)})}_{\langle \!\langle {\left\lceil l/2\right\rceil },l-{\left\lceil l/2\right\rceil } + 1\rangle \!\rangle }\) can be approximately low rank (if the target function is computed by a WFA with a small number of states), one can achieve better sample complexity by taking into account the fact that the effective number of parameters needed to describe this matrix can be significantly lower than its number of entries. The first approach is to reformulate the least-squares problem as a nuclear norm minimization problem (see line 6 of Algorithm 1). The nuclear norm is the tightest convex relaxation of the matrix rank and the resulting optimization problem can be solved using standard convex optimization toolboxes (Candes & Plan, 2011; Recht et al., 2010). A second approach is a non-convex optimization algorithm: iterative hard thresholding (IHT) (Jain et al., 2010) (see lines 7–12 of Algorithm 1). This method boils down to a projected gradient descent algorithm: at each iteration, the Hankel tensor is updated by taking a step in the direction opposite to the gradient of the least-squares objective, before being projected onto the set of low rank matrices using truncated SVD. More precisely, first the following gradient update is performed:

$$\begin{aligned} {({\varvec{\mathcal {H}}}^{(l)})}_{\langle \!\langle l,1\rangle \!\rangle }&\leftarrow {({\varvec{\mathcal {H}}}^{(l)} )}_{\langle \!\langle l,1\rangle \!\rangle } - \gamma \nabla _{{({\varvec{\mathcal {H}}}^{(l)} )}_{\langle \!\langle l,1\rangle \!\rangle }} \Vert \mathbf {X}{(\varvec{\mathcal {H}}^{(l)})}_{\langle \!\langle l,1\rangle \!\rangle } - \mathrm {Y}\Vert _F^2\\&= {({\varvec{\mathcal {H}}}^{(l)} )}_{\langle \!\langle l,1\rangle \!\rangle } + \gamma \mathbf {X}^\top (\mathrm {Y}- \mathbf {X}{({\varvec{\mathcal {H}}}^{(l)} )}_{\langle \!\langle l,1\rangle \!\rangle }) \end{aligned}$$

where \(\gamma\) is the learning rate. Then, a truncated SVD of the matricization \({({\varvec{\mathcal {H}}}^{(l)})}_{\langle \!\langle {\left\lceil l/2\right\rceil },l-{\left\lceil l/2\right\rceil } + 1\rangle \!\rangle }\) is performed to obtain a low rank approximation of the Hankel tensor.

Both the nuclear norm minimization and the iterative hard thresholding algorithms only leverage the fact that the matrix rank of \({({\varvec{\mathcal {H}}}^{(l)})}_{\langle \!\langle {\left\lceil l/2\right\rceil },l-{\left\lceil l/2\right\rceil } + 1\rangle \!\rangle }\) is small. However, as we have shown in Sect. 3.1, the Hankel tensor \({\varvec{\mathcal {H}}}^{(l)}\) exhibits a stronger structure: it is of low tensor train rank (which implies that any of its matricizations is a low rank matrix). We now present three methods leveraging this structure for the recovery of the Hankel tensors from linear measurements. The first optimization algorithm is tensor iterative hard thresholding (TIHT) (Rauhut et al., 2017), which is the tensor generalization of IHT. Similarly to IHT, TIHT is a projected gradient descent algorithm where the projection step consists in projecting the Hankel tensor onto the manifold of tensors with low tensor train rank (instead of projecting onto the set of low rank matrices): after the gradient update described above, a low rank tensor train approximation of the Hankel tensor \({\varvec{\mathcal {H}}}^{(l)}\) is computed using the TT-SVD algorithm (Oseledets 2011). Even though TIHT leverages the tensor train structure of the Hankel tensors to obtain better sample complexity, its computational complexity remains high since the Hankel tensor \({\varvec{\mathcal {H}}}^{(l)}\) needs to be repeatedly converted between its dense form (for the gradient step) and its tensor train decomposition (for the projection step). Observe here that the sizes of these two objects differ significantly: the full Hankel tensor \({\varvec{\mathcal {H}}}^{(l)}\) has \(d^lp\) entries, whereas the number of parameters of its tensor train decomposition is only in \({\mathcal {O}}\left( ldR^2+pR\right)\), where R is the rank of the tensor train decomposition. Similarly to the efficient learning algorithm in the tensor train format presented in Sect. 3.2, the recovery of the Hankel tensors can be carried out in the tensor train format without ever having to explicitly construct the tensor \({\varvec{\mathcal {H}}}^{(l)}\).

We conclude by presenting two optimization methods to recover the Hankel tensors from data directly in the tensor train format. For both methods, the Hankel tensor \({\varvec{\mathcal {H}}}^{(l)}\) is never explicitly constructed but parameterized by the core tensors \(\varvec{{ g}}_1,\ldots ,\varvec{{ g}}_{l+1}\) of its TT decomposition:

$$\begin{aligned} {\varvec{\mathcal {H}}}^{(l)} = \llbracket \varvec{{g}}_1,\ldots ,\varvec{{g}}_{l+1} \rrbracket . \end{aligned}$$

Both methods are iterative and optimize the least-squares objective with respect to each of the core tensors in turn until convergence. The first one is the alternating least-squares algorithm (ALS), which is one of the workhorses of tensor decomposition algorithms (Kolda and Bader 2009). In ALS, at each iteration a least-squares problem is solved in turn for each one of the cores of the TT decomposition:

$$\begin{aligned} \varvec{{g}}_i\leftarrow \mathop {\mathrm {arg\, min}}\limits _{\varvec{{g}}_i} \Vert \mathbf {X}{(\llbracket \varvec{{g}}_1,\ldots ,\varvec{{g}}_{l+1} \rrbracket )}_{\langle \!\langle l,1\rangle \!\rangle } - \mathrm {Y}\Vert _F^2\ \ \text {for }i=1,\cdots ,l+1. \end{aligned}$$

The second one consists in simply using gradient descent to perform a gradient step with respect to each one of the core tensors at each iteration:

$$\begin{aligned} \varvec{{g}}_i\leftarrow {\varvec{{g}}_i} -\gamma \nabla _{\varvec{{g}}_i}\Vert \mathbf {X}{(\llbracket \varvec{{g}}_1,\ldots ,\varvec{{g}}_{l+1} \rrbracket )}_{\langle \!\langle l,1\rangle \!\rangle } - \mathrm {Y}\Vert _F^2\ \ \text {for }i=1,\ldots ,l+1 \end{aligned}$$

where \(\gamma\) is the learning rate. Both methods are described in lines 15–18 of Algorithm 1. Combining these two optimization methods with the spectral learning algorithm in the tensor train format described in Sect. 3.2 results in an efficient learning algorithm to estimate a linear 2-RNN from training data, where the Hankel tensors are never explicitly constructed but always manipulated in the tensor train format. To conclude this section, we briefly mention that the ALS and gradient descent algorithms can straightforwardly be adapted to perform optimization with respect to mini-batches instead of the whole training dataset. This allows us to further scale the algorithm to large training sets.
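
As an illustration of the second method, the following sketch (ours; all names, shapes and hyperparameters are illustrative, and a uniform TT-rank R is assumed) optimizes the TT cores of \({\varvec{\mathcal {H}}}^{(l)}\) by gradient descent in PyTorch, evaluating the measurement operator directly on the cores so that the full Hankel tensor is never built:

```python
import torch

def tt_predict(cores, X_seqs):
    """Batched evaluation of the <<l,1>>-matricized [[g_1, ..., g_{l+1}]] applied to
    the rank-one inputs x_1 (x) ... (x) x_l, without building the full Hankel tensor.

    cores: [g_1 of shape (d, R), g_2 ... g_l of shape (R, d, R), g_{l+1} of shape (R, p)]
    X_seqs: tensor of shape (N, l, d). Returns predictions of shape (N, p).
    """
    v = torch.einsum('nd,dr->nr', X_seqs[:, 0, :], cores[0])        # contract x_1 with g_1
    for t, core in enumerate(cores[1:-1], start=1):
        v = torch.einsum('nr,rds,nd->ns', v, core, X_seqs[:, t, :]) # contract x_t with g_t
    return v @ cores[-1]                                            # output core g_{l+1}

# Minimal sketch: recover the TT cores of H^(l) by gradient descent with Adam
torch.manual_seed(0)
d, p, l, R, N = 3, 2, 3, 4, 500
X_seqs = torch.randn(N, l, d)
with torch.no_grad():
    true_cores = [torch.randn(d, R), *[torch.randn(R, d, R) for _ in range(l - 1)],
                  torch.randn(R, p)]
    Y = tt_predict(true_cores, X_seqs)      # measurements of a ground-truth TT Hankel tensor

cores = [torch.randn(d, R, requires_grad=True)] \
      + [torch.randn(R, d, R, requires_grad=True) for _ in range(l - 1)] \
      + [torch.randn(R, p, requires_grad=True)]
opt = torch.optim.Adam(cores, lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    loss = torch.mean((tt_predict(cores, X_seqs) - Y) ** 2)
    loss.backward()
    opt.step()
```

The same contraction structure can be reused for ALS by solving, at each iteration, a least-squares problem for one core while keeping the others fixed.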

6 Experiments

In this section, we perform experiments on two toy examples to compare how the choice of the recovery method (LeastSquares, NuclearNorm, IHT, TIHT, ALS and Gradient Descent) affects the sample efficiency of Algorithm 1, and the corresponding computation time. We also report the performance obtained by refining the solutions returned by our algorithm (with both TIHT and ALS recovery methods) using stochastic gradient descent (TIHT+SGD, ALS+SGD). In addition, we perform experiments on a real world dataset of wind speed data from TUDelft, which is used in Lin et al. (2016). For the real world data, we include the original results for competitive approaches from Lin et al. (2016).

6.1 Synthetic data

We perform experiments on two toy problems: recovering a random 2-RNN from data and a simple addition task. For the random 2-RNN problem, we randomly generate a linear 2-RNN with 5 units computing a function \(f:(\mathbb {R}^3)^*\rightarrow \mathbb {R}^2\) by drawing the entries of all parameters \((\mathbf {h}_0,{\varvec{\mathcal {A}}},{\varvec{\varOmega }})\) independently from a normal distribution \(\mathcal {N}(0,0.2)\). The training data consists of 3 independently drawn sets \(D_l = \{((\mathbf {x}^{(i)}_1,\mathbf {x}_2^{(i)},\ldots ,\mathbf {x}_l^{(i)}),\mathbf {y}^{(i)}) \}_{i=1}^{N_l}\subset (\mathbb {R}^d)^l\times \mathbb {R}^p\) for \(l\in \{L,2L,2L+1\}\) with \(L=2\), where each \(\mathbf {x}^{(i)}_j\sim \mathcal {N}(\mathbf {0},\mathbf {I})\) and where the outputs can be noisy, i.e. \(\mathbf {y}^{(i)} = f(\mathbf {x}^{(i)}_1,\mathbf {x}_2^{(i)},\ldots ,\mathbf {x}_l^{(i)}) + {\varvec{\xi }}^{(i)}\) where \({\varvec{\xi }}^{(i)}\sim \mathcal {N}(\mathbf {0},\sigma ^2\mathbf {I})\) for some noise variance parameter \(\sigma ^2\).

For the addition problem, the goal is to learn a simple arithmetic function computing the sum of the running differences between the two components of a sequence of 2-dimensional vectors, i.e. \(f(\mathbf {x}_1,\ldots ,\mathbf {x}_k) = \sum _{i=1}^k \mathbf {v}^\top \mathbf {x}_i\) where \(\mathbf {v}^\top = (-1\ \ 1)\). The 3 training datasets are generated using the same process as above, and a constant entry equal to one is added to all the input vectors to encode a bias term (one can check that the resulting function can be computed by a linear 2-RNN with 2 hidden units).

We run the experiments for different sizes of training data ranging from \(N=20\) to \(N=5000\) (we set \(N_L=N_{2L}=N_{2L+1}=N\)) and we compare the different methods in terms of mean squared error (MSE) on a test set of 1,000 sequences of length 6 generated in the same way as the training data (note that the training data only contains sequences of length up to 5). The IHT/TIHT methods sometimes returned aberrant models (due to numerical instabilities); we used the following scheme to circumvent this issue: when the training MSE of the hypothesis was greater than the one of the zero function, the zero function was returned instead (we applied this scheme to all other methods in the experiments). For the gradient descent approach, we use automatic differentiation in PyTorch with the Adam (Kingma & Ba, 2015) optimizer and a learning rate of 0.001.
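
For reference, the addition-task data can be generated as in the following sketch (ours; the helper name and sizes are illustrative):

```python
import numpy as np

def addition_task_data(N, length, rng):
    """Generate ((x_1, ..., x_l), y) pairs for the addition task described above.

    Each x_t is a 2-dimensional Gaussian vector augmented with a constant bias
    entry equal to one; the target is y = sum_t (x_t[1] - x_t[0]).
    """
    xs = rng.normal(size=(N, length, 2))
    bias = np.ones((N, length, 1))
    inputs = np.concatenate([xs, bias], axis=2)          # shape (N, length, 3)
    targets = (xs[:, :, 1] - xs[:, :, 0]).sum(axis=1, keepdims=True)
    return inputs, targets

rng = np.random.default_rng(4)
D_L, y_L = addition_task_data(N=1000, length=2, rng=rng)     # L = 2 as in the experiments
```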

6.1.1 Results

The results are reported in Figs. 4 and 5, where we see that all recovery methods lead to consistent estimates of the target function given enough training data. This is the case even in the presence of noise (in which case more samples are needed to achieve the same accuracy, as expected). We can also see that TIHT and ALS tend to be overall more sample efficient than the other methods (especially with noisy data), showing that taking the low rank structure of the Hankel tensors into account is profitable. Moreover, TIHT tends to perform better than its matrix counterpart, confirming our intuition that leveraging the tensor train structure is beneficial. We also found that using gradient descent to refine the learned 2-RNN model often leads to a performance boost. In Figs. 6 and 7 we show the improvement in MSE obtained by fine-tuning the learned 2-RNN using gradient descent. We implement the fine-tuning process in PyTorch with the Adam optimizer and a learning rate of 0.0001. Fine-tuning helps the model converge to the optimal solution with less data, resulting in a more sample efficient approach. Lastly, we briefly mention that on these two tasks, previous experiments showed that both non-linear and linear recurrent neural network architectures trained with the back-propagation algorithm performed significantly worse than the spectral learning based algorithm we propose (see Rabusseau et al., 2019).

Fig. 4: Average MSE as a function of the training set size for learning a random linear 2-RNN with different values of output noise

Fig. 5: Average MSE as a function of the training set size for learning a simple arithmetic function with different values of output noise

Fig. 6: Performance comparison between vanilla methods and fine-tuned methods on the random 2-RNN problem

Fig. 7: Performance comparison between vanilla methods and fine-tuned methods on the addition problem

6.1.2 Running time analysis

By directly recovering the Hankel tensor in its tensor train form, ALS and SGD significantly reduce the computation time needed to recover the Hankel tensor. In Fig. 8a, we report the computation time of the different Hankel recovery methods for Hankel tensors of various lengths L. The experiment is performed with 1000 examples for the addition problem and all iterative methods (excluding the least-squares method, which is not iterative) are stopped when reaching the same fixed training accuracy. The figure shows a clear reduction in computation time for both ALS and SGD compared to the other methods, which is expected. More specifically, these methods have a much smaller computation-time growth rate with respect to the length L than the matrix-based methods. This is especially beneficial when dealing with data that exhibits long-term dependencies between the input variables. In comparison to the Hankel tensor recovery time, the spectral learning step takes significantly less time, typically less than a second. However, when the length L grows, directly performing spectral learning on the matrix form of the Hankel tensor may become infeasible due to the curse of dimensionality; in this case, one should perform the spectral learning algorithm in the tensor train format as described in Sect. 3.2.

To demonstrate the benefits of performing the spectral learning algorithm in the TT format (as described in Sect. 3.2), we perform an additional experiment showing that leveraging the TT format allows one to save a significant amount of computation time and memory in the spectral learning phase, especially when the corresponding Hankel tensor is large (i.e. large length and input dimension). In Fig. 8b we compare the running time of the spectral learning phase (after recovering the Hankel tensors) in the matrix and TT formats, where the latter leverages the TT structure in the spectral learning routine. We randomly generate 100,000 input-output examples using a random 2-RNN with 3 states, input dimension 5 and output dimension 1. We use ALS to recover the Hankel tensors in the TT format and compare the running time of the spectral learning in the TT format with the time needed to perform the classical spectral learning algorithm after reshaping the Hankel tensors into matrices (note that the time needed to convert the TT Hankel tensors into the corresponding Hankel matrices is not counted towards the matrix spectral learning time). In Fig. 8b, we report the time needed to recover the Hankel tensors from data (Hankel_ALS) and the time to recover the WFA in both the matrix and TT formats. One can observe that although classical matrix-based spectral learning is significantly faster than the TT-based one when the length is relatively small, the running time of the matrix method grows exponentially with the length while that of the TT method grows linearly. For example, when the length equals 12, TT spectral learning is more than 1000 times faster than classical spectral learning. This gap clearly shows the benefit of leveraging the TT format in the spectral learning phase. One remark is that the other Hankel tensor recovery methods we mentioned (i.e. TIHT, IHT, LeastSquares and NuclearNorm) fail to scale in this setup, due to the excessive memory these algorithms require to prepare the training data.
In addition, directly recovering the Hankel tensors and performing spectral learning in the TT format also drastically reduces the memory requirements. As an illustration, we compare the size of the Hankel tensor in the TT and matrix formats in Table 1. As one can see, the size of the matrix version of the Hankel tensor grows exponentially with the length, while the size of its TT representation grows linearly. This echoes the computation times observed for the two methods.

Fig. 8: Running time comparison

Table 1 Memory size of the Hankel tensor \(\varvec{\mathcal {H}}^{(\ell )}\) for the random 2-RNN problem (see Fig. 8b) in both TT and matrix formats

6.2 Real world data

In addition to the synthetic data experiments presented above, we conduct experiments on the wind speed data from TUDelft. For this experiment, to compare with existing results, we specifically use the data from the Rijnhaven station as described in Lin et al. (2016), which proposed a regression automata model and performed various experiments on the wind speed dataset. The data contains wind speed and related information at the Rijnhaven station from 2013-04-22 at 14:55:00 to 2018-10-20 at 11:40:00 and was collected every five minutes. To compare with the results in Lin et al. (2016), we strictly followed the data preprocessing procedure described in that paper. We use the data from 2013-04-23 to 2015-10-12 as training data and the rest as our testing data. Lin et al. (2016) use SAX as a preprocessing method to discretize the data; however, since our algorithm does not require discretization, we did not perform this step. For our method, we set the length \(L = 3\) and we use a window size of 6 to predict future values at test time. We calculate hourly averages of the wind speed, and predict one/three/six hour(s) ahead, as in Lin et al. (2016). In this experiment, our model only predicts the next hour from the past 6 observations; to make k-hour-ahead predictions, we feed the model's own forecasts back as inputs and bootstrap from them. For our methods we use a linear 2-RNN with 10 states. Prediction errors averaged over 5 runs of this experiment for one-hour-ahead, three-hour-ahead and six-hour-ahead prediction can be found in Tables 2, 3 and 4. The results for RA, RNN and persistence are taken directly from Lin et al. (2016).

The results of this experiment are presented in Tables 2, 3 and 4, where we can see that while TIHT+SGD performs slightly worse than ARIMA and RA for one-hour-ahead prediction, it outperforms all other methods for three-hour and six-hour-ahead predictions (and its advantage over the other methods increases as the prediction horizon gets longer). One important note is that although ALS and ALS+SGD perform slightly worse than TIHT and TIHT+SGD, ALS reduces the computation time by a factor of more than 4 (TIHT takes 3542 s while ALS takes 804 s).

Table 2 One-hour-ahead speed prediction performance comparisons
Table 3 Three-hour-ahead speed prediction performance comparisons
Table 4 Six-hour-ahead speed prediction performance comparisons

7 Conclusion and future directions

We proposed the first provable learning algorithm for second-order RNN with linear activation functions: we showed that linear 2-RNN are a natural extension of vv-WFA to the setting of input sequences of continuous vectors (rather than discrete symbols) and we extended the vv-WFA spectral learning algorithm to this setting. We also presented novel connections between WFA and tensor networks, showing that the computation of a WFA is intrinsically linked with the tensor train decomposition. We leveraged this connection to adapt the standard spectral learning algorithm to the tensor train format, allowing one to scale up the spectral algorithm to exponentially large sub-blocks of the Hankel matrix.

We believe that the results presented in this paper open a number of exciting and promising research directions from both the theoretical and practical perspectives. We first plan to use the spectral learning estimate as a starting point for gradient based methods to train non-linear 2-RNN. More precisely, linear 2-RNN can be thought of as 2-RNN using LeakyReLU activation functions with negative slope 1; therefore, one could use a linear 2-RNN as initialization before gradually reducing the negative slope parameter during training. The extension of the spectral method to linear 2-RNN also opens the door to scaling up the classical spectral algorithm to problems with large discrete alphabets (which is a known caveat of the spectral algorithm for WFA) since it allows one to use low dimensional embeddings of large vocabularies (using e.g. word2vec or latent semantic analysis).

From the theoretical perspective, we plan on deriving learning guarantees for linear 2-RNN in the noisy setting (e.g. using the PAC learnability framework). Even though it is intuitive that such guarantees should hold (given the continuity of all operations used in our algorithm), we believe that such an analysis may entail results of independent interest. In particular, analogously to the matrix case studied in Cai and Zhang (2015), obtaining optimal convergence rates for the recovery of the low TT-rank Hankel tensors from rank one measurements is an interesting direction; such a result could for example allow one to improve the generalization bounds provided in Balle and Mohri (2012) for spectral learning of general WFA. Lastly, establishing other equivalence results between classical classes of formal languages and functions computed by recurrent architectures is a worthwhile endeavor; such equivalence results shed new light on classical models from theoretical computer science and linguistics, while at the same time sparking original perspectives on modern machine learning architectures. A first direction could be to establish connections between weighted tree automata and tree-structured neural models such as recursive tensor neural networks (Socher et al., 2013a, 2013b).