Spectral learning of weighted automata
Abstract
In recent years we have seen the development of efficient, provably correct algorithms for learning Weighted Finite Automata (WFA). Most of these algorithms avoid the known hardness results by defining parameters beyond the number of states that can be used to quantify the complexity of learning automata under a particular distribution. One such class of methods is the so-called spectral algorithms, which measure learning complexity in terms of the smallest singular value of some Hankel matrix. However, despite their simplicity and wide applicability to real problems, their impact in application domains remains marginal to this date. One of the goals of this paper is to remedy this situation by presenting a derivation of the spectral method for learning WFA that, without sacrificing rigor and mathematical elegance, puts emphasis on providing intuitions on the inner workings of the method and does not assume a strong background in formal algebraic methods. In addition, our algorithm overcomes some of the shortcomings of previous work and is able to learn from statistics of substrings. To illustrate the approach we present experiments on a real application of the method to natural language parsing.
Keywords
Spectral learning, Weighted finite automata, Dependency parsing
1 Introduction
Learning finite automata is a fundamental task in Grammatical Inference. Over the years, a multitude of variations on this problem have been studied. For example, several learning models with different degrees of realism have been considered, ranging from query models and the learning in the limit paradigm, to the more challenging PAC learning framework. The main differences between these models are the ways in which learning algorithms can interact with the target machine. Besides the choice of learning model, the particular kind of target automata that must be learned also makes a difference in the study of this task. These can range from the classical acceptors for regular languages like Deterministic Finite Automata (DFA) and Nondeterministic Finite Automata (NFA), to the more general Weighted Finite Automata (WFA) and Multiplicity Automata (MA), while also considering intermediate cases like several classes of Probabilistic Finite Automata (PFA).
Efficient algorithms for learning all these classes of machines have been proposed in query models where algorithms have access to a minimal adequate teacher. Furthermore, most of these learning problems are also known to have polynomial information-theoretic complexity in the PAC learning model. But despite these encouraging results, it has been known for decades that the most basic problems regarding learnability of automata in the PAC model are computationally intractable under both complexity-theoretic and cryptographic assumptions. Since these general worst-case results preclude the existence of efficient learning algorithms for all machines under all possible probability distributions, much effort has been devoted to identifying problems involving special cases for which provably efficient learning algorithms can be given. An alternative approach has been to identify additional parameters beyond the number of states that can be used to quantify the complexity of learning a particular automaton under a particular distribution. A paradigmatic example of this line of work is the family of PAC learning algorithms for PDFA given in Ron et al. (1998), Clark and Thollard (2004), Palmer and Goldberg (2007), Castro and Gavaldà (2008), Balle et al. (2013), whose running times depend on a distinguishability parameter quantifying the minimal distance between distributions generated by different states in the target machine.
Spectral learning methods are a family of algorithms that also fall into this particular line of work. In particular, starting with the seminal works of Hsu et al. (2009) and Bailly et al. (2009), efficient provably correct algorithms for learning nondeterministic machines that define probability distributions over sets of strings have been recently developed. A workaround to the aforementioned hardness results is obtained in this case by including the smallest singular value of some Hankel matrix in the bounds on the running time of spectral algorithms. The initial enthusiasm generated by such algorithms has been corroborated by the appearance of numerous follow-ups devoted to extending the method to more complex probabilistic models. However, despite the fact that this type of algorithm can be used to learn classes of machines widely used in applications, like Hidden Markov Models (HMM) and PNFA, the impact of these methods in application domains remains marginal to this date. This remains so even though implementing such methods involves just a few linear algebra operations available in most general mathematical computing software packages. One of the main purposes of this paper is to try to remedy this situation by providing practical intuitions around the foundations of these algorithms and clear guidelines on how to use them in practice.
In our opinion, a major cause for the gap between the theoretical and practical development of spectral methods is the overwhelmingly theoretical nature of most papers in this area. The state of the art seems to suggest that there is no known workaround to these long mathematical proofs when seeking PAC learning results. However, it is also the case that most of the time the derivations given for these learning algorithms provide no intuitions on why or how one should expect them to work. Thus, setting aside the matter of PAC bounds, our first contribution is to provide a new derivation of the spectral learning algorithm for WFA that stresses the main intuitions behind the method. This yields an efficient algorithm for learning stochastic WFA defining probability distributions over strings. Our second contribution is showing how a simple transformation of this algorithm yields a more sample-efficient learning method that can work with substring statistics, in contrast to the usual prefix statistics used in other methods.
Finite automata can also be used as building blocks for constructing more general context-free grammatical formalisms. In this paper we consider the case of nondeterministic Split Head-Automata Grammars (SHAG). These are a family of hidden-state parsing models that have been successfully used to model the significant amount of non-local phenomena exhibited by dependency structures in natural language. A SHAG is composed of a collection of stochastic automata and can be used to define a probability distribution over dependency structures for a given sentence. Each automaton in a SHAG describes the generation of particular head-modifier sequences. Our third contribution is to apply the spectral method to the problem of learning the constituent automata of a target SHAG. Contrary to previous works where PDFA were used as basic constituent automata for SHAG, using the spectral method allows us to learn SHAG built out of nondeterministic automata.
1.1 Related work
In recent years multiple spectral learning algorithms have been proposed for a wide range of models. Many of these models deal with data whose nature is eminently sequential, like the work of Bailly et al. (2009) on WFA, or other works on particular subclasses of WFA like HMM (Hsu et al. 2009) and related extensions (Siddiqi et al. 2010; Song et al. 2010), Predictive State Representations (PSR) (Boots et al. 2011), Finite State Transducers (FST) (Balle et al. 2011), and Quadratic Weighted Automata (QWA) (Bailly 2011). Besides direct applications of the spectral algorithm to different classes of sequential models, the method has also been combined with convex optimization algorithms in Balle et al. (2012), Balle and Mohri (2012).
Despite this overwhelming diversity, to our knowledge the only previous work that has considered spectral learning for the general class of probabilistic weighted automata is due to Bailly et al. (2009). In spirit, their technique for deriving the spectral method is similar to ours. However, their elegant mathematical derivations are presented assuming a target audience with a strong background in formal algebraic methods. As such, their presentation lacks the intuitions necessary to make the work accessible to a more general audience of machine learning practitioners. In contrast, and without sacrificing rigor and mathematical elegance, our derivations put emphasis on providing intuitions on the inner workings of the spectral method.
Besides sequential models, spectral learning algorithms for tree-like structures appearing in context-free grammatical models and probabilistic graphical models have also been considered (Bailly et al. 2010; Parikh et al. 2011; Luque et al. 2012; Cohen et al. 2012; Dhillon et al. 2012). In Sect. 6.4 we give a more detailed comparison between our work on SHAG and related methods that learn tree-shaped models. The spectral method has been applied as well to other classes of probabilistic mixture models (Anandkumar et al. 2012c,a).
2 Weighted automata and Hankel matrices
In this section we present Weighted Finite Automata (WFA), the finite state machine formulations that will be used throughout the paper. We begin by introducing some notation for dealing with functions from strings to real numbers and then proceed to define Hankel matrices. These matrices will play a very important role in the derivation of the spectral learning algorithm given in Sect. 4. Then we proceed to describe the algebraic formulation of WFA and its relation to Hankel matrices. Finally, we discuss some special properties of stochastic WFA realizing probability distributions over strings. These properties will allow us to use the spectral method to learn from substring statistics, thus yielding more sample-efficient methods than other approaches based on string or prefix statistics.
2.1 Functions on strings and their Hankel matrices
Let Σ be a finite alphabet. We use σ to denote an arbitrary symbol in Σ. The set of all finite strings over Σ is denoted by Σ ^{⋆}, where we write λ for the empty string. We use bold letters to represent vectors v and matrices M. We use M ^{+} to denote the Moore–Penrose pseudoinverse of some matrix M.
Let \(f : \varSigma^{\star}\rightarrow\mathbb{R}\) be a function over strings. The Hankel matrix of f is a bi-infinite matrix \(\mathbf {H}_{f} \in \mathbb{R} ^{\varSigma^{\star}\times\varSigma^{\star}}\) whose entries are defined as H _{ f }(u,v)=f(uv) for any u,v∈Σ ^{⋆}. That is, rows are indexed by prefixes and columns by suffixes. Note that the Hankel matrix of a function f is a very redundant way to represent f. In particular, the value f(x) appears |x|+1 times in H _{ f } (once for each way of splitting x into a prefix and a suffix), and we have f(x)=H _{ f }(x,λ)=H _{ f }(λ,x). An obvious observation is that a matrix \(\mathbf {M}\in\mathbb{R}^{\varSigma ^{\star}\times \varSigma^{\star}}\) satisfying M(u _{1},v _{1})=M(u _{2},v _{2}) for any u _{1} v _{1}=u _{2} v _{2} is the Hankel matrix of some function \(f : \varSigma^{\star}\rightarrow\mathbb{R}\).
We will be considering (finite) subblocks of a bi-infinite Hankel matrix H _{ f }. An easy way to define such subblocks is using a basis \(\mathcal{B}= (\mathcal{P},\mathcal{S})\), where \(\mathcal{P}\subseteq\varSigma^{\star}\) is a set of prefixes and \(\mathcal{S}\subseteq \varSigma^{\star}\) a set of suffixes. We write \(p = |\mathcal{P}|\) and \(s = |\mathcal{S}|\). The subblock of H _{ f } defined by \(\mathcal{B}\) is the p×s matrix \(\mathbf {H}_{\mathcal{B}}\in \mathbb{R}^{\mathcal{P}\times\mathcal{S}}\) with \(\mathbf {H}_{\mathcal{B}}(u,v) = \mathbf {H}_{f}(u,v) = f(uv)\) for any \(u \in\mathcal{P}\) and \(v \in\mathcal{S}\). We may just write H if the basis \(\mathcal{B}\) is arbitrary or obvious from the context.
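As an illustration of the definition above, the following Python sketch builds the subblock \(\mathbf {H}_{\mathcal{B}}\) for a basis given as two lists of strings. The helper name `hankel_block` and the example function (the number of occurrences of the symbol a in a string) are illustrative assumptions rather than objects used elsewhere in the paper; the empty string λ is represented by `""`.

```python
import numpy as np

def hankel_block(f, prefixes, suffixes):
    """Return the |P| x |S| subblock with entries H(u, v) = f(uv)."""
    return np.array([[f(u + v) for v in suffixes] for u in prefixes], dtype=float)

# Hypothetical example function: f(x) = number of occurrences of 'a' in x.
f = lambda x: x.count("a")

prefixes = ["", "a", "b", "ab"]   # "" plays the role of the empty string lambda
suffixes = ["", "a", "b", "ab"]
H = hankel_block(f, prefixes, suffixes)

print(H)
print("rank of this subblock:", np.linalg.matrix_rank(H))
```

For this particular f, the value f(uv) is the count in u plus the count in v, so every subblock is a sum of two rank-one matrices and has rank at most 2.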
The rank of a function \(f : \varSigma^{\star}\rightarrow\mathbb {R}\) is defined as the rank of its Hankel matrix: \(\operatorname{rank}(f) = \operatorname {rank}( \mathbf {H}_{f})\). The rank of a subblock of H _{ f } cannot exceed \(\operatorname {rank}(f)\), and we will be especially interested in subblocks with full rank. We say that a basis \(\mathcal{B}= (\mathcal{P},\mathcal{S})\) is complete for f if the subblock \(\mathbf {H}_{\mathcal{B}}\) has full rank: \(\operatorname{rank}(\mathbf {H}_{\mathcal{B}}) = \operatorname{rank}( \mathbf {H}_{f})\). In this case we say that \(\mathbf {H}_{\mathcal{B}}\) is a complete subblock of H _{ f }. It turns out that the rank of f is related to the number of states needed to compute f with a weighted automaton, and that the prefix-closure of a complete subblock of H _{ f } contains enough information to compute this automaton. These two results will provide the basis for the learning algorithm presented in Sect. 4.
2.2 Weighted finite automata
Theorem 1
(Carlyle and Paz 1971; Fliess 1974)
A function \(f : \varSigma^{\star}\rightarrow\mathbb{R}\) can be defined by a WFA iff \(\operatorname{rank}( \mathbf {H}_{f})\) is finite, and in that case \(\operatorname{rank}( \mathbf {H}_{f})\) is the minimal number of states of any WFA A such that f=f _{ A }.
In view of this result, we will say that A is minimal for f if f _{ A }=f and \(|A| = \operatorname{rank}(f)\), where |A| denotes the number of states of A.
Another useful fact about WFA is their invariance under change of basis. It follows from the definition of f _{ A } that if \(\mathbf {M} \in\mathbb{R}^{n \times n}\) is an invertible matrix, then the WFA B=〈M ^{⊤} α _{1},M ^{−1} α _{∞},{M ^{−1} A _{ σ } M}〉 satisfies f _{ B }=f _{ A }. Sometimes B will be denoted by M ^{−1} A M. This fact will prove very useful when we consider the problem of learning a WFA realizing a certain function.
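The invariance under change of basis can be checked numerically. The sketch below assumes the usual algebraic definition in which a WFA A=〈α _{1},α _{∞},{A _{ σ }}〉 computes \(f_{A}(x_{1} \cdots x_{t}) = \boldsymbol{\alpha}_{1}^{\top} \mathbf {A}_{x_{1}} \cdots \mathbf {A}_{x_{t}} \boldsymbol{\alpha}_{\infty}\); the two-state automaton, the matrix M, and the function names are hypothetical choices made only for illustration.

```python
import numpy as np

def wfa_value(alpha1, alpha_inf, ops, x):
    """Evaluate f_A(x) = alpha1^T A_{x_1} ... A_{x_t} alpha_inf."""
    state = alpha1
    for symbol in x:
        state = state @ ops[symbol]
    return float(state @ alpha_inf)

def change_basis(alpha1, alpha_inf, ops, M):
    """Return B = <M^T alpha1, M^{-1} alpha_inf, {M^{-1} A_sigma M}>."""
    M_inv = np.linalg.inv(M)
    new_ops = {s: M_inv @ A @ M for s, A in ops.items()}
    return M.T @ alpha1, M_inv @ alpha_inf, new_ops

# Hypothetical 2-state WFA over the alphabet {a, b}.
alpha1 = np.array([1.0, 0.0])
alpha_inf = np.array([0.2, 0.4])
ops = {"a": np.array([[0.3, 0.1], [0.2, 0.1]]),
       "b": np.array([[0.1, 0.3], [0.1, 0.2]])}

M = np.array([[2.0, 1.0], [1.0, 1.0]])   # any invertible matrix would do
B = change_basis(alpha1, alpha_inf, ops, M)

# Both automata should assign the same value to every string.
for x in ["", "a", "ab", "bba"]:
    print(repr(x), wfa_value(alpha1, alpha_inf, ops, x), wfa_value(*B, x))
```

The factors M and M ^{−1} cancel between consecutive operators, which is why the two columns of printed values agree up to rounding.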
Weighted automata are related to other finite state computational models. In particular, WFA can also be defined more generally over an arbitrary semiring instead of the field of real numbers, in which case they are sometimes called multiplicity automata (MA) (e.g. Beimel et al. 2000). It is well known that using weights over an arbitrary semiring yields more computational power. However, in this paper we will only consider WFA with real weights. It is easy to see that several other models of automata (DFA, PDFA, PNFA) can be cast as special cases of WFA.
2.2.1 Example
3 Observables in stochastic weighted automata
The previous section introduced the class of WFA in a general setting. As we will see in the next section, in order to learn an automaton realizing (an approximation of) a function \(f : \varSigma^{\star}\rightarrow\mathbb {R}\) using a spectral algorithm, we will need to compute (an estimate of) a subblock of the Hankel matrix H _{ f }. In general such subblocks may be hard to obtain. However, in the case when f computes a probability distribution over Σ ^{⋆} and we have access to a sample of i.i.d. examples from this distribution, estimates of subblocks of H _{ f } can be obtained efficiently. In this section we discuss some properties of WFA which realize probability distributions. In particular, we are interested in showing how different kinds of statistics that can be computed from a sample of strings induce functions on Σ ^{⋆} realized by similar WFA.
We say that a WFA A is stochastic if the function f=f _{ A } is a probability distribution over Σ ^{⋆}. That is, if f(x)≥0 for all x∈Σ ^{⋆} and \(\sum_{x \in\varSigma^{\star}} f(x) = 1\). To make it clear that f represents a probability distribution we may sometimes write it as \(f(x) = \mathbb{P}[x]\).
Note that when f realizes a distribution over Σ ^{⋆}, one can think of computing other probabilistic quantities besides probabilities of strings \(\mathbb{P}[x]\). For example, one can define the function f _{p} that computes probabilities of prefixes; that is, \(f_{\mathrm{p}}(x) = \mathbb{P}[x \varSigma^{\star}]\). Another probabilistic function that can be computed from a distribution over Σ ^{⋆} is the expected number of times a particular string appears as a substring of random strings; we use f _{s} to denote this function. More formally, given two strings w,x∈Σ ^{⋆} let \(|w|_{x}\) denote the number of times that x appears in w as a substring. Then we can write \(f_{\mathrm{s}}(x) = \mathbb{E}[|w|_{x}]\), where the expectation is with respect to w sampled from f: \(\mathbb{E}[|w|_{x}] = \sum_{w \in \varSigma^{\star}} |w|_{x} \mathbb{P}[w]\).
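To make the three kinds of statistics concrete, the sketch below estimates f, f _{p}, and f _{s} at a single string from a small sample. The function names and the toy sample are hypothetical, and two conventions are assumed: occurrences of x in w may overlap, and the empty string is counted |w|+1 times in w (once per position).

```python
def count_occurrences(w, x):
    """Number of (possibly overlapping) occurrences of x in w, i.e. |w|_x."""
    if x == "":
        return len(w) + 1          # assumed convention for the empty string
    return sum(1 for i in range(len(w) - len(x) + 1) if w[i:i + len(x)] == x)

def empirical_stats(sample, x):
    """Empirical estimates of f(x), f_p(x) and f_s(x) from a list of strings."""
    m = len(sample)
    f_hat = sum(w == x for w in sample) / m                    # string statistic
    fp_hat = sum(w.startswith(x) for w in sample) / m          # prefix statistic
    fs_hat = sum(count_occurrences(w, x) for w in sample) / m  # substring statistic
    return f_hat, fp_hat, fs_hat

# Hypothetical toy sample over the alphabet {a, b}.
sample = ["ab", "aab", "b", "aba", ""]
print(empirical_stats(sample, "ab"))
print(empirical_stats(sample, "a"))
```

Every string in the sample contributes to the substring estimates of all of its substrings, which gives an informal sense of why substring statistics make fuller use of a sample than string or prefix statistics.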
In general the class of stochastic WFA may include some pathological examples with states that are not connected to any terminating state. In order to avoid such cases we introduce the following technical condition. Given a stochastic WFA A=〈α _{1},α _{∞},{A _{ σ }}〉 let \(\mathbf{A} = \sum_{\sigma\in\varSigma} \mathbf{A}_{\sigma}\) denote the sum of its transition matrices. We say that the automaton A is irredundant if \(\|\mathbf{A}\| < 1\) for some submultiplicative matrix norm ∥⋅∥. Note that a necessary condition for this to happen is that the spectral radius of \(\mathbf{A}\) is less than one: \(\rho(\mathbf{A}) < 1\). In particular, irredundancy implies that the sum \(\sum_{k \geq 0} \mathbf{A}^{k}\) converges to \((\mathbf{I} - \mathbf{A})^{-1}\). An interesting property of irredundant stochastic WFA is that both f _{p} and f _{s} can also be computed by WFA as shown by the following result.
Lemma 1
1. A=〈α _{1},α _{∞},{A _{ σ }}〉 realizes f,
2. \(A_{\mathrm{p}} = \langle\boldsymbol{\alpha}_{1}, {\tilde {\boldsymbol{\alpha}}_{\infty}},\{ \mathbf {A}_{\sigma}\} \rangle\) realizes f _{p},
3. \(A_{\mathrm{s}} = \langle{\tilde{\boldsymbol{\alpha }}_{1}},{\tilde{\boldsymbol{\alpha}}_{\infty}}, \{ \mathbf {A}_{\sigma}\}\rangle\) realizes f _{s}.
Proof
A direct consequence of this constructive result is that given a WFA realizing a probability distribution \(\mathbb{P}[x]\) we can easily compute WFA realizing the functions f _{p} and f _{s}; and the converse holds as well. Lemma 1 also implies the following result, which characterizes the rank of f _{p} and f _{s}.
Corollary 1
Suppose \(f : \varSigma^{\star}\rightarrow\mathbb{R}\) is stochastic and admits a minimal irredundant WFA. Then \(\operatorname{rank}(f) = \operatorname {rank}(f_{\mathrm{p}}) = \operatorname{rank} (f_{\mathrm{s}})\).
Proof
Since all the constructions of Lemma 1 preserve the number of states, the result follows from considering minimal WFA for f, f _{p}, and f _{s}. □
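The constructions of Lemma 1 can be sketched numerically. The code below assumes that the modified weights are \(\tilde{\boldsymbol{\alpha}}_{1}^{\top} = \boldsymbol{\alpha}_{1}^{\top} (\mathbf{I} - \mathbf{A})^{-1}\) and \(\tilde{\boldsymbol{\alpha}}_{\infty} = (\mathbf{I} - \mathbf{A})^{-1} \boldsymbol{\alpha}_{\infty}\), which is what one obtains by summing f over all continuations of a string and using the convergence of \(\sum_{k \geq 0} \mathbf{A}^{k}\) noted above. The two-state automaton is a hypothetical example, and the checks use only identities that follow from the definitions of f _{p} and f _{s}.

```python
import numpy as np

def wfa_value(alpha1, alpha_inf, ops, x):
    """Evaluate alpha1^T A_{x_1} ... A_{x_t} alpha_inf."""
    state = alpha1
    for symbol in x:
        state = state @ ops[symbol]
    return float(state @ alpha_inf)

# Hypothetical stochastic 2-state WFA over {a, b}.
alpha1 = np.array([1.0, 0.0])
alpha_inf = np.array([0.2, 0.4])
ops = {"a": np.array([[0.3, 0.1], [0.2, 0.1]]),
       "b": np.array([[0.1, 0.3], [0.1, 0.2]])}

A = ops["a"] + ops["b"]
assert np.linalg.norm(A, ord=np.inf) < 1        # irredundancy, via the max-row-sum norm

inv = np.linalg.inv(np.eye(2) - A)              # (I - A)^{-1}
alpha1_tilde = alpha1 @ inv                     # assumed form of tilde alpha_1
alpha_inf_tilde = inv @ alpha_inf               # assumed form of tilde alpha_inf

f = lambda x: wfa_value(alpha1, alpha_inf, ops, x)               # string probabilities
f_p = lambda x: wfa_value(alpha1, alpha_inf_tilde, ops, x)       # prefix probabilities
f_s = lambda x: wfa_value(alpha1_tilde, alpha_inf_tilde, ops, x) # expected substring counts

print("f_p(lambda) =", f_p(""))                 # total probability mass
print("f_p(a) =", f_p("a"), " f(a) + f_p(aa) + f_p(ab) =", f("a") + f_p("aa") + f_p("ab"))
print("f_s(a) =", f_s("a"), " f_p(a) + f_s(aa) + f_s(ba) =", f_p("a") + f_s("aa") + f_s("ba"))
```

All three automata share the same operators A _{ σ } and the same number of states, which is the fact used in the proof of Corollary 1.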
From the point of view of learning, Lemma 1 provides us with tools for proving two-sided reductions between the problems of learning f, f _{p}, and f _{s}. Since for all these problems the corresponding empirical Hankel matrices can be easily computed, this implies that for each particular task we can use the statistics which better suit its needs. For example, if we are interested in learning a model that predicts the next symbol in a string we might learn the function f _{p}. On the other hand, if we want to predict missing symbols in the middle of a string we might learn the distribution f itself. Using Lemma 1 we see that both could be learned from substring statistics.
4 Duality, spectral learning, and forward-backward decompositions
In this section we give a derivation of the spectral learning algorithm. Our approach follows from a duality result between minimal WFA and factorizations of Hankel matrices. We begin by presenting this duality result and some of its consequences. Afterwards we proceed to describe the spectral method, which is just an efficient implementation of the arguments used in the proof of the duality result. Finally we give an interpretation of this method from the point of view of forward and backward recursions in finite automata. This provides extra intuitions about the method and stresses the role played by factorizations in its derivation.
4.1 Duality and minimal weighted automata
Let f be a real function on strings and H _{ f } its Hankel matrix. In this section we consider factorizations of H _{ f } and minimal WFA for f. We will show that there exists an interesting relation between these two concepts. This relation will motivate the algorithm presented in the next section, which factorizes a (subblock of a) Hankel matrix in order to learn a WFA for some unknown function.
Our initial observation is that a WFA A=〈α _{1},α _{∞},{A _{ σ }}〉 for f with n states induces a factorization of H _{ f }. Let \(\mathbf {P} \in\mathbb{R}^{\varSigma^{\star}\times n}\) be a matrix whose uth row equals \(\boldsymbol{\alpha}_{1}^{\top} \mathbf {A}_{u}\) for any u∈Σ ^{⋆}. Furthermore, let \(\mathbf {S} \in\mathbb{R}^{n \times\varSigma^{\star}}\) be a matrix whose columns are of the form A _{ v } α _{∞} for all v∈Σ ^{⋆}. It is trivial to check that one has H _{ f }=PS. The same happens for subblocks: if \(\mathbf {H}_{\mathcal{B}}\) is a subblock of H _{ f } defined over an arbitrary basis \(\mathcal{B}= (\mathcal{P},\mathcal{S})\), then the corresponding restrictions \(\mathbf {P}_{\mathcal{B}}\in\mathbb{R}^{\mathcal{P}\times n}\) and \(\mathbf {S}_{\mathcal{B}}\in\mathbb{R}^{n \times \mathcal{S}}\) of P and S induce the factorization \(\mathbf {H}_{\mathcal{B}}= \mathbf {P}_{\mathcal{B}} \mathbf {S}_{\mathcal{B}}\). Furthermore, if H _{ σ } is a subblock of the matrix \(\mathbf {H}_{\mathcal{B} ^{\prime}}\) corresponding to the prefix-closure of \(\mathbf {H}_{\mathcal{B}}\), then we also have the factorization \(\mathbf {H}_{\sigma}= \mathbf {P}_{\mathcal{B}} \mathbf {A}_{\sigma} \mathbf {S}_{\mathcal{B}}\).
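The two factorizations above can be verified on a small case. In the following sketch the two-state automaton and the basis are hypothetical choices; the code builds \(\mathbf {P}_{\mathcal{B}}\) and \(\mathbf {S}_{\mathcal{B}}\) row by row and column by column and compares their products with Hankel subblocks filled in directly from values of f.

```python
import numpy as np
from functools import reduce

# Hypothetical 2-state WFA over {a, b}.
alpha1 = np.array([1.0, 0.0])
alpha_inf = np.array([0.2, 0.4])
ops = {"a": np.array([[0.3, 0.1], [0.2, 0.1]]),
       "b": np.array([[0.1, 0.3], [0.1, 0.2]])}
n = 2

def word_op(x):
    """A_x = A_{x_1} ... A_{x_t}; the empty string gives the identity."""
    return reduce(lambda M, s: M @ ops[s], x, np.eye(n))

def f(x):
    return float(alpha1 @ word_op(x) @ alpha_inf)

prefixes, suffixes = ["", "a", "b"], ["", "a", "ab"]

P = np.array([alpha1 @ word_op(u) for u in prefixes])        # |P| x n, row u is alpha1^T A_u
S = np.array([word_op(v) @ alpha_inf for v in suffixes]).T   # n x |S|, column v is A_v alpha_inf

H = np.array([[f(u + v) for v in suffixes] for u in prefixes])          # entries f(uv)
H_a = np.array([[f(u + "a" + v) for v in suffixes] for u in prefixes])  # entries f(u a v)

print("H   == P S     :", np.allclose(H, P @ S))
print("H_a == P A_a S :", np.allclose(H_a, P @ ops["a"] @ S))
```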
An interesting consequence of this construction is that if A is minimal for f (i.e. \(n = \operatorname{rank}(f)\)) then the factorization H _{ f }=PS is in fact a rank factorization. Since in general \(\operatorname{rank}( \mathbf {H}_{\mathcal{B}}) \leq n\), in this case the factorization \(\mathbf {H}_{\mathcal{B}} = \mathbf {P}_{\mathcal{B}}\mathbf {S}_{\mathcal{B}}\) is a rank factorization if and only if \(\mathbf {H}_{\mathcal{B}}\) is a complete subblock. Thus, we see that a minimal WFA that realizes a function f induces a rank factorization on any complete subblock of H _{ f }. The converse is even more interesting: given a rank factorization of a complete subblock of H _{ f }, one can compute a minimal WFA for f.
Let H be a complete subblock of H _{ f } defined by the basis \(\mathcal{B}= (\mathcal{P},\mathcal{S})\) and let H _{ σ } denote the subblock of the prefix-closure of H corresponding to the basis \((\mathcal{P}\sigma, \mathcal{S})\). Let \(\mathbf {h}_{\mathcal {P},\lambda }\in\mathbb{R}^{\mathcal{P}}\) denote the p-dimensional vector with coordinates \(\mathbf {h}_{\mathcal {P},\lambda }(u) = f(u)\), and \(\mathbf {h}_{\lambda ,\mathcal {S}}\in \mathbb{R}^{\mathcal{S}}\) the s-dimensional vector with coordinates \(\mathbf {h}_{\lambda ,\mathcal {S}}(v) = f(v)\). Now we can state our result.
Lemma 2
If H=PS is a rank factorization, then the WFA A=〈α _{1},α _{∞},{A _{ σ }}〉 with \(\boldsymbol{\alpha}_{1}^{\top}= \mathbf {h}_{\lambda ,\mathcal {S}}^{\top} \mathbf {S}^{+}\), \(\boldsymbol {\alpha }_{\infty }= \mathbf {P}^{+} \mathbf {h}_{\mathcal {P},\lambda }\), and A _{ σ }=P ^{+} H _{ σ } S ^{+}, is minimal for f.
Proof
Let \(A^{\prime}= \langle \boldsymbol {\alpha }_{1}', \boldsymbol {\alpha }_{\infty }', \{\mathbf {A}'_{\sigma}\} \rangle \) be a minimal WFA for f that induces a rank factorization H=P′S′. It suffices to show that there exists an invertible M such that M ^{−1} A′M=A. Define M=S′S ^{+} and note that P ^{+} P′S′S ^{+}=P ^{+} HS ^{+}=I implies that M is invertible with M ^{−1}=P ^{+} P′. Now we check that the operators of A correspond to the operators of A′ under this change of basis. First we see that \(\mathbf {A}_{\sigma}= \mathbf {P}^{+} \mathbf {H}_{\sigma} \mathbf {S}^{+} = \mathbf {P}^{+} \mathbf {P}^{\prime} \mathbf {A}_{\sigma}^{\prime} \mathbf {S}^{\prime} \mathbf {S}^{+} = \mathbf {M}^{-1} \mathbf {A}_{\sigma}^{\prime} \mathbf {M}\). Now observe that by the construction of S′ and P′ we have \({\boldsymbol{\alpha}_{1}^{\prime}}^{\top} \mathbf {S}^{\prime}= \mathbf {h}_{\lambda ,\mathcal {S}}^{\top}\), and \(\mathbf {P}^{\prime} \boldsymbol {\alpha }_{\infty }^{\prime}= \mathbf {h}_{\mathcal {P},\lambda }\). Thus, it follows that \(\boldsymbol{\alpha}_{1}^{\top}= {\boldsymbol {\alpha}_{1}^{\prime}}^{\top} \mathbf {M}\) and \(\boldsymbol {\alpha }_{\infty }= \mathbf {M}^{-1} \boldsymbol {\alpha }_{\infty }^{\prime}\). □
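Lemma 2 can be tried out with any rank factorization, not only one coming from an automaton. In the sketch below a hypothetical two-state target is used only to generate values of f; the rank factorization is obtained by taking two linearly independent columns of H as P and their coefficients as S, which is one possible choice among many. The recovered automaton is then compared with f on strings outside the basis.

```python
import numpy as np
from functools import reduce

# Hypothetical target WFA over {a, b}, used only as an oracle for f.
alpha1 = np.array([1.0, 0.0])
alpha_inf = np.array([0.2, 0.4])
ops = {"a": np.array([[0.3, 0.1], [0.2, 0.1]]),
       "b": np.array([[0.1, 0.3], [0.1, 0.2]])}

def f(x):
    A_x = reduce(lambda M, s: M @ ops[s], x, np.eye(2))
    return float(alpha1 @ A_x @ alpha_inf)

def block(prefixes, suffixes, mid=""):
    """Subblock with entries f(u mid v); mid = "" gives H, mid = sigma gives H_sigma."""
    return np.array([[f(u + mid + v) for v in suffixes] for u in prefixes])

basis_P, basis_S = ["", "a", "b"], ["", "a", "b"]
H = block(basis_P, basis_S)
H_sigma = {s: block(basis_P, basis_S, s) for s in "ab"}
h_P = np.array([f(u) for u in basis_P])     # h_{P,lambda}
h_S = np.array([f(v) for v in basis_S])     # h_{lambda,S}

# One possible rank factorization H = P S (H has rank 2 for this target).
P = H[:, :2]                                # two independent columns of H
S = np.linalg.pinv(P) @ H                   # coefficients of all columns in terms of P
assert np.allclose(P @ S, H)

# Formulas of Lemma 2.
P_pinv, S_pinv = np.linalg.pinv(P), np.linalg.pinv(S)
new_alpha1 = h_S @ S_pinv
new_alpha_inf = P_pinv @ h_P
new_ops = {s: P_pinv @ H_sigma[s] @ S_pinv for s in "ab"}

def g(x):
    A_x = reduce(lambda M, s: M @ new_ops[s], x, np.eye(2))
    return float(new_alpha1 @ A_x @ new_alpha_inf)

for x in ["", "abba", "bbbab"]:
    print(repr(x), f(x), g(x))
```

The recovered operators generally differ from those of the target, since they are only determined up to a change of basis, but the two automata compute the same function.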
This result shows that there exists a duality between rank factorizations of complete subblocks of H _{ f } and minimal WFA for f. A consequence of this duality is that all minimal WFA for a function f are related via some change of basis. In other words, modulo change of basis, there exists a unique minimal WFA for any function f of finite rank.
Corollary 2
Let A=〈α _{1},α _{∞},{A _{ σ }}〉 and \(A^{\prime}= \langle \boldsymbol {\alpha }_{1}', \boldsymbol {\alpha }_{\infty }', \{\mathbf {A}'_{\sigma}\} \rangle \) be minimal WFA for some f of rank n. Then there exists an invertible matrix \(\mathbf {M} \in\mathbb{R}^{n \times n}\) such that A=M ^{−1} A′M.
Proof
Suppose that H _{ f }=PS=P′S′ are the rank factorizations induced by A and A′ respectively. Then, by the same arguments used in Lemma 2, the matrix M=S′S ^{+} is invertible and satisfies the equation A=M ^{−1} A′M. □
4.2 A spectral learning algorithm
The spectral method is basically an efficient algorithm that implements the ideas in the proof of Lemma 2 to find a rank factorization of a complete subblock H of H _{ f } and obtain from it a minimal WFA for f. The term spectral comes from the fact that it uses SVD, a type of spectral decomposition. We describe the algorithm in detail in this section and give a complete set of experiments that explores the practical behavior of this method in Sect. 5.
Suppose \(f : \varSigma^{\star}\rightarrow\mathbb{R}\) is an unknown function of finite rank n and we want to compute a minimal WFA for it. Let us assume that we know that \(\mathcal{B}= (\mathcal{P},\mathcal {S})\) is a complete basis for f. Our algorithm receives as input: the basis \(\mathcal{B}\) and the values of f on a set of strings \(\mathcal{W}\). In particular, we assume that \(\mathcal{P} \varSigma^{\prime}\mathcal{S}\cup\mathcal{P}\cup\mathcal{S}\subseteq\mathcal{W}\), where Σ′=Σ∪{λ}. It is clear that using these values of f the algorithm can compute subblocks H _{ σ } of H _{ f } for σ∈Σ′; in particular, H _{ λ } is the subblock H defined by the basis itself. Furthermore, it can compute the vectors \(\mathbf {h}_{\lambda ,\mathcal {S}}\) and \(\mathbf {h}_{\mathcal {P},\lambda }\). Thus, the algorithm only needs a rank factorization of H _{ λ } to be able to apply the formulas given in Lemma 2.
Recall that the compact SVD of a p×s matrix H _{ λ } of rank n is given by the expression \(\mathbf {H}_{\lambda}= \mathbf {U} \mathbf {\varLambda} \mathbf {V}^{\top}\), where \(\mathbf {U} \in\mathbb{R}^{p \times n}\) and \(\mathbf {V} \in\mathbb{R} ^{s \times n}\) are matrices with orthonormal columns, and \(\mathbf {\varLambda} \in\mathbb{R}^{n \times n}\) is a diagonal matrix containing the singular values of H _{ λ }. The most interesting property of compact SVD for our purposes is that \(\mathbf {H}_{\lambda}= (\mathbf {U} \mathbf {\varLambda}) \mathbf {V}^{\top}\) is a rank factorization. We will use this factorization in the algorithm, but write it in a different way. Note that since V has orthonormal columns we have V ^{⊤} V=I, and in particular V ^{+}=V ^{⊤}. Thus, the factorization above is equivalent to H _{ λ }=(H _{ λ } V)V ^{⊤}.
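Substituting P=H _{ λ } V and S=V ^{⊤} into the formulas of Lemma 2, and using S ^{+}=V, gives \(\boldsymbol{\alpha}_{1}^{\top} = \mathbf {h}_{\lambda ,\mathcal {S}}^{\top} \mathbf {V}\), \(\boldsymbol{\alpha}_{\infty} = (\mathbf {H}_{\lambda} \mathbf {V})^{+} \mathbf {h}_{\mathcal {P},\lambda }\), and \(\mathbf {A}_{\sigma} = (\mathbf {H}_{\lambda} \mathbf {V})^{+} \mathbf {H}_{\sigma} \mathbf {V}\). The following NumPy sketch implements these three formulas under that reading; the function names, the hypothetical two-state target used to generate exact values of f, and the choice of basis are assumptions made for illustration.

```python
import numpy as np
from functools import reduce

def spectral_wfa(H, H_sigma, h_P, h_S, n):
    """Sketch: recover an n-state WFA from Hankel subblocks via a rank-n SVD of H."""
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    V = Vt[:n].T                        # s x n, top-n right singular vectors
    HV_pinv = np.linalg.pinv(H @ V)     # (H V)^+, of size n x p
    alpha1 = h_S @ V                    # alpha_1^T = h_{lambda,S}^T V
    alpha_inf = HV_pinv @ h_P           # alpha_inf = (H V)^+ h_{P,lambda}
    ops = {s: HV_pinv @ Hs @ V for s, Hs in H_sigma.items()}
    return alpha1, alpha_inf, ops

def evaluate(alpha1, alpha_inf, ops, x):
    A_x = reduce(lambda M, s: M @ ops[s], x, np.eye(len(alpha1)))
    return float(alpha1 @ A_x @ alpha_inf)

# Hypothetical 2-state target over {a, b}, used only as an oracle for f.
target = (np.array([1.0, 0.0]), np.array([0.2, 0.4]),
          {"a": np.array([[0.3, 0.1], [0.2, 0.1]]),
           "b": np.array([[0.1, 0.3], [0.1, 0.2]])})
f = lambda x: evaluate(*target, x)

P, S = ["", "a", "b"], ["", "a", "b"]       # a basis that is complete for this target
H = np.array([[f(u + v) for v in S] for u in P])
H_sigma = {s: np.array([[f(u + s + v) for v in S] for u in P]) for s in "ab"}
h_P = np.array([f(u) for u in P])
h_S = np.array([f(v) for v in S])

learned = spectral_wfa(H, H_sigma, h_P, h_S, n=2)
for x in ["", "ab", "bbab"]:
    print(repr(x), f(x), evaluate(*learned, x))
```

With exact Hankel values the learned automaton reproduces f up to rounding; with estimated values the same code applies unchanged, and the quality of the output then depends on the accuracy of the estimates.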
4.2.1 Sample complexity of spectral learning
The spectral algorithm we just described can be used even when H and H _{ σ } are not known exactly, but approximations \(\hat{ \mathbf {H}}\) and \(\hat{ \mathbf {H}}_{\sigma}\) are available. In this context, an approximation means that we have an estimate for each entry in these matrices; that is, we know an estimate of f for every string in \(\mathcal{W}\). A different concept of approximation could be that one knows f exactly on some, but not all, strings in \(\mathcal{W}\). In this context, one can still apply the spectral method after a preliminary matrix completion step; see Balle and Mohri (2012) for details. When the goal is to learn a probability distribution over strings (or prefixes, or substrings) we are always in the first of these two settings. In these cases we can apply the spectral algorithm directly using empirical estimates \(\hat{ \mathbf {H}}\) and \(\hat{ \mathbf {H}}_{\sigma}\). A natural question is then how close to f is the approximate function \(\hat{f}\) computed by the learned automaton \(\hat{A}\). Experiments described in the following sections explore this question from an empirical perspective and compare the performance of spectral learning with other approaches. Here we give a very brief outline of what is known about the sample complexity of spectral learning. Since an in-depth discussion of these results and the techniques used in their proofs is outside the scope of this paper, for further details we refer the reader to papers where these bounds were originally presented (Hsu et al. 2009; Bailly et al. 2009; Siddiqi et al. 2010; Bailly 2011; Balle 2013).
All known results about learning stochastic WFA with spectral methods fall into the well-known PAC-learning framework (Valiant 1984; Kearns et al. 1994). In particular, assuming that a large enough sample of i.i.d. strings drawn from some distribution f over Σ ^{⋆} realized by a WFA is given to the spectral learning algorithm, we know that with high probability the output WFA computes a function \(\hat{f}\) that is close to f. Sample bounds in this type of result usually depend polynomially on the usual PAC parameters (accuracy ε and confidence δ) as well as other parameters depending on the target f: the size of the alphabet Σ, the number of states n of a minimal WFA realizing f, the size of the basis \(\mathcal{B}\), and the smallest singular values of H and other related matrices.
These results come in different flavors, depending on what assumptions are made on the automaton computing f and which criterion is used to measure how close \(\hat{f}\) is to f. When f can be realized by a Hidden Markov Model (HMM), Hsu et al. (2009) proved a PAC-learning result under the L_{1} distance restricted to strings in Σ ^{ t } for some t≥0; their bound depends polynomially on t. A similar result was obtained in Siddiqi et al. (2010) for Reduced Rank HMM. For targets f computed by a general stochastic WFA, Bailly et al. (2009) gave similar results under the milder L_{∞} distance. When f can be computed by a Quadratic WFA one can obtain L_{1} bounds over all Σ ^{⋆}; see Bailly (2011). The case where the function can be computed by a Probabilistic WFA was analyzed in Balle (2013), where L_{1} bounds over strings in Σ ^{≤t } are given. It is important to note that, with the exception of Bailly (2011), none of these methods is guaranteed to return a stochastic WFA. That is, though the hypothesis \(\hat{f}\) is close to a probability distribution in L_{1} distance, it does not necessarily assign a non-negative number to each string, much less add up to one when summed over all strings, though both properties are satisfied in the limit. In practice this is a problem when trying to evaluate these methods using perplexity-like accuracy measures. We do not face this difficulty in our experiments because we use WER-like accuracy measures. See the discussion in Sect. 8 for pointers to some attempts to solve this problem.
1. Convergence of empirical estimates \(\hat{ \mathbf {H}}\) and \(\hat{\mathbf {H}}_{\sigma}\) to their true values at a rate of O(m ^{−1/2}) in terms of Frobenius norms; here m is the sample size.
2. Stability of linear algebra operations (SVD, pseudoinverse and matrix multiplication) under small perturbations. This implies that when the errors in empirical Hankel matrices are small, we get operators \(\hat {\boldsymbol{\alpha}}_{1}\), \(\hat{\boldsymbol{\alpha}}_{\infty}\), and \(\hat{ \mathbf {A}}_{\sigma}\) which are close to their true values, modulo a change of basis.
3. Mild aggregation of errors when computing \(\sum |f(x) - \hat {f}(x)|\) over large sets of strings.
4.2.2 Choosing the parameters
When run with approximate data \(\hat{ \mathbf {H}}_{\lambda}\), \(\hat{\mathbf {H}}_{\sigma}\) for σ ∈Σ, \(\hat{\mathbf {h}}_{\lambda,\mathcal{S}}\), and \(\hat{\mathbf {h}}_{\mathcal{P},\lambda}\), the algorithm also receives as input the number of states n of the target WFA. That is because the rank of \(\hat{ \mathbf {H}}_{\lambda}\) may be different from the rank of H _{ λ } due to the noise, and in this case the algorithm may need to ignore some of the smallest singular values of \(\hat{ \mathbf {H}}_{\lambda}\), which just correspond to zeros in the original matrix that have been corrupted by noise. This is done by just computing a truncated SVD of \(\hat{\mathbf {H}}_{\lambda}\) up to dimension n; we note that the cost of this computation is the same as the computation of a compact SVD on a matrix of rank n. It was shown in Bailly (2011) that when empirical Hankel matrices are sufficiently accurate, inspection of the singular values of \(\hat{ \mathbf {H}}\) can yield accurate estimates of the number of states n in the target. In practice one usually chooses the number of states via some sort of cross-validation procedure. We will get back to this issue in Sect. 5.
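The effect of noise on the singular values can be seen in a small experiment. The sketch below takes an exact 3×3 Hankel subblock of rank 2 (from a hypothetical two-state target, with basis P=S={λ,a,b}), perturbs it with small synthetic noise, and counts the singular values above a threshold. The threshold `tol` and the function names are hypothetical; this is only an illustration of singular value inspection and not a substitute for the cross-validation procedure mentioned above.

```python
import numpy as np

def estimate_rank(H_hat, tol):
    """Count singular values of H_hat above tol (tol is a hypothetical noise threshold)."""
    svals = np.linalg.svd(H_hat, compute_uv=False)
    return int(np.sum(svals > tol)), svals

def truncated_svd(H_hat, n):
    """Keep only the n largest singular values and the corresponding singular vectors."""
    U, svals, Vt = np.linalg.svd(H_hat, full_matrices=False)
    return U[:, :n], svals[:n], Vt[:n].T

# Exact rank-2 subblock of a hypothetical target, rows and columns indexed by lambda, a, b.
H = np.array([[0.20, 0.100, 0.140],
              [0.10, 0.038, 0.052],
              [0.14, 0.034, 0.044]])

rng = np.random.default_rng(0)
H_hat = H + rng.uniform(-1e-4, 1e-4, size=H.shape)   # synthetic estimation noise

n_hat, svals = estimate_rank(H_hat, tol=1e-2)
U, top, V = truncated_svd(H_hat, n_hat)

print("singular values of the noisy block:", svals)
print("estimated number of states:", n_hat)
```

The noisy matrix has full rank 3, but its third singular value is of the order of the noise, well separated from the first two; the gap becomes harder to see as the noise grows relative to the smallest true singular value.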
The other important parameter to choose when using the spectral algorithm is the basis. It is easy to show that for functions of rank n there always exist complete bases with \(|\mathcal{P}| = |\mathcal{S}| = n\). In general there exist infinitely many complete bases, and it is safe to assume in theoretical results that at least one is given to the algorithm. However, choosing a basis in practice turns out to be a complex task. A common choice is a basis of the form \(\mathcal{P}= \mathcal{S}= \varSigma^{\leq k}\) for some k>0 (Hsu et al. 2009; Siddiqi et al. 2010). Another approach is to choose a basis that contains the most frequent elements observed in the sample, which depending on the particular target model can be either strings, prefixes, suffixes, or substrings. This approach is motivated by the theoretical results from Balle et al. (2012). It is shown there that a random sampling strategy will succeed with high probability in finding a complete basis when given a large enough sample. This suggests that including frequent prefixes and suffixes might be a good heuristic. This approach is much faster than the greedy heuristic presented in Wiewiora (2005), which for each prefix added to the basis makes a computation taking exponential time in the number of states n. Other authors suggest using the largest Hankel matrix that can be estimated using the given sample; that is, building a basis that includes every prefix and suffix seen in the sample (Bailly et al. 2009). While the statistical properties of such an estimation remain unclear, this approach becomes computationally infeasible for large samples because in this case the size of the basis grows with the number of examples m. All in all, designing an efficient algorithm for obtaining an optimal sample-dependent basis is an open problem. In our experiments we decided to adopt the simplest sample-dependent strategy: choosing the most frequent prefixes and suffixes in the sample. See Sects. 5 and 7 for details.
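The frequency-based strategy can be sketched as follows. This is only an illustration under stated assumptions: strings are represented as tuples of symbols, the Hankel entry for (p, s) is estimated as the relative frequency of the complete string ps, and the names frequent_basis and empirical_hankel are hypothetical. Ties in frequency are broken by order of first appearance.

```python
from collections import Counter
import numpy as np

def frequent_basis(sample, k):
    """Return the k most frequent prefixes and the k most frequent suffixes."""
    prefixes, suffixes = Counter(), Counter()
    for x in sample:
        for i in range(len(x) + 1):
            prefixes[x[:i]] += 1      # includes the empty prefix and x itself
            suffixes[x[i:]] += 1      # includes x itself and the empty suffix
    P = [p for p, _ in prefixes.most_common(k)]
    S = [s for s, _ in suffixes.most_common(k)]
    return P, S

def empirical_hankel(sample, P, S):
    """Estimate H(p, s) as the relative frequency of the string p + s in the sample."""
    counts = Counter(sample)
    m = len(sample)
    return np.array([[counts[p + s] / m for s in S] for p in P])

# Toy sample of six strings over {a, b}.
sample = [("a", "b"), ("a",), ("a", "b"), (), ("b",), ("a", "b", "b")]
P, S = frequent_basis(sample, k=3)
H_hat = empirical_hankel(sample, P, S)
print(P, S)
print(H_hat)
```

The basis size k stays fixed as the sample grows, which is what keeps this strategy cheap compared with using every observed prefix and suffix.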
4.3 The forward-backward interpretation
 1.
All parameters are nonnegative. That is, for all σ∈Σ and all i,j∈[n]: A _{ σ }(i,j)≥0, α _{1}(i)≥0, and α _{∞}(i)≥0.
 2.
Initial weights add up to one: ∑_{ i∈[n]} α _{1}(i)=1.
 3.
Transition and final weights from each state add up to one. That is, for all i∈[n]: α _{∞}(i)+∑_{ σ∈Σ }∑_{ j∈[n]} A _{ σ }(i,j)=1.
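The three conditions above can be checked mechanically. The sketch below assumes the operators are stored as a dictionary from symbols to n×n NumPy arrays; the function name and the tolerance are hypothetical choices.

```python
import numpy as np

def is_probabilistic_wfa(alpha1, alpha_inf, ops, tol=1e-9):
    """Check nonnegativity, initial weights summing to one, and per-state normalization."""
    mats = list(ops.values())
    nonneg = (np.all(alpha1 >= -tol) and np.all(alpha_inf >= -tol)
              and all(np.all(A >= -tol) for A in mats))
    init_ok = abs(alpha1.sum() - 1.0) <= tol
    # For each state i: final weight plus all outgoing transition weights.
    out = alpha_inf + sum(A.sum(axis=1) for A in mats)
    rows_ok = np.allclose(out, 1.0, rtol=0.0, atol=tol)
    return bool(nonneg and init_ok and rows_ok)

# Hypothetical 2-state example over {a, b}.
alpha1 = np.array([1.0, 0.0])
alpha_inf = np.array([0.2, 0.3])
ops = {"a": np.array([[0.3, 0.2], [0.1, 0.2]]),
       "b": np.array([[0.1, 0.2], [0.3, 0.1]])}
ok = is_probabilistic_wfa(alpha1, alpha_inf, ops)

# Flipping the sign of one weight violates the first condition.
bad_ops = {s: A.copy() for s, A in ops.items()}
bad_ops["a"][0, 0] = -0.3
bad = is_probabilistic_wfa(alpha1, alpha_inf, bad_ops)
print(ok, bad)
```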
It turns out that when a probabilistic WFA A=〈α _{1},α _{∞},{A _{ σ }}〉 is considered, the factorization induced on H has a nice probabilistic interpretation. Analyzing the spectral algorithm from this perspective yields additional insights which are useful to keep in mind.
The same interpretation applies to the factorization induced on a sub-block \(\mathbf {H}_{\mathcal{B}} = \mathbf {P}_{\mathcal{B}}\mathbf {S}_{\mathcal{B}}\). Therefore, assuming there exists a minimal WFA for \(f(x) = \mathbb{P}[x]\) which is probabilistic,^{2} Lemma 2 says that a WFA for f can be learned from information about the forward and backward probabilities over a small set of prefixes and suffixes. Combining this basic observation with the spectral method and invariance under change of basis, one can show an interesting fact: forward and backward (empirical) probabilities for a probabilistic WFA can be recovered (modulo a change of basis) by computing an SVD on (empirical) string probabilities. In other words, though state probabilities are non-observable, they can be recovered (modulo a linear transformation) from observable quantities.
5 Experiments on learning PNFA
In this section we present some experiments that illustrate the behavior of the spectral learning algorithm at learning weighted automata under different configurations. We also present a comparison to alternative methods for learning WFA, namely to baseline unigram and bigram methods, and to an Expectation Maximization algorithm for learning PNFA (Dempster et al. 1977).
As a measure of error, we compute the word error rate (WER) on the validation set. WER measures the error at predicting the symbol that most likely follows a given prefix sequence, or predicting a special stop symbol if the given prefix is most likely to be a complete sequence. If w is a validation sequence of length t, we evaluate t+1 events: one for each symbol w _{ i } given the prefix w _{1:i−1}, and one for the stopping event. Note that each event is independent of the others, and that we always use the correct prefix to condition on. WER is the percentage of errors averaged over all events in the validation set.
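The event counting described above can be made concrete with a short sketch. It assumes a hypothetical scoring function score_next that maps a prefix to a dictionary of scores over symbols and a stop marker; the marker string and the toy predictor are illustrative only, and the validation set is assumed to be nonempty.

```python
def word_error_rate(score_next, validation, stop="<stop>"):
    """Percentage of wrong next-symbol predictions over all t+1 events per sequence."""
    errors = events = 0
    for w in validation:
        for i in range(len(w) + 1):
            prefix = w[:i]                         # always the correct prefix
            gold = w[i] if i < len(w) else stop    # last event is the stopping event
            scores = score_next(prefix)
            predicted = max(scores, key=scores.get)
            errors += predicted != gold
            events += 1
    return 100.0 * errors / events

# Toy predictor that ignores the prefix, as a single-state model would.
def constant_scores(prefix):
    return {"NN": 0.5, "DT": 0.3, "<stop>": 0.2}

validation = [("DT", "NN"), ("NN",)]
wer = word_error_rate(constant_scores, validation)
print(wer)
```

Here the two sequences give five events, and the constant predictor is right only on the two events whose gold symbol is NN.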
We would like to remind the reader that a WFA learned by the spectral method is only guaranteed to realize a probability distribution on Σ ^{∗} when we use an exact complete sub-block of the Hankel matrix of a stochastic function. In experiments, we only have access to a finite sample, and even though the SVD is robust to noise, we in fact observe that the WFA we obtain do not define distributions. Hence, standard evaluation metrics for probabilistic language models such as perplexity are not well defined here, and we prefer to use an error metric such as WER that does not require normalized predictions. We also avoid saying that these WFA compute probabilities over strings, and we will just say they compute scores.
5.1 Methods compared
We now describe the weighted automata we compare, and give some details about how they were estimated and used to make predictions.
Unigram model
A WFA with a single state, that emits symbols according to their frequency in training data. When evaluating WER, this method will always predict the most likely symbol (in our data NN, which stands for singular noun).
Bigram model

α _{1}(λ)=1 and α _{1}(σ)=0 for σ∈Σ

A _{ σ }(i,j)=0 if σ≠j

For each state i, A _{ σ }(i,σ) for all σ and α _{∞}(i) is a distribution estimated from training counts, without smoothing.
EM model
A nondeterministic WFA with n states trained with Expectation Maximization (EM), where n is a parameter of the method. The learning algorithm initializes the WFA randomly, and then it proceeds iteratively by computing expected counts of state transitions on training sequences, and resetting the parameters of the WFA by maximum likelihood given the expected counts. On validation data, we use a special operator \(\tilde{\boldsymbol{\alpha}}_{\infty}= \mathbf {1}\) to compute prefix probabilities, and we use the α _{∞} resulting from EM to compute probabilities of complete sequences.
Spectral model
 Σ Basis:

We consider one prefix/suffix for each symbol in the alphabet, that is \(\mathcal{P}= \mathcal{S}= \varSigma\). This is the setting analyzed by Hsu et al. (2009) in their theoretical work. In this case, the statistics gathered at training to estimate the automaton will correspond to unigram, bigram and trigram statistics.
 Top-k Basis:

In this setting we set the prefixes and suffixes to be frequent subsequences of the training set. In particular, we consider all subsequences of symbols up to length 4, and sort them by frequency in the training set. We then set \(\mathcal{P}\) and \(\mathcal{S}\) to be the most frequent k subsequences, where k is a parameter of the model.
As a final detail, when computing next-symbol predictions with WFA we kept normalizing the state vector. That is, if we are given a prefix sequence w _{1:i } we compute \(\boldsymbol{\alpha}^{i\ \top} A_{\sigma}\tilde{\boldsymbol{\alpha}}_{\infty}\) as the score for symbol σ and α ^{ i ⊤} α _{∞} as the score for stopping, where α ^{ i } is a normalized state vector at position i. It is recursively computed as α ^{1}=α _{1} and \(\boldsymbol{\alpha}^{i+1} = \frac{\boldsymbol{\alpha}^{i\ \top} A_{w_{i}}}{ \boldsymbol{\alpha}^{i\ \top} A_{w_{i}} \tilde{\boldsymbol{\alpha }}_{\infty}}\). This normalization should not change the predictions, since it rescales all scores at a given position by the same factor, but it helps avoid numerical precision problems when validation sequences are relatively long.
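The normalized recursion can be sketched as follows. The function name, the stop marker, and the example automaton are hypothetical; the example uses a probabilistic WFA with \(\tilde{\boldsymbol{\alpha}}_{\infty} = \mathbf{1}\), as in the EM model, and assumes every prefix has a nonzero score so that the division is well defined.

```python
import numpy as np

def next_symbol_scores(alpha1, alpha_inf, alpha_inf_tilde, ops, prefix):
    """Scores for each next symbol and for stopping, with a renormalized state vector."""
    state = alpha1
    for sym in prefix:
        unnorm = state @ ops[sym]
        state = unnorm / (unnorm @ alpha_inf_tilde)   # keeps magnitudes bounded
    scores = {s: float(state @ A @ alpha_inf_tilde) for s, A in ops.items()}
    scores["<stop>"] = float(state @ alpha_inf)
    return scores

# Hypothetical 2-state probabilistic WFA over {a, b}.
alpha1 = np.array([1.0, 0.0])
alpha_inf = np.array([0.2, 0.3])
ops = {"a": np.array([[0.3, 0.2], [0.1, 0.2]]),
       "b": np.array([[0.1, 0.2], [0.3, 0.1]])}
ones = np.ones(2)

scores = next_symbol_scores(alpha1, alpha_inf, ones, ops, ("a", "b"))
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

In this probabilistic example the scores at each position sum to one; for a WFA returned by the spectral method they are only scores, as discussed above.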
5.2 Results
We trained all types of models for the two sets of tags, namely the simplified set of 12 tags and the original tagset of 45 tags. For the simplified set, the unigram model obtained a WER of 69.4 % on validation data and the bigram improved to 66.6 %. For the original tagset, the unigram and bigram WER were 87.2 % and 69.4 %, respectively.
6 Nondeterministic split head-automata grammars
6.1 SHAG
We will use x _{ i:j }=x _{ i } x _{ i+1}⋯x _{ j } to denote a sequence of symbols x _{ t } with i≤t≤j. A SHAG generates sentences s _{0:N }, where symbols s _{ t }∈Σ with 1≤t≤N are regular words and s _{0}=⋆∉Σ is a special root symbol. Let \(\bar{\varSigma} = \varSigma \cup \{\star\}\). A derivation y, i.e. a dependency tree, is a collection of head-modifier sequences 〈h,d,x _{1:T }〉, where \(h \in\bar{\varSigma}\) is a word, d∈{left,right} is a direction, and x _{1:T } is a sequence of T words, where each x _{ t }∈Σ is a modifier of h in direction d. We say that h is the head of each x _{ t }. Modifier sequences x _{1:T } are ordered head-outwards, i.e. among x _{1:T }, x _{1} is the word closest to h in the derived sentence, and x _{ T } is the furthest. A derivation y of a sentence s _{0:N } consists of a left and a right head-modifier sequence for each s _{ t }, i.e. there are always two sequences per symbol in the sentence. As special cases, the left sequence of the root symbol is always empty, while the right one consists of a single word corresponding to the head of the sentence. We denote by \(\mathcal{Y}\) the set of all valid derivations. See Fig. 6 for the head-modifier sequences associated with an example dependency tree.
6.1.1 Probabilistic SHAG
6.2 Learning SHAG
A property of our nondeterministic SHAG models is that the probability of a derivation factors into the probability of each headmodifier sequence. In other words, the state processes only model horizontal structure of the tree, and different WFA do not interact in a derivation. In addition, in this article we make the assumption that training sequences come paired with dependency trees, i.e. we assume a supervised setting. Hence, we do not deal with the hard problem of inducing grammars from sequences.
These two facts make the application of the spectral method for WFA almost trivial. From the training set, we can decompose each dependency tree into sequences of modifiers, and create a training set for each head and direction containing the corresponding sequences of modifiers. Then, for each head and direction, we can learn a WFA by direct application of the spectral method.
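The decomposition step can be sketched as follows. The tree encoding is an assumption made for illustration: each sentence is a list of symbols together with a list giving, for every token, the 1-based index of its head, with 0 denoting the root symbol. The function names are hypothetical.

```python
from collections import defaultdict

def head_modifier_sequences(words, heads, root="*"):
    """Yield (head symbol, direction, modifier tuple) for every token and the root.

    words[t-1] is token t (1-based); heads[t-1] is the index of its head, 0 for the root.
    """
    symbols = [root] + list(words)
    left, right = defaultdict(list), defaultdict(list)
    for t, h in enumerate(heads, start=1):
        (left if t < h else right)[h].append(t)
    out = []
    for h, sym in enumerate(symbols):
        # Head-outwards order: the modifier closest to the head comes first.
        l = tuple(symbols[t] for t in sorted(left[h], reverse=True))
        r = tuple(symbols[t] for t in sorted(right[h]))
        out.append((sym, "left", l))
        out.append((sym, "right", r))
    return out

def training_sets(treebank):
    """Group modifier sequences by (head symbol, direction)."""
    data = defaultdict(list)
    for words, heads in treebank:
        for sym, direction, mods in head_modifier_sequences(words, heads):
            data[(sym, direction)].append(mods)
    return data

# Toy tree: DT -> NN -> VBD -> root.
treebank = [(["DT", "NN", "VBD"], [2, 3, 0])]
data = training_sets(treebank)
print(dict(data))
```

Each entry of data is then an ordinary sample of strings, so one WFA per head and direction can be estimated independently of the others.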
6.3 Parsing with nondeterministic SHAG
6.4 Related work
There have been a number of works that apply spectral learning methods to tree structures. Dhillon et al. (2012) present a latent-variable model for dependency parsing, where the state process models vertical interactions between heads and modifiers, such that hidden states pass information from the root of the tree to each leaf. In their model, given the state of a head, the modifiers are independent of each other. In contrast, in our case the hidden states model interactions between the children of a head, but hidden states do not pass information vertically. In our case the application of the spectral method is straightforward, while the vertical case requires taking into account that at each node the sequence from the root to the node branches out into multiple children.
There have been extensions by Bailly et al. (2010) and Cohen et al. (2012) of the spectral method for probabilistic context-free grammars (PCFG), a formalism that includes SHAG. In this case the state process can model horizontal and vertical interactions simultaneously, by making use of tensor operators associated with the rules of the grammar. Recently, Cohen et al. (2013) have presented experiments to learn phrase-structure models using a spectral method.
The works mentioned so far model a joint distribution over trees of different sizes, which is the suitable setting for tasks like natural language parsing. In contrast, Parikh et al. (2011) presented a spectral method to learn distributions over labelings of a fixed (though arbitrary) tree topology.
In all these cases, the learning setting is supervised, in the sense that training sequences are paired with their tree structure, and the spectral algorithm is used to induce the hidden state process. A more ambitious problem is that of grammatical inference, where the goal is to induce the model only from sequences. Regarding spectral methods, Mossel and Roch (2005) study the induction of the topology of a phylogenetic tree-shaped model, and Hsu et al. (2012) discuss spectral techniques to induce PCFG, with dependency grammars as a special case.
7 Experiments on learning SHAG
In this section we present experiments with SHAG. We learn nondeterministic SHAG using different versions of the spectral algorithm, and compare them to nondeterministic SHAG learned with EM and to some baseline deterministic SHAG.
Our experiments involve fully unlexicalized models, i.e. parsing part-of-speech tag sequences. While this setting falls behind the state of the art, it is nonetheless a valid setting in which to analyze empirically the effect of incorporating hidden states via weighted automata, which results in large improvements. At the end, we present some analysis of the automaton learned by the spectral algorithm to see the information that is captured in the hidden state space.
All the experiments were done with the dependency version of the English WSJ Penn Treebank, using the standard partitions for training and validation (see Sect. 5). The models were trained using the modifier sequences extracted from the training dependency trees, and they were evaluated parsing the validation set and computing the Unlabeled Attachment Score (UAS). UAS is an accuracy measure that accounts for the percentage of tokens that were assigned the correct head word (note that in a dependency tree, each word modifies exactly one head).
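For concreteness, UAS as defined above can be computed with a few lines. The encoding, one list of head indices per sentence with 0 for the root, and the function name are assumptions made for this sketch.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Percentage of tokens whose predicted head index matches the gold head index."""
    correct = total = 0
    for gold, pred in zip(gold_heads, predicted_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted trees must cover the same tokens")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return 100.0 * correct / total

# Two toy sentences; heads are 1-based token indices, 0 is the root.
gold = [[2, 3, 0], [0]]
pred = [[2, 0, 2], [0]]
uas = unlabeled_attachment_score(gold, pred)
print(uas)
```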
7.1 Methods compared
As a SHAG is a collection of automata, each one has its own alphabet Σ ^{ h,d }, defined as the set of symbols occurring in the training modifier sequences for that head h and direction d. We compare the following models:
Baseline models
EM model
A nondeterministic SHAG with n states trained with Expectation Maximization (EM) as in Sect. 5.
Spectral models
 Σ′ basis:

The basis for each WFA is \(\mathcal{P}^{h,d} = \mathcal{S}^{h,d} = (\varSigma^{h,d})^{\prime}= \varSigma^{h,d} \cup \{ \lambda\}\). For this model, we use an additional parameter m, a minimal mass used to discard states. In each WFA, we discard the states with proportional singular value <m.
 Extended basis:

f is a parameter of the model, namely a cut factor that defines the size of the basis as follows. For each WFA, we use as basis \(\mathcal{P}^{h,d}\) and \(\mathcal {S}^{h,d}\) the set of |Σ ^{ h,d }|·f most frequent training subsequences of symbols (up to length 4). Hence, f is a relative size of the basis for a WFA, proportional to the size of its alphabet. We always include the empty sequence λ in the basis.
7.2 Results
The results for the deterministic baselines were a UAS of 68.52 % for Det and a UAS of 74.80 % for Det+F.
In the second set of experiments with the spectral method, we evaluated models estimated with extended basis. Figure 8(b) shows curves for different cut factors f, plotting UAS scores in terms of the number of states.^{4} Here, we clearly see that the performance largely improves and is more stable with bigger values for f.^{5} The best results are clearly better than the ones of the basic model (UAS 80.90 % vs. 79.81 %) and, more interestingly, the curves reach stability without the need of a state discarding strategy.
7.3 Result analysis
Our purpose in this section is to see what information is encoded in the models learned by the spectral algorithm. However, hidden state spaces are hard to interpret, and this is even harder if they are projected into a non-probabilistic space through a basis change, as in our case. To do the analysis, we build DFA that approximate the behaviour of the nondeterministic models when they generate highly probable sequences. The DFA approximations allow us to observe in a simple way some linguistically relevant phenomena encoded in the states, and to compare them with manually encoded features of well-known models. In this section we describe the DFA approximation construction method, and then we use it to analyze the most relevant unlexicalized automaton in terms of number of dependencies, namely, the automaton for h=NN and d=left.
7.3.1 DFA approximation for stochastic WFA
To build a DFA approximation, we first compute a set of forward vectors corresponding to the most frequent prefixes of training sequences. Then, we cluster these vectors using a Group Average Agglomerative algorithm using the cosine similarity measure (Manning et al. 2008). Each cluster i defines a state in the DFA, and we say that a sequence m _{1:t } is in state i if its corresponding forward vector at time t is in cluster i. The transitions in the DFA are defined using a procedure that looks at how sequences traverse the states. If a sequence m _{1:t } is at state i at time t−1, and goes to state j at time t, then we define a transition from state i to state j with label m _{ t }. This procedure may require merging states to give a consistent DFA, because different sequences may define different transitions for the same states and modifiers. After doing a merge, new merges may be required, so the procedure must be repeated until a DFA is obtained.
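The merge-until-consistent step can be sketched as follows. The sketch assumes the clustering has already been done, so that each training sequence comes with the cluster of its forward vector at every position, and it uses a union-find structure to merge conflicting target states. This is one possible way to implement the procedure, not necessarily the one used in the experiments; all names are hypothetical.

```python
def build_dfa(traces):
    """Build a deterministic transition table from state traces, merging on conflict.

    traces: list of (symbols, clusters) with len(clusters) == len(symbols) + 1,
    where clusters[t] is the state reached after reading the first t symbols.
    Returns the transition table and a function mapping a cluster to its merged state.
    """
    parent = {}

    def find(s):
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]   # path halving
            s = parent[s]
        return s

    changed = True
    while changed:                          # repeat until a full pass needs no merge
        changed = False
        delta = {}
        for symbols, clusters in traces:
            for t, sym in enumerate(symbols):
                src, dst = find(clusters[t]), find(clusters[t + 1])
                old = delta.setdefault((src, sym), dst)
                if find(old) != dst:
                    parent[find(old)] = dst  # same state and label, different targets
                    changed = True
    return delta, find

# Two toy modifier sequences that disagree on where DT leads from state 0.
traces = [(("DT", "NN"), [0, 1, 2]),
          (("DT", "JJ"), [0, 3, 2])]
delta, find = build_dfa(traces)
print(delta)
```

In the toy input, states 1 and 3 are merged because both are reached from state 0 on DT, after which the table is deterministic.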
7.3.2 Experiments on unlexicalized WFA
8 Conclusion
The central objective of this paper was to offer a broad view of the main results in spectral learning in the context of grammatical inference, and more precisely in the context of learning weighted automata. With this goal in mind, we presented the recent advances in the field in a way that makes the main underlying principles of spectral learning accessible to a wide audience.
We believe this to be useful since spectral methods are becoming an interesting alternative to the classical EM algorithms widely used for grammatical inference. Part of the appeal of the spectral approach lies in its computational efficiency (at least in the context of automata learning). This efficiency might open the door to large-scale applications of automata learning, where models can be inferred from big data collections.

EM attempts to minimize the KL divergence between the model distribution and the observed distribution. In contrast, the spectral method attempts to minimize an ℓ _{ p } distance between model and observed distribution.

EM searches for stable points of the likelihood function. Instead, the spectral method finds an approximate minimizer of a global loss function.
Most empirical studies, including ours, suggest that the statistical performance of spectral methods is similar to that of EM (e.g. see Cohen et al. 2013 for experiments learning latent-variable probabilistic context-free grammars). However, our empirical understanding is still quite limited and more research needs to be done to understand the relative performance of each algorithm with respect to the complexity of the target model (i.e., size of the alphabet and number of states). Nonetheless, spectral methods offer a very competitive computational performance when compared to iterative methods like EM.
A key difference between the spectral method and other approaches to induce weighted automata is at the conceptual level, particularly in the way in which the learning problem is framed. This conceptual difference is precisely what we tried to emphasize in our presentation of the subject. In a snapshot, the central idea of the spectral approach to learning functions over Σ ^{⋆} is to directly exploit recurrence relations satisfied by families of functions. This is done by providing algebraic formulations of these recurrence relations.
Because spectral learning for grammatical inference is still a young field, many problems remain open. At a technical level, we have already mentioned the two most important: how to choose a sample-dependent basis for the Hankel matrices fed to the method, and how to guarantee that the output WFA is stochastic or probabilistic. The former problem has been discussed at length in Sect. 4.2.2, where we gave heuristics for choosing the input parameters given to the algorithm. The latter problem has received less attention in the present paper, mainly because our experimental framework is not affected by it. However, ensuring the output of the spectral method is a proper probability distribution is important in many applications. Different solutions have been proposed to address this issue: Bailly (2011) gave a spectral method for Quadratic WFA, which by definition always define a nonnegative function; heuristics to modify the output of a spectral algorithm in order to enforce nonnegativity were discussed in Cohen et al. (2013) in the context of PCFG, though they also apply to WFA; for HMM one can use methods based on spectral decompositions of tensors to overcome this problem (Anandkumar et al. 2012b); and one can obtain probabilistic WFA by imposing some convex constraints on the search space of the optimization-based spectral method presented in Balle et al. (2012). All these methods rely on variations of the SVD-based method presented in this paper. An interesting exercise would be to compare their behavior in practical applications.
Besides these technical questions, several conceptual questions regarding spectral learning and its relations to EM remain open. In particular, we would like to have a deeper understanding of the relations between EM, spectral learning and split-merge algorithms, both from a theoretical perspective and from a practical point of view. On the other hand, the principles that underlie spectral learning can be applied to any computational or probabilistic model with some notion of locality, in the sense that the model admits some strong Markov-like conditional independence assumptions. Several extensions along these lines can already be found in the literature, but the limits of these techniques remain largely unknown. From the perspective of grammatical inference, learning beyond stochastic rational languages is the most promising line of work.
Footnotes
 1.
A similar notion can be defined for suffixes as well.
 2.
This is not always the case, see Denis and Esposito (2008) for details.
 3.
Throughout the paper we assume we can distinguish the words in a derivation, irrespective of whether two words at different positions correspond to the same symbol.
 4.
It should be noted that f=1 is not equivalent to a Σ′ basis. While both have the same basis size, the Σ′ basis only has sequences of length ≤1, while the extended model may include longer sequences and discard infrequent symbols.
 5.
For f>10 we did not see significant improvements in the performance.
 6.
Technically, when working with the projected operators the statedistribution vectors will not be distributions in the formal sense. However, they correspond to a projection of a state distribution, for some projection that we do not recover from data (namely a change of basis as discussed in Sect. 2.2). This projection has no effect on the computations because it cancels out.
Notes
Acknowledgements
We are grateful to the anonymous reviewers for providing us with helpful comments. This work was supported by a Google Research Award, and by projects XLike (FP7288342), BASMATI (TIN201127479C0403), SGRLARCA (SGR20091428), and by the EU PASCAL2 Network of Excellence (FP7ICT216886). Borja Balle was supported by an FPU fellowship (AP200802064) from the Spanish Ministry of Education. Xavier Carreras was supported by the Ramón y Cajal program of the Spanish Government (RYC200802223). Franco M. Luque was supported by the National University of Córdoba and by a Postdoctoral fellowship of CONICET, Argentinian Ministry of Science, Technology and Productive Innovation. Ariadna Quattoni was supported by a Juan de la Cierva contract from the Spanish Government (JCI200904240).
References
 Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S., & Liu, Y. K. (2012a). A spectral algorithm for latent Dirichlet allocation. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), NIPS (pp. 926–934).
 Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., & Telgarsky, M. (2012b). Tensor decompositions for learning latent variable models. arXiv:1210.7559.
 Anandkumar, A., Hsu, D., & Kakade, S. M. (2012c). A method of moments for mixture models and hidden Markov models. Journal of Machine Learning Research—Proceedings Track, 23, 33.1–33.34.
 Bailly, R. (2011). Quadratic weighted automata: spectral algorithm and likelihood maximization. Journal of Machine Learning Research—Proceedings Track, 20, 147–163.
 Bailly, R., Denis, F., & Ralaivola, L. (2009). Grammatical inference as a principal component analysis problem. In L. Bottou & M. Littman (Eds.), Proceedings of the 26th international conference on machine learning (pp. 33–40). Montreal: Omnipress.
 Bailly, R., Habrard, A., & Denis, F. (2010). A spectral approach for probabilistic grammatical inference on trees. In M. Hutter, F. Stephan, V. Vovk, & T. Zeugmann (Eds.), Lecture notes in computer science (Vol. 6331, pp. 74–88). Berlin: Springer.
 Baker, J. K. (1979). Trainable grammars for speech recognition. In D. H. Klatt & J. J. Wolf (Eds.), Speech communication papers for the 97th meeting of the Acoustical Society of America (pp. 547–550).
 Balle, B. (2013). Learning finite-state machines: algorithmic and statistical aspects. PhD thesis, Universitat Politècnica de Catalunya.
 Balle, B., & Mohri, M. (2012). Spectral learning of general weighted automata via constrained matrix completion. In Advances in neural information processing systems (Vol. 25, pp. 2168–2176).
 Balle, B., Quattoni, A., & Carreras, X. (2011). A spectral learning algorithm for finite state transducers. In D. Gunopulos, T. Hofmann, D. Malerba, & M. Vazirgiannis (Eds.), Lecture notes in computer science: Vol. 6911. ECML/PKDD (1) (pp. 156–171). Berlin: Springer.
 Balle, B., Quattoni, A., & Carreras, X. (2012). Local loss optimization in operator models: a new insight into spectral learning. In J. Langford & J. Pineau (Eds.), Proceedings of the 29th international conference on machine learning (ICML-2012), ICML’12 (pp. 1879–1886). New York: Omnipress.
 Balle, B., Castro, J., & Gavaldà, R. (2013). Learning probabilistic automata: a study in state distinguishability. Theoretical Computer Science, 473, 46–60.
 Beimel, A., Bergadano, F., Bshouty, N., Kushilevitz, E., & Varricchio, S. (2000). Learning functions represented as multiplicity automata. Journal of the ACM, 47, 506–530.
 Berstel, J., & Reutenauer, C. (1988). Rational series and their languages. Berlin: Springer.
 Boots, B., Siddiqi, S., & Gordon, G. (2011). Closing the learning planning loop with predictive state representations. The International Journal of Robotics Research, 30(7), 954–966.
 Carlyle, J. W., & Paz, A. (1971). Realizations by stochastic finite automata. Journal of Computer and System Sciences, 5(1), 26–40.
 Castro, J., & Gavaldà, R. (2008). Towards feasible PAC-learning of probabilistic deterministic finite automata. In Proceedings of the 9th international colloquium on grammatical inference (ICGI) (pp. 163–174).
 Clark, A., & Thollard, F. (2004). PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research, 5, 473–497.
 Clark, S., & Curran, J. R. (2004). Parsing the WSJ using CCG and log-linear models. In Proceedings of the 42nd meeting of the association for computational linguistics (ACL’04), main volume, Barcelona, Spain (pp. 103–110).
 Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., & Ungar, L. (2012). Spectral learning of latent-variable PCFGs. In Proceedings of the 50th annual meeting of the association for computational linguistics (Volume 1: Long papers) (pp. 223–231). Jeju Island: Association for Computational Linguistics.
 Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., & Ungar, L. (2013). Experiments with spectral learning of latent-variable PCFGs. In Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: human language technologies (pp. 148–157). Atlanta: Association for Computational Linguistics.
 Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1–38.
 Denis, F., & Esposito, Y. (2008). On rational stochastic languages. Fundamenta Informaticae, 86(1–2), 41–77.
 Dhillon, P., Rodu, J., Collins, M., Foster, D., & Ungar, L. (2012). Spectral dependency parsing with latent variables. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 205–213). Jeju Island: Association for Computational Linguistics.
 Eisner, J. (2000). Bilexical grammars and their cubic-time parsing algorithms. In H. Bunt & A. Nijholt (Eds.), Advances in probabilistic and other parsing technologies (pp. 29–62). Norwell: Kluwer Academic.
 Eisner, J., & Satta, G. (1999). Efficient parsing for bilexical context-free grammars and head-automaton grammars. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics (ACL), University of Maryland (pp. 457–464).
 Eisner, J., & Smith, N. A. (2010). Favor short dependencies: parsing with soft and hard constraints on dependency length. In H. Bunt, P. Merlo, & J. Nivre (Eds.), Trends in parsing technology: dependency parsing, domain adaptation, and deep parsing (Vol. 8, pp. 121–150). Berlin: Springer.
 Fliess, M. (1974). Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 53, 197–222.
 Goodman, J. (1996). Parsing algorithms and metrics. In Proceedings of the 34th annual meeting of the Association for Computational Linguistics (pp. 177–183). Santa Cruz: Association for Computational Linguistics.
 Hsu, D., Kakade, S. M., & Zhang, T. (2009). A spectral algorithm for learning hidden Markov models. In Proceedings of the annual conference on computational learning theory (COLT).
 Hsu, D., Kakade, S. M., & Liang, P. (2012). Identifiability and unmixing of latent parse trees. In Advances in neural information processing systems (NIPS).
 Kearns, M., Mansour, Y., Ron, D., Rubinfeld, R., Schapire, R. E., & Sellie, L. (1994). On the learnability of discrete distributions. In Proceedings of the twenty-sixth annual ACM symposium on theory of computing (STOC ’94) (pp. 273–282). New York: ACM.
 Luque, F. M., Quattoni, A., Balle, B., & Carreras, X. (2012). Spectral learning for nondeterministic dependency parsing. In Proceedings of the 13th conference of the European chapter of the Association for Computational Linguistics (pp. 409–419). Avignon: Association for Computational Linguistics.
 Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval (1st ed.). Cambridge: Cambridge University Press.
 Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19, 313–330.
 McDonald, R., & Pereira, F. (2006). Online learning of approximate dependency parsing algorithms. In Proceedings of the 11th conference of the European chapter of the Association for Computational Linguistics (pp. 81–88).
 McDonald, R., Pereira, F., Ribarov, K., & Hajic, J. (2005). Non-projective dependency parsing using spanning tree algorithms. In Proceedings of human language technology conference and conference on empirical methods in natural language processing (pp. 523–530). Vancouver: Association for Computational Linguistics.
 Mohri, M. (2009). Weighted automata algorithms. In M. Droste, W. Kuich, & H. Vogler (Eds.), Monographs in theoretical computer science. An EATCS series. Handbook of weighted automata (pp. 213–254). Berlin: Springer.
 Mossel, E., & Roch, S. (2005). Learning nonsingular phylogenies and hidden Markov models. In Proceedings of the 37th annual ACM symposium on theory of computing (STOC) (pp. 366–375).
 Palmer, N., & Goldberg, P. W. (2007). PAC-learnability of probabilistic deterministic finite state automata in terms of variation distance. Theoretical Computer Science, 387(1), 18–31.
 Parikh, A., Song, L., & Xing, E. (2011). A spectral algorithm for latent tree graphical models. In Proceedings of the 28th international conference on machine learning, ICML 2011 (ICML) (pp. 1065–1072).
 Park, J. D., & Darwiche, A. (2004). Complexity results and approximation strategies for map explanations. Journal of Artificial Intelligence Research, 21, 101–133.
 Petrov, S., & Klein, D. (2007). Improved inference for unlexicalized parsing. In Human language technologies 2007: the conference of the North American chapter of the Association for Computational Linguistics; proceedings of the main conference (pp. 404–411). Rochester: Association for Computational Linguistics.
 Petrov, S., Das, D., & McDonald, R. (2012). A universal part-of-speech tagset. In Proceedings of LREC.
 Ron, D., Singer, Y., & Tishby, N. (1998). On the learnability and usage of acyclic probabilistic finite automata. Journal of Computing Systems Science, 56(2), 133–152.
 Salomaa, A., & Soittola, M. (1978). Automatatheoretic aspects of formal power series. New York: Springer. zbMATHCrossRefGoogle Scholar
 Schützenberger, M. (1961). On the definition of a family of automata. Information and Control, 4, 245–270. zbMATHMathSciNetCrossRefGoogle Scholar
 Siddiqi, S. M., Boots, B., & Gordon, G. J. (2010). Reducedrank hidden Markov models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (AISTATS) (pp. 741–748). Google Scholar
 Song, L., Siddiqi, S. M., Gordon, G., & Smola, A. (2010). Hilbert space embeddings of hidden Markov models. In J. Fürnkranz & T. Joachims (Eds.), Proceedings of the 27th international conference on machine learning (ICML10) (pp. 991–998). Haifa: Omnipress. Google Scholar
 Titov, I., & Henderson, J. (2006). Loss minimization in parse reranking. In Proceedings of the 2006 conference on empirical methods in natural language processing (pp. 560–567). Sydney: Association for Computational Linguistics. CrossRefGoogle Scholar
 Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142. zbMATHCrossRefGoogle Scholar
 Wiewiora, E. (2005). Learning predictive representations from a history. In Proceedings of the 22nd international conference on machine learning (pp. 964–971). New York: ACM. Google Scholar