1 Introduction

A recurring task in the context of parsing and neural sequence to sequence models—such as machine translation (Ilya et al. 2011; Sutskever et al. 2014), natural language processing (Schmidhuber 2014) and generative models (Graves 2013)—is to find an optimal path of tokens (e.g. words or letters) from a sequential list of probability distributions. Such a distribution can for instance be produced at the output layer of a recurrent neural network, e.g. a long short-term memory (LSTM). The goal is to decode these distributions by scoring all viable output sequences (paths) under some language model, and finding the path with the highest score.

Nowadays, the de facto standard solution is to use a variant of beam search (Steinbiss et al. 1994; Vijayakumar et al. 2016; Wiseman and Rush 2016; Kulikov et al. 2018; Pratap et al. 2020) to traverse the list of all possible output strings. Beam search stores and explores a constant sized list of possible decoded hypotheses at each step, compared to a greedy algorithm that only considers the top element at each step. Beam search thus interpolates between a simple greedy algorithm and best-first search; but just like greedy search, beam search is not guaranteed to find a global optimum. Furthermore, beam search suffers from sensitivity to the predicted sequence length. Improving the algorithm itself (Murray and Chiang 2018; Yang et al. 2018), as well as finding new decoding strategies (Fan et al. 2018; Holtzman et al. 2020), is an ongoing field of research.
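For concreteness, the following is a minimal sketch of this classical baseline (the variable names and toy per-step distributions are our own, and a real decoder would add a language-model term to the score): beam search keeps only the `beam_width` best prefixes after every step, which is exactly where the lack of global guarantees comes from.

```python
import numpy as np

def beam_search(step_probs, beam_width=3):
    """Decode a sequence of per-step token distributions with beam search.

    step_probs: array of shape (n_steps, n_tokens); row t holds Pr(X_t = token).
    Returns the best path found and its log-probability. Scores here are plain
    log-likelihoods; a real decoder would add a language-model term.
    """
    beams = [((), 0.0)]                                  # (prefix, log-score)
    for probs in step_probs:
        candidates = [
            (prefix + (tok,), score + np.log(p))
            for prefix, score in beams
            for tok, p in enumerate(probs) if p > 0
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]                  # prune to the beam width
    return beams[0]

rng = np.random.default_rng(0)
step_probs = rng.dirichlet(np.ones(5), size=8)           # 8 steps over a 5-token alphabet
path, logp = beam_search(step_probs, beam_width=3)
print(path, logp)                                        # beam_width=1 recovers greedy decoding
```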

A related task is found in transition-based parsing of formal languages, such as context-free grammars (Hopcroft et al. 2001; Zhang and Clark 2008; Zhang and Nivre 2011; Zhu et al. 2015; Dyer et al. 2015). In this model, an input string is processed token by token, and a heuristic prediction (which can be based on various types of classifiers, such as feed-forward networks) is made on how to apply a transition at any one point. As in generative models and decoding tasks, heuristic parsing employs beam search, where a constant-sized list of possible parse trees is retained in memory at any point in time, and at the end the hypothesis optimising a suitable objective function is chosen. Improvements of beam search-based parsing strategies are an active field of research (Buckman et al. 2016; Bohnet et al. 2016; Vilares and Gómez-Rodríguez 2018).

In essence, the problem of decoding a probabilistic sequence with a language model—or probabilistically parsing a formal grammar—becomes one of searching for paths in an exponentially growing tree: at each step or node, the list of possible sequence hypotheses branches, with maximum degree equal to the number of predictions for the next token. The goal is to find a path through this search space with the highest overall score. Due to runtime and memory constraints, a tradeoff has to be made which limits any guarantees on the performance of the search strategy.

Quantum computing has shown promise as an emerging technology to efficiently solve some instances of difficult computing tasks in fields ranging from optimisation (Gilyén et al. 2019; Montanaro 2020), linear algebra (Harrow et al. 2009; Berry et al. 2017), number theory and pattern matching (Montanaro 2016; 2017), language processing (Aaronson et al. 2019; Wiebe et al. 2019), machine learning (McClean et al. 2016; Bausch 2018; Wang et al. 2019; Li et al. 2019), to quantum simulation (Lloyd 1996; Babbush et al. 2018; Childs and Su 2019). While quantum computers are not yet robust enough to evaluate any of these applications on sample sizes large enough to claim an empirical advantage, a structured search problem such as language decoding is a prime candidate for a quantum speedup.

Although most naïve search problems can be sped up using Grover’s search algorithm (or one of its variants, such as fixed point search or oblivious amplitude amplification), finding good applications for quantum algorithms remains challenging, and super-quadratic (i.e. faster than Grover) speedups—such as Shor’s for prime factorisation (Shor 1999)—are rare. Recently, several exponentially faster algorithms (such as quantum recommender systems (Kerenidis and Prakash 2016), or dense low rank linear algebra (Wossnig et al. 2018)) have been proven to rely on a quantum random access memory model which, if classically available, can yield an exponential speedup without the need for quantum computing (Tang 2019).

In this work, we develop a quantum search decoder for parsing probabilistic token sequences with a super-quadratic speedup as compared to its classical counterpart. The algorithm can be seen as a generalisation of classical beam search, with potentially infinite beam width; for finite beam width, the list of hypotheses is pruned only once at the very end—after all possible parsing hypotheses have been generated—instead of performing continuous pruning during decoding, resulting in higher accuracy guarantees.

We develop two variants of the decoder. The first one is for finding the most likely parsed string. The more realistic use case is where the input sequence simply serves as advice on where to find the top scoring parse under a secondary metric—i.e. where the element with the highest decoder score is not necessarily the one with the highest probability of occurring when sampled. In this variant the speedup becomes more pronounced (i.e. the runtime scales less and less quickly in the input length) the better the advice (i.e. the steeper the power law falloff of the input, see Fig. 1).

Fig. 1 Exponent f(R, k) of the expected runtime of QuantumSearchDecode when fed a power law input with exponent k over R alphabet tokens; individual curves are plotted for the values R ∈{3,5,10,15,20,30,40,60,100}, from top to bottom. For all R, f(R, k) drops off exponentially with growing k

Our novel algorithmic contribution is to analyse a recently developed quantum maximum finding algorithm (Apeldoorn et al. 2017) and its expected runtime when provided with a biased quantum sampler that we developed for formal grammars, under the premise that at each step the input tokens follow a power-law distribution; for a probabilistic sequence obtained from Mozilla’s DeepSpeech (which we show satisfies the premise), the quantum search decoder runs polynomially faster than what is possible classically, by a power of ≈ 4–5 (Fig. 2).

Fig. 2 Runtime of quantum beam search decoding the output of Mozilla’s DeepSpeech LSTM with a grammar, assuming an average branching ratio of R = 5, a token power law distribution with exponent k = 2.91, and post-amplification of the quantum search decoder with a constant number of retained hypotheses \(N_{\text{hyp}} \in \{10^{1},\ldots ,10^{15}\}\), plotted in rainbow colors from purple to red, bottom to top. In the left region, where full QuantumSearchDecoding is performed (as the beam comprises all possible hypotheses), a super-Grover speedup is obtained (Corollary 2). Where the beam width saturates, a Grover speedup is retained, and hypotheses are pruned only after all hypotheses have been constructed

In the following we assume basic familiarity with the notion of quantum computation, but provide a short overview for the reader in Appendix 1.

2 Main results

In this paper, we address the question of decoding a probabilistic sequence of words, letters, or generally tokens, obtained, e.g., from the final softmax layer of a recurrent neural network, or given as a probabilistic list of heuristic parse transitions. These models are essentially identical from a computational perspective. Hence, we give the following formal setup, and will speak of a decoding task, leaving implicit the two closely related applications.

Given an alphabet Σ, we expect as input a sequence of random variables X = (X1,X2,…,Xn), each distributed as \(X_{i}\sim \mathcal {D}_{i}^{{{\varSigma }}}\). The distributions \(\mathcal {D}_{i}^{{{\varSigma }}}\) can in principle vary for each i; furthermore, the Xi can either be independent, or include correlations. The input model is such that we are given this list of distributions explicitly, e.g. as a table of floating point numbers; for simplicity of notation we will continue to write Xi for such a table. The decoding machine M is assumed to ingest the input (a sample of the Xi) one symbol at a time, and branch according to some factor R at every step; for simplicity we will assume that R is constant (e.g. an upper bound to the branching ratio at every step). As noted, M can for instance be a parser for a formal grammar (such as an Earley parser (Earley 1970)) or some other type of language model; it can either accept good input strings, or reject others that cannot be parsed. The set of configurations of M that lead up to an accepted state is denoted by Ω; we assume that everything that is rejected is mapped by the decoder to some type of sink state ω ∉ Ω. For background details on formal grammars and automata we refer the reader to Hopcroft et al. (2001), and we provide a brief summary of essential topics in Appendix 2.

While we can allow M to make use of a heuristic that attempts to guess good candidates for the next decoding step, it is not difficult to see that a randomised input setting is already more generic than allowing extra randomness to occur within M itself: we thus restrict our discussion to a decoder M that processes a token sequence step by step, and such that its state itself now simply becomes a sequence \((M_{i})_{i\le n}\) of random variables. More precisely, described as a stochastic process, the Mi are random variables over the set Ω of internal configurations after the automaton has ingested Xi, given that it has ingested Xi−1,…,X1 prior to that, with a distribution \(\mathcal {D}_{i}^{{{\varOmega }}}\). The probability of decoding a specific accepted string x = (x1,…,xn) is then given by the product of the conditional probabilities

$$ \begin{array}{@{}rcl@{}} \text{Pr}(M_{n} = x):\!&=&\! \frac{1}{\mathcal{N}}\text{Pr}(X=x) \\ \!&=&\! \frac{1}{\mathcal{N}}\prod\limits_{i=1}^{n}\text{Pr}(X_{i} = x_{i}|X_{j} = x_{j}, j\!\le\! i - 1) \end{array} $$
(1)

where \(\mathcal {N}= {\sum }_{x \in {{\varOmega }}} \text {Pr}(X=x)\). In slight abuse of notation we write Mn = x when we mean Mn = y(x), where y(x) is the configuration of the parser M that was provided with some input to produce the parsed string x (which is unambiguous as there is a one-to-one mapping between accepted strings and parser configurations y(x)). Similarly, we write x ∈ Ω for an accepted string/decoded path.

The obvious question is: which final accepted string of the decoder is the most likely? This is captured in the following computational problem.

Most Likely Parse

Input: Decoder M over alphabet Σ, set of accepting configurations Ω. Sequence of random variables \((X_{i})_{i\le n}\) over sample space Σ.

Question: Find \(\sigma = \text{argmax}_{x\in {{\varOmega }}} \text {Pr}(M_{n} = x)\).

Classically, it is clear that if we have a procedure that can sample the random variable Mn efficiently, then we can find the most likely element with an expected runtime of 1/Pr(Mn = σ), as this is the number of samples we are expected to draw to see the element once. While such sampling algorithms might be inefficient to construct in general, we emphasise that the question of drawing samples from strings over a formal language is an active field of research, and algorithms to sample uniformly are readily available for a large class of grammars: in linear time for regular languages (Bernardi and Giménez 2012; Oudinet et al. 2013), but also context-free grammars/restrictions thereof (McKenzie 1997; Goldwurm et al. 2001; Hickey and Cohen 1983; Gore et al. 1997; Denise 1996), potentially with global word bias (Reinharz et al. 2013; Lorenz and Ponty 2013; Denise et al. 2000; Ponty 2012).
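As a toy sanity check of this sampling baseline (entirely our own example, unrelated to the samplers cited above): drawing repeatedly from a fixed distribution until a target element appears takes, in expectation, one over that element's probability many draws.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.15, 0.05])      # toy distribution over four parse paths
sigma = int(np.argmax(p))                  # the most likely element

def draws_until_seen(target, p, rng):
    """Number of i.i.d. draws from p until `target` is observed for the first time."""
    n = 0
    while True:
        n += 1
        if rng.choice(len(p), p=p) == target:
            return n

trials = [draws_until_seen(sigma, p, rng) for _ in range(2000)]
print(np.mean(trials), 1 / p[sigma])       # both are ≈ 2, i.e. 1/Pr(sigma)
```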

In Theorem 3 and Section 3.1, we lift such a classical uniform sampler to a quantum sampler (denoted Uμ) with local (instead of global) word bias, which we can use to obtain a quantum advantage when answering Most Likely Parse. We note that the techniques used to prove Theorem 3 may well be used to obtain a (potentially faster) classical Monte Carlo procedure to sample from Mn. In what follows, we will therefore keep the decoder’s time complexity separate from the sampler’s runtime and simply speak of the decoder’s query complexity to Uμ, but we emphasise that constructing such an Uμ is efficiently possible, given a classical description of an automaton that parses the grammar at hand.

We start with the following observation, proven in Section 4.1.

Theorem 1

For an input sequence of n random variables to a parser with sampling subroutine Uμ, there exists a quantum search algorithm answering Most Likely Parse, using \(\pi /(4\sqrt {\text {Pr}(M_{n}=\sigma )})\) queries to Uμ.

As explained, this theorem formalises the expected quadratic speedup of the runtime as compared to a classical algorithm based on sampling from Mn. Given the input to the parser is power-law distributed (see Definition 1), this allows us to formulate the following corollary.

Corollary 1

If the \(X_{i}\sim \text {Power}_{R}(k)\), answering Most Likely Parse requires at most \(H_{R}(k)^{n/2}\) queries, where \(H_{R}(k)={\sum }_{i=1}^{R} i^{-k}\).
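To get a feel for these quantities, a quick numerical illustration (the parameters are ours, loosely following the DeepSpeech example of Section 6; the classical figure is the sampling baseline discussed above):

```python
import numpy as np

def H(R, k):
    """Generalised harmonic number H_R(k) = sum_{i=1}^R i^(-k) from Corollary 1."""
    return np.sum(np.arange(1, R + 1, dtype=float) ** -k)

# illustrative parameters only, loosely following the DeepSpeech example of Section 6
R, k, n = 5, 2.91, 100
print("quantum query bound      :", H(R, k) ** (n / 2))
print("classical sampling budget:", H(R, k) ** n)   # ~ 1/Pr(M_n = sigma), up to normalisation
```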

Yet a priori, it is not clear that the weight of a decoded path (e.g. the product of probabilities of the input tokens) also corresponds to the highest score we wish to assign to such a path. This becomes obvious in the setting of a heuristic applied to a live translation: while at every point in time the heuristic might be able to guess a good forward transition, it might well be that long range correlations strongly affect the likelihood of prior choices. Research addressing these long-distance “collocations” indicates that LSTM models are capable of using about 200 tokens of context on average, but that they sharply distinguish nearby context (≈ 50 tokens) from the distant past. Furthermore, such models appear to be very sensitive to word order within the most recent context, but ignore word order in the long-range context (more than 50 tokens away) (Zhu et al. 2015; Dabrowska 2008; Khandelwal et al. 2018). Similarly, transformer-type architectures with self-attention—while outperforming LSTMs—feature a fixed-width context window; extensions thereof are an active field of research (Al-Rfou et al. 2019; Dai et al. 2019; Kitaev et al. 2020).

To address this setting formally, we assume there exists a scoring function \(F: {{\varOmega }} \to \mathbb {R}\), which assigns scores to all possible decoded paths. Without loss of generality, there will be one optimal string which we denote with \(\tau = \text{argmax}_{x\in {{\varOmega }}} F(x)\). Furthermore, we order all decoded strings in Ω in some fashion, and index them with numbers i = 1,…,|Ω|. Within this ordering, τ can now be in different places—either because the heuristic guesses differently at each step, or because the input sequence varied a little. We denote the probability that the marked element τ is at position i with pi. In essence, the position where τ is found is now a random variable itself, with probability mass Pr(finding τ at index i) = pi.

For the decoder probabilities Pr(Mn = x) to serve as good advice on where to find the highest-score element under the metric F, we demand that the final distribution over the states of the decoder puts high mass where the highest-scoring element often occurs; or formally that

$$ \text{Pr}(M_{n} = \text{string with index \textit{i}} ) = p_{i}. $$
(2)

To be precise, we define the following problem.

Highest Score Parse

Input: Decoder M over alphabet Σ and with state space Ω. Sequence of random variables \((X_{i})_{i\le n}\) over sample space Σ. Scoring function \(F: {{\varOmega }} \to \mathbb {R}\).

Promise: Eq. (2).

Question: Find \(\tau = \text{argmax}_{x\in {{\varOmega }}} F(x)\).

What is the classical baseline for this problem? As mentioned in Montanaro (2011), if px is the probability that x is the highest-scoring string, then in expectation one has to obtain 1/px samples to see x at least once. Any procedure based on sampling from the underlying distribution px thus has expected runtime \({\sum }_{x\in {{\varOmega }}}\frac {1}{p_{x}}\times p_{x} = |{{\varOmega }}|\). In a sense this is as bad as possible; the advice gives zero gain over iterating the list item by item and finding the maximum in an unstructured fashion. Yet provided with the same type of advice, a quantum computer can exhibit tremendous gains over unstructured search, such as the following statement, formally proven in Section 4.2.

Theorem 2

With the same setup as in Theorem 1 but under the promise that the input tokens are iid with \(X_{i}\sim \text {Power}_{|{{\varSigma }}|}(k)\) over alphabet Σ (Definition 1), that the decoder has a branching ratio R ≤|Σ|, and that we can uniformly sample from the grammar to be decoded, there exists a quantum algorithm QuantumSearchDecode (Algorithm 1) answering Highest Score Parse with an expected number of iterations

$$ \begin{array}{@{}rcl@{}} \text{RT}_{1}(R,k,n) &=& \mathrm{O}\left( R^{nf(R,k)}\right),\\ \text{where}\quad f(R,k) &=& \log\left.\left( \frac{H_{R}(k/2)}{H_{R}(k)^{1/2}} \right) \right/ \log R, \end{array} $$

and where HR(k) is defined in Corollary 1.

There exists no classical algorithm to solve this problem based on taking stochastic samples from the decoder M that requires less than \({{\varOmega }}(R^{n})\) samples.

The exponent f(R, k) indicates the speedup over a classical implementation of the decoding algorithm (which would have to search over \(R^{n}\) elements). We find that f(R, k) < 1/2 for all R, k > 0, and in fact f(R, k)→0 exponentially quickly with k; we formulate the following corollary.

Corollary 2

For k > 0, QuantumSearchDecode is always faster than plain Grover search (with runtime ∝ \(R^{n/2}\)); the extent of the speedup depends on the branching ratio R and the power law exponent k (see Fig. 1).

Finally, in Section 5 we modify the full quantum search decoder by only searching over the paths with likelihood above some given threshold (that we allow to depend on n in some fashion), turning the decoder into a type of beam search, but where the pruning only happens at the very end (Algorithm 2). This means that in contrast to beam search, the top scoring element is found over the globally most likely parsed paths, avoiding the risk that early beam pruning brings. We analyse the runtime of Algorithm 2 for various choices of beam width numerically, and evaluate its performance on a concrete example—Mozilla’s DeepSpeech implementation, a speech-to-text LSTM which we show to follow a power-law token distribution at each output frame (see Appendix 8 for an extended discussion). For DeepSpeech, we empirically find that input sequence lengths of up to 500 tokens can realistically be decoded, with an effective beam width of \(10^{15}\) hypotheses—while requiring \(\approx 3 \times 10^{6}\) search iterations (cf. Fig. 2). As expected, the super-Grover speedup from Corollary 2 is achieved in the regime where full QuantumSearchDecoding happens; once the beam width saturates, the speedup asymptotically approaches a quadratic advantage as compared to classical beam search.

3 Quantum search decoding

In this section, we give an explicit algorithm for QuantumSearchDecode. As mentioned before (see Section 2), we assume we have access to a classical sampling algorithm that, given a list of transition probabilities determined by the inputs X1,…,Xn, yields a random sample drawn uniformly from the distribution. Since this sampler is given as a classical probabilistic program, we first need to translate it to a quantum algorithm. We start with the following lemma.

Lemma 1

For a probabilistic classical circuit with runtime T(n) and space requirement S(n) on an input of length n, there exists a quantum algorithm that runs in time \(\mathrm {O}(T(n)^{\log _{2} 3})\) and requires \(\mathrm {O}(S(n)\log T(n))\) qubits.

Proof

Follows from Thm. 1 in Buhrman et al. (2001); see Appendix 7. □

3.1 Biased quantum sampling from a regular or context-free grammar

Given a sampler that can yield uniformly distributed strings si of a language, we want to raise it to a quantum circuit Uμ that produces a quantum state which is a biased superposition over all such strings \(s_{i} = a_{i1}a_{i2}{\cdots } a_{in}\), where each string is weighted by the probability pij of the symbol aij occurring at index j (i.e. by Eq. (1)). In addition to the weighted superposition, we would like to have the weight of each state in the superposition spelled out as an explicit number in an extra register (e.g. as a fixed precision floating point number), i.e. as

$$ \mathbf{U}_{\mu}|0\rangle = |{\mu}\rangle \propto \underset{q\in{{\varOmega}}}{\sum} \sqrt{p_{q}}|{h_{q}}\rangle|{p_{q}}\rangle|{q}\rangle, $$
(3)

where Ω is the set of accepted strings reachable by the decoder in n steps, |hq〉 is an ancillary state that depends on q and is contained in the decoder’s work space, and where q is a state reached by reading the input sequence \(a_{q1},a_{q2},\ldots ,a_{qn}\). The weights are given by \(p_{q} = {\prod }_{j=1}^{n} p_{qj}\).

As outlined in the introduction, we know there exist uniform classical probabilistic samplers for large classes of grammars, e.g. for regular languages in linear time (e.g. Oudinet et al. 2013) and polynomial time for variants of context free grammars (e.g. Goldwurm et al. 2001). Keeping the uniform sampler’s runtime separate from the rest of the algorithm, we can raise the sampler to a biased quantum state preparator for |μ〉.

Theorem 3

Assume we are given a classical probabilistic algorithm that, in time T(n), produces a uniform sample of length n from a language, and we are also given a list of independent random variables X1,…,Xn with pdfs pi, j for i = 1,…,n and j = 1,…,|Σ|. Then we can construct a quantum circuit \(\mathbf {U}_{\mu ^{\prime }}\) that produces a state \(|{\mu ^{\prime }}\rangle \) that is \(\epsilon \)-close (in total variation distance) to the one in Eq. (3). The algorithm runs in time \(\mathrm {O}(T(n)^{1.6} \times n^{3}\kappa /\epsilon ^{2})\), where κ is an upper bound on the relative variance of the conditional probabilities \(\text{Pr}(a|s_{1} {\dots } s_{i})\), for \(a, s_{i}\in {{\varSigma }}\), for the variable \(X_{i+1}\) given the random string \(X_{i}X_{i-1}{\cdots } X_{1}\).

Proof

See Appendix 3. □

Getting a precise handle on κ strongly depends on the grammar to be parsed and the input presented to it; it seems unreasonable to claim any general bounds as it will most likely be of no good use for any specific instance. However, we note that it is conceivable that if the input is long and reasonably independent of the language to be sampled, then κ should be independent of n, and \(\kappa \approx 1/p(r_{\min \limits })\), where p(r) is the distribution of the input tokens at any point in time—e.g. \(p(r) \propto r^{-k}\) as in a power law.
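To build intuition for the target state |μ〉 of Eq. (3), the following purely classical toy sketch (our own example: a "grammar" of binary strings without two consecutive ones, and made-up per-step token probabilities) enumerates the accepted strings and computes the weights pq and amplitudes \(\sqrt{p_q}\) that Uμ would prepare coherently:

```python
import itertools
import numpy as np

# toy "grammar": binary strings of length n that never contain two consecutive ones
def accepts(s):
    return all(not (a == 1 and b == 1) for a, b in zip(s, s[1:]))

n = 6
p_step = np.array([0.8, 0.2])          # per-step token probabilities p_{i,j} (iid, biased to 0)

accepted = [s for s in itertools.product((0, 1), repeat=n) if accepts(s)]
weights = np.array([np.prod([p_step[t] for t in s]) for s in accepted])
weights /= weights.sum()               # the normalisation N of Eq. (1)
amplitudes = np.sqrt(weights)          # the sqrt(p_q) of Eq. (3)

print(len(accepted), "accepted strings; largest amplitude:", amplitudes.max())
```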

3.2 The quantum search decoder

The quantum algorithm underlying the decoder is based on the standard maximum finding procedure developed by Dürr and Høyer (1996) and Ahuja and Kapoor (1999), and its extension in Apeldoorn et al. (2017) used in the context of SDP solvers.

The procedure takes as input a unitary operator Uμ which prepares the advice state, and a scoring function F which scores its elements, and returns as output the element within the advice state that has the maximum score under F. As in Section 3.1, we assume that F can be made into a reversible quantum circuit to be used in the comparison operation. We also note that reversible circuits for bit string comparison and arithmetic are readily available (Oliveira and Ramos 2007), and can, e.g., be implemented using quantum adder circuits (Gidney 2018).

Algorithm 1: QuantumSearchDecode

Algorithm 1 lists the steps in the decoding procedure. As a subroutine within the search loop, we perform exponential search with oblivious amplitude amplification (Berry et al. 2014). As in the maximum finding algorithm, the expected query count for quantum search decoding is given as follows.

Theorem 4

If x ∈ Ω is the highest-scoring string and px its weight in Eq. (3), the expected number of iterations in QuantumSearchDecode to find it is \(\mathrm {O}(\min \limits \{ 1/\sqrt {p_{x}},\sqrt n\})\).

Proof

Immediate by Apeldoorn et al. (2017). □

In the following we will for simplicity say |x〉 is the highest-scoring string (including ancillary states given in Eq. (3)), and write |〈x|μ〉| instead of \(\sqrt {p_{x}}\). It is clear that the two notions are equivalent.
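To illustrate the mechanics of the search loop, here is a small classical statevector simulation of maximum finding with an advice state (a toy sketch with made-up advice weights and scores; unlike Algorithm 1, it reads the optimal number of amplification steps off the simulated state instead of running exponential search):

```python
import numpy as np

rng = np.random.default_rng(7)

def amplify(state, marked, init):
    """One amplitude-amplification step: oracle phase flip on the marked items,
    followed by a reflection about the advice state |init>."""
    state = state.copy()
    state[marked] *= -1
    return 2.0 * init * (init @ state) - state

def maximum_finding(advice_probs, scores, max_rounds=30):
    """Toy statevector simulation of maximum finding with an advice state.

    advice_probs: the weights p_q of Eq. (3); measuring |mu> yields q with this probability.
    scores:       F(q) for every q; we look for argmax_q F(q).
    """
    init = np.sqrt(advice_probs)
    best = rng.choice(len(scores), p=advice_probs)       # first measurement of |mu>
    for _ in range(max_rounds):
        marked = scores > scores[best]
        p_marked = advice_probs[marked].sum()
        if p_marked == 0.0:
            break                                        # nothing scores higher: done
        # Algorithm 1 would use exponential search / oblivious amplitude amplification;
        # for brevity we read the optimal iteration count off the simulated state.
        n_iter = max(int(np.pi / (4 * np.arcsin(np.sqrt(p_marked)))), 1)
        state = init.copy()
        for _ in range(n_iter):
            state = amplify(state, marked, init)
        probs = np.abs(state) ** 2
        candidate = rng.choice(len(scores), p=probs / probs.sum())
        if scores[candidate] > scores[best]:             # keep only improvements
            best = candidate
    return best

N = 64
advice = rng.dirichlet(np.ones(N))                       # stand-in for the p_q of Eq. (3)
scores = rng.normal(size=N)                              # stand-in for the scoring function F
print(maximum_finding(advice, scores), int(np.argmax(scores)))   # typically agree
```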

4 Power law decoder input

In this section we formally prove that if the decoder is fed independent tokens that are distributed like a power law, then the resulting distribution over the parse paths yields a super-Grover speedup—meaning the decoding speed is faster than applying Grover search, which itself is already quadratically faster than a classical search algorithm that traverses all possible paths individually.

A power law distribution is the discrete variant of a Pareto distribution, also known as Zipf’s law, which ubiquitously appears in the context of language features (Jäger 2012; Stella and Brede 2016; Egghe 2000; Piantadosi 2014). This fact has already been exploited by some authors in the context of generative models (Goldwater et al. 2011).

Formally, we define it as follows.

Definition 1

Let A be a finite set with |A| = R, and k > 1. Then PowerR(k) is the power law distribution over R elements: for \(X\sim \text {Power}_{R}(k)\) the probability density function is \(\text {Pr}(X = x) = r^{-k}/H_{R}(k)\) for an element x of rank r (i.e. the rth most likely element), where HR(k) is the Rth harmonic number of order k (Corollary 1).

We are interested in the Cartesian product of power law random variables, i.e. sequences of random variables of the form (X1,…,Xn). Assuming the random variables \(X_{i}\sim \text {Power}_{R}(k)\) are all independent and of rank ri with pdf \(q(r_{i})=r_{i}^{-k}/H_{R}(k)\), respectively, it is clear that

$$ p(r_{1},\ldots,r_{n}) = \prod\limits_{i=1}^{n} q(r_{i}) = \frac{1}{H_{R}(k)^{n}}\frac{1}{(r_{1}{\cdots} r_{n})^{k}}. $$
(4)

As in Montanaro (2011), we can upper bound the number of decoder queries in QuantumSearchDecode by calculating the expectation value of the iterations necessary—given by Theorem 4—with respect to the position of the top element.

We assume that at every step, when presented with choices from an alphabet Σ, the parsed grammar branches on average R ≤|Σ| times. Of course, even within a single time frame, the subset of accepted tokens may differ depending on what the previously accepted tokens are. This means that if the decoder is currently on two paths β1 (e.g. corresponding to “I want”) and β2 (“I were”), where the next accepted token sets are \({{\varSigma }}_{1},{{\varSigma }}_{2}\subseteq {{\varSigma }}\) (each a different subset of possible next letters for the two presented sentences), respectively, then we do not necessarily have that the total probabilities of choices for the two paths—Pr(Σ1) and Pr(Σ2)—are equal. But what does this distribution over all possible paths of the language, weighted by Eq. (1), look like?

Certainly this will depend on the language and type of input presented. Under a reasonable assumption of independence between input and decoded grammar, this becomes equivalent to answering the following question: let X be a product-of-powerlaw distribution with pdf given in Eq. (4), where every term is a powerlaw over Σ. Let Y be defined as X, but with a uniformly random subset of elements deleted; in particular, such that \(R^{n}\) elements are left, for some R < |Σ|. Is Y distributed as a product-of-powerlaws as in Eq. (4), but over R elements at each step? In the case of continuous variables this is a straightforward calculation (see Appendix 5); numerics suggest it also holds true for the discrete case.

But even if the input given to the parser is independent of the parsed grammar, it is not clear whether the sample distribution over R (i.e. sampling R out of |Σ| power-law distributed elements) follows the same power law as the original one over Σ; this is in fact not the case in general (Zhu et al. 2015). However, it is straightforward to numerically estimate the changed power law exponent of a sample distribution given R and |Σ|—and we note that the exponent shrinks only marginally when R < |Σ|.
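A minimal numerical sketch of such an estimate (alphabet size, subset size and exponent are illustrative, and this is not the exact procedure behind the k = 2.91 quoted in Section 6):

```python
import numpy as np

rng = np.random.default_rng(3)

def power_law(R, k):
    p = np.arange(1, R + 1, dtype=float) ** -k
    return p / p.sum()

def effective_exponent(sigma_size, R, k, trials=20000):
    """Fit a power-law exponent to the average rank distribution obtained by
    keeping a uniformly random subset of R out of sigma_size tokens."""
    p_full = power_law(sigma_size, k)
    mean_ranked = np.zeros(R)
    for _ in range(trials):
        keep = rng.choice(sigma_size, size=R, replace=False)
        sub = np.sort(p_full[keep])[::-1]
        mean_ranked += sub / sub.sum()
    mean_ranked /= trials
    # least-squares slope of log p versus log rank gives -k_eff
    slope, _ = np.polyfit(np.log(np.arange(1, R + 1)), np.log(mean_ranked), 1)
    return -slope

print(effective_exponent(sigma_size=30, R=5, k=3.03))    # a little below 3.03
```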

In this light and to simplify the runtime analysis, we therefore assume the decoder accepts exactly R tokens at all times during the parsing process (like an R-ary tree over hypotheses) with a resulting product-of-powerlaw distribution, and give the runtimes in terms of the branching ratio, and not in terms of the alphabet’s size. This indeed yields a fair runtime for comparison with a classical variant, since any classical algorithm will also have the aforementioned advantage (i.e. we assume the size of the final set of elements to search over is \(R^{n}\), which precisely corresponds to the number of paths down the R-ary tree).

4.1 Most likely parse: query bound

In this case F simply returns pq as the score in Eq. (3). If x labels the highest-mass index of the probability density function (neglecting the ancillary states in Eq. (3) for simplicity), it suffices to calculate the state overlap |〈x|μ〉|. By Eq. (4), we then have \(|{\langle {x}|{\mu }\rangle }|^{2} = H_{R}^{-n}(k)\). The claim of Corollary 1 follows from these observations.

4.2 Highest score parse: simple query bound

We aim to find a top element scored under some function F under the promise that |μ〉 (given in Eq. (3)) presents good advice on where to find it, in the sense of Eq. (2). The expected runtimes for various power law falloffs k can be obtained by taking the expectation with respect to px as in Montanaro (2011).

In order to do so, we need to be able to calculate expectation values of the Cartesian product of power law random variables, where we restrict the domain to those elements with probability above some threshold. We start with the following observation.

Lemma 2

If QuantumSearchDecode receives as input iid random variables X1,…,Xn, with \(X_{i}\sim \text {Power}_{R}(k)\), then the number of queries required to the parser is \( \text {RT}_{1}(R,k,n) = \mathrm {O}\left (H_{R}(k/2)^{n} / H_{R}(k)^{n/2} \right ). \)

Proof

The expectation value of 1/〈x|μ〉 is straightforward to calculate; writing r = (r1,…,rn), by Eq. (4), we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E}\left[\frac{1}{\langle{x}|{\mu}\rangle}\right] &=& \sum\limits_{r_{1},\ldots,r_{n}=1}^{R} p(r)\,\frac{1}{\sqrt{p(r)}} = \sum\limits_{r_{1},\ldots,r_{n}=1}^{R} \sqrt{p(r)} \\ &=& \frac{1}{H_{R}(k)^{n/2}} \left( \sum\limits_{r=1}^{R} \frac{1}{r^{k/2}} \right)^{\!n} = \frac{H_{R}(k/2)^{n}}{H_{R}(k)^{n/2}}. \end{array} $$

As \(\mathrm {O}(\min \limits \{ 1/\langle {x}|{\mu }\rangle , \sqrt n \}) \le \mathrm {O}(1/\langle {x}|{\mu }\rangle )\) the claim follows. □

We observe that the runtime in Lemma 2 is exponential in n. Nevertheless, as compared to a Grover algorithm—with runtime \(R^{n/2}\)—the base is now dependent on the power law’s falloff k. We can compare the runtimes if we rephrase \(\text {RT}_{1}(R, k, n) = R^{nf(R,k)}\), by calculating

$$ \begin{array}{@{}rcl@{}} \left( \frac{H_{R}(k/2)}{H_{R}(k)^{1/2}} \right)^{n} &=& R^{n f(R,k)} \\ \Longleftrightarrow f(R,k) &=& \log\left.\left( \frac{H_{R}(k/2)}{H_{R}(k)^{1/2}} \right)\right/\log R. \end{array} $$

We observe that the exponent f(R, k) ∈ (0,1/2), i.e. it is always faster than Grover, and always more than quadratically faster than classically. The exponent’s precise dependency on k for a set of alphabet sizes R is plotted in Fig. 1. For growing k, f(R, k) falls off exponentially.
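The exponent is cheap to evaluate numerically; the following short sketch (parameter grid chosen arbitrarily) reproduces the qualitative behaviour shown in Fig. 1:

```python
import numpy as np

def H(R, k):
    """Generalised harmonic number H_R(k) = sum_{i=1}^R i^(-k)."""
    return np.sum(np.arange(1, R + 1, dtype=float) ** -k)

def f(R, k):
    """Runtime exponent of Theorem 2: RT_1(R, k, n) = O(R^(n f(R, k)))."""
    return np.log(H(R, k / 2) / np.sqrt(H(R, k))) / np.log(R)

for R in (5, 20, 100):
    print(R, [round(f(R, k), 3) for k in (0.5, 1.0, 2.0, 4.0, 8.0)])
# every value stays below 1/2 and falls off quickly as k grows
```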

4.3 Highest score parse: full query bound

A priori, it is unclear how much we lose in Lemma 2 by upper-bounding \(\mathrm {O}(\min \limits \{ 1/\langle {x}|{\mu }\rangle , \sqrt n \})\) by O(1/〈x|μ〉)—so let us be more precise. In order to evaluate the expectation value of the minimum, we will break up the support of the full probability density function p(r) into a region where \(p(r) > 1/R^{n}\), and its complement. Then, for two constants C1 and C2, we have for the full query complexity

$$ \begin{array}{@{}rcl@{}} C_{1} \sum\limits_{r:\, p(r) > R^{-n}} \sqrt{p(r)} \;+\; C_{2}\, R^{n/2} \sum\limits_{r:\, p(r) \le R^{-n}} p(r). \end{array} $$
(5)

In order to calculate sums over sections of the pdf p(r), we first move to a truncated Pareto distribution by making the substitutions

$$ \begin{array}{@{}rcl@{}} \underset{r\in A}{\sum} \frac{1}{r^{k}} \longrightarrow {\int}_{A} \frac{1}{r^{k}} \mathrm{d} r,\ H_{R}(k) \longrightarrow h_{R}(k) := {{\int}_{1}^{R}} \frac{1}{r^{k}}\mathrm{d} r. \end{array} $$

While this does introduce a deviation, its magnitude is minor, as can be verified numerically throughout (see Fig. 4, where we plot both RT1 and the continuous variant \(\text {RT}_{1'}(R,k,n):={h_{R}^{n}}(k/2)/h_{R}^{n/2}(k)\)).

The type of integral we are interested in thus takes the form

$$ M_{c,n}^{R,k_{1},k_{2}} \!:=\! \frac{1}{{h_{R}^{n}}(k_{1})}{\iiint_{1}^{R}}\frac{\chi(r_{1}{\cdots} r_{n}\!\le\! c)}{(r_{1}{\cdots} r_{n})^{k_{2}}} \mathrm{d} r_{1}{\cdots} \mathrm{d} r_{n}, $$
(6)

where k1 is not necessarily equal to k2, and typically \(c=(R/h_{R}(k_{1}))^{n/k_{1}}\), which would reduce to the case we are seeking to address in Eq. (5). Here, χ(⋅) denotes the characteristic function of a set, i.e. it takes the value 1 where the premise is true, and 0 otherwise. We derive the following closed-form expression.

Lemma 3

For \(k_{2} \neq 1\), Eq. (6) becomes

$$ \begin{array}{@{}rcl@{}} M_{c,n}^{R,k_{1},k_{2}} &=& \frac{(-1)^{n}}{k^{\prime n} {h_{R}^{n}}(k_{1})}\!\!\! \sum\limits_{j=0}^{\min\{n,\lfloor c^{\prime}/a^{\prime}\rfloor\}} \binom{n}{j}\\ &&\times \left( \mathrm{e}^{a^{\prime}k^{\prime}j} - \mathrm{e}^{-c^{\prime}k^{\prime}}\sum\limits_{l=0}^{n-1}\frac{(a^{\prime}k^{\prime}j-c^{\prime}k^{\prime})^{l}}{l!} \right), \end{array} $$

where \(k^{\prime }=1-k_{2}\), \(c^{\prime }=\log c\), \(a^{\prime }=\log R\).

Proof

See Appendix 4. □
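Closed forms such as Lemma 3 are straightforward to cross-check numerically; the following rough Monte Carlo sketch estimates the integral in Eq. (6) directly (sample count and parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

def h(R, k):
    """Continuous analogue h_R(k) = integral_1^R r^(-k) dr of the harmonic number."""
    return np.log(R) if k == 1 else (R ** (1 - k) - 1) / (1 - k)

def M_monte_carlo(R, k1, k2, n, c, samples=200_000):
    """Monte Carlo estimate of Eq. (6): the normalised integral of
    chi(r_1 ... r_n <= c) / (r_1 ... r_n)^k2 over the box [1, R]^n."""
    r = rng.uniform(1.0, R, size=(samples, n))
    prod = np.prod(r, axis=1)
    integrand = np.where(prod <= c, prod ** -k2, 0.0)
    volume = (R - 1.0) ** n                  # volume of the integration box
    return volume * integrand.mean() / h(R, k1) ** n

R, k, n = 5, 2.91, 8
c = (R / h(R, k)) ** (n / k)                 # the typical threshold given below Eq. (6)
print(M_monte_carlo(R, k, k, n, c))
```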

5 Quantum beam search decoding

The goal of this section is to modify the QuantumSearchDecoder such that it behaves more akin to a classical beam search algorithm. More specifically, instead of searching for the top scored element which could sit anywhere within the advice distribution, we make the assumption that wherever the advice probability lies below some threshold p(x) < p0—where p0 can be very small—we discard those hypotheses. This is done by dovetailing a few rounds of amplitude amplification to suppress all beam paths with probability less than p0 (which we can do, since we have those probabilities written out as numbers within the advice state |μ〉 in Eq. (3)); a schematic of the algorithm can be found in Algorithm 2.

Of course we only want to do this if the number of amplification rounds, given as the square root of the inverse of the leftover probability \({\sum }_{x:p(x)\ge p_{0}}p(x)\), is small (i.e. constant, or logarithmic in n). We note that this expression is, as before, well-approximated by \(M_{p_{0},n}^{R,k,k}\) given in Lemma 3.

Algorithm 2

In beam search, only the top scoring hypotheses are kept around at any point in time; the difference from our method is, of course, that we can score the elements after every hypothesis has been built. This is not possible in the classical case, since it would require an exponential amount of memory, or postselection. As in Section 3, we have the two cases of finding the top scoring path and the most likely parse. Deriving a runtime bound for Most Likely Parse is straightforward—and does not, in fact, gain anything. This is because when finding the maximum-likelihood path τ, one performs amplitude amplification on that element anyhow, and p(τ) > p0—so it is within the set of elements with probability kept intact by the post-amplification.

The only interesting case of amplifying the advice state in QuantumSearchDecode to raise it to a beam search variant is thus for the case of Highest Score Parse, using the decoder’s output as advice distribution. Instead of listing a series of results for a range of parameters, we provide an explicit example of this analysis with real-world parameters derived from Mozilla’s DeepSpeech neural network in the next section, and refer the reader to Appendix 6 for a more in-depth analysis of variants of a constant and non-constant amount of post-amplification.

6 DeepSpeech

6.1 Analysis of the output rank frequency

To support the applicability of our model, we analysed the hypothesis that the output probabilities of an LSTM used to transcribe voice to letters—which can then be used, e.g., in a dialogue system with an underlying parser—are distributed in a power-law fashion. More specifically, we use DeepSpeech, Mozilla’s implementation of Baidu’s DeepSpeech speech recognition system (Hannun et al. 2014; Mozilla 2019b); our data supports this hypothesis (see Appendix 8, also for a discussion of the LSTM’s power-law output—a model feature—vs. the power-law nature of natural language features).
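The following is a sketch of the kind of rank-frequency fit underlying this analysis (the data here are synthetic stand-ins; DeepSpeech's actual per-frame output probabilities would be substituted for frame_probs):

```python
import numpy as np

rng = np.random.default_rng(11)

def fit_rank_exponent(frame_probs, top=20):
    """Fit k in Pr(rank r) ~ r^(-k) to the mean rank-frequency curve of the
    per-frame output distributions (each row of frame_probs sums to one)."""
    ranked = np.sort(frame_probs, axis=1)[:, ::-1]       # sort every frame by rank
    mean_curve = ranked.mean(axis=0)[:top]
    slope, _ = np.polyfit(np.log(np.arange(1, top + 1)), np.log(mean_curve), 1)
    return -slope

# synthetic stand-in for real output frames: power-law softmax rows over a
# 29-symbol alphabet, with the symbol order permuted independently per frame
alphabet, frames, k_true = 29, 500, 3.0
base = np.arange(1, alphabet + 1, dtype=float) ** -k_true
base /= base.sum()
frame_probs = np.stack([rng.permutation(base) for _ in range(frames)])
print(fit_rank_exponent(frame_probs))                    # recovers k ≈ 3.0
```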

6.2 Runtime bounds for quantum beam search decoding

We take the power law exponent derived from Mozilla’s DeepSpeech neural network, k = 3.03 (cf. Appendix 8), and derive runtime bounds for decoding its output with a parser under the assumption that, on average, we take R = 5 branches in the parsing tree at every time step. As discussed in Section 4, the sampling distribution over five elements only yields a slightly lower exponent of k = 2.91. How does quantum beam search perform in this setting, and how many hypotheses are actually searched over? And what if we fix the beam’s width to a constant, and increase the sequence length? We summarise our findings in Figs. 2, 7, and 8.

7 Summary and conclusions

We have presented a quantum algorithm that is modelled on and extends the capabilities of beam search decoding for sequences of random variables. Studies of context sensitivity of language models have shown that state-of-the-art LSTM models are able to use about 200 tokens of context on average while working with standard datasets (WikiText2, Penn Treebank) (Khandelwal et al. 2018); state-of-the-art transformer-based methods level off at a context window of size 512 (Al-Rfou et al. 2019). On the other hand, under the premise of biased input tokens, our quantum search decoding method is guaranteed to find—with high constant success probability—the global optimum, and it can do so in expected runtime that is always more than quadratically faster than possible classically. As demonstrated empirically (cf. Fig. 2), our quantum beam search variant features a runtime independent of the sequence length: even for token sequences of length > 500 the top \(10^{14}\) global hypotheses can be searched for an optimal prediction, within \(10^{7}\) steps.

We have further shown that neural networks used in the real world—concretely DeepSpeech—indeed exhibit a strong power law distribution on their outputs, which in turn supports the premise of our algorithm; how the performance scales in conjunction with a native recurrent quantum neural network such as in Bausch (2020) is an interesting open question.